Performance Driven Development, footnotes and asides

Ever since I wrote up some thoughts on Performance Driven Development (http://ratlperfland.com/performance-driven-development/), I have been treating it like a real thing. It’s reassuring to see it climb toward the top of Google search results, though that may just be folks in my organization trying to figure out what I’m talking about.

I can’t lay claim to being the first to string together “performance,” “driven” and “development,” even if I did have a specific intent. Those three words have been grouped together before, sometimes even in the way that I intended. Most prior authors seem to coalesce around the idea that PDD is about developers creating performant products. However, there are two subtle but distinct camps: 1) those who think that PDD is achieved through late-phase tooling and testing (in my opinion, this is noble, sometimes meaningful work, but antiquated and inefficient); and 2) those who envision redefining a whole organization’s development goals by positioning performance design at the center of what they do. Yes, forget features and function: no one will use it if it isn’t fast, and you can’t fix slowness unless you’ve already built in the ability to measure.

In an article published in 2009, Rajinder Gandotra nicely expresses the latter point, anticipating my definition of PDD: “The need is to move to the next paradigm of software engineering that can be called as ‘performance driven development.’

Performance focus through a performance engineering governance office is suggested which would help govern performance right from the business requirements gathering stage till go-live stage in production.” (SETLabs Briefings, Vol. 7, No. 1, 2009, pp. 75-59. Ironically, the easily accessible pdf is neither dated nor useful as a citation: http://www.infosys.com/IT-services/application-services/Documents/performance-driven-development.pdf.)

A performance engineering governance office might not imply actually producing instrumented code, but it does suggest end-to-end performance requirements. Given that I’m thinking of PDD along the lines of TDD and BDD, the idea of an office suggests that performance governance occurs in parallel or from the outside, not within and integral to the team as we might try to do with Agile.

An unsigned article from May 7, 2012 (I gave up looking for an author: http://cargocultsoftwaredevelopment.blogspot.com/2012/05/stuff-that-works-performance-driven.html) relates PDD to BDD and TDD as I have done, and proposes iteratively revealing a product’s internal KPIs so that they are exposed and visible and so that the organization can make conscious decisions about what needs improvement or investigation. Sign me up. This goes back to the point that developers will tend to investigate outliers or strange patterns if they are concretely measured or made visible. There’s also a blog post out there (http://trelford.com/blog/post/PDD.aspx, dated the same day) with a nice Max Headroom graphic that summarizes the anonymous author’s points, including creating and updating a dashboard that exposes KPIs. There is a lot of similarity to what I have been proposing in this post. A whole organization thinking about performance can be pretty powerful, and if a team creates KPI dashboards, that could be the way to ensure that software performance remains foremost.

(Speaking of outliers, what’s the chance that these two items showing the same date are not somehow related?)

Others who have invoked Performance Driven Development seem content to apply it in isolation, as a way to improve an individual’s process, which in turn would (idealistically) lead to product improvements as a whole. For example, Michael J. A. Smith (http://research.microsoft.com/en-us/events/2007summerschool/michaelsmith.pdf) uses “Performance-Driven Development” as a title to describe a process of improving code through models and simplification. Honestly, this single-page pdf reads more like a poster illustrating an idea than a specific design pattern for creating software where performance considerations are foremost. I admit that I may be missing its point entirely.

And there’s a tool out there, http://www.cubrid.org/wiki_ngrinder/entry/peformance-driven-development-by-ngrinder, whose descriptive materials take the phrase “Performance Driven Development” out for a walk. I am not clear on what ngrinder actually does or how it works.

I can’t lay claim to Performance Driven Development as an original thought, but I maintain it is a very compelling way to tackle performance in software. Stay tuned for specific examples that have worked for our organization.

 

Performance Driven Development

Say what you will about Test Driven Development (TDD), but it was an eye-opener for some of us the first time we heard about it. Sure, practically speaking it’s not always going to happen despite the best of intentions, but it did reframe how some developers went about their work and how some organizations ensured a modest, or even slightly increased, amount of testing earlier in the lifecycle. At an abstract level, thinking about TDD improved quality outcomes regardless of the tactical steps.

Recently, trying to describe the characteristics and skills needed for a new development team, I joked that we needed Performance Driven Development (PDD). I got some light agreement (mostly of the smiling emoticon variety), and at least one “Aha, that’s an awesome idea!” for each “That’s totally stupid. You clearly don’t understand either development or performance.” Yeah, OK. Sometimes good ideas seem stupid at first.

(Of course sometimes stupid ideas seem stupid at first but smart later, but then again, I probably shouldn’t point that out.)

Well, as a thought experiment, Performance Driven Development is a chance to express several qualities I want in the new organization we’re trying to build. We want performance engineering integrated into the product creation experience. Not a throw-it-over-the-wall afterthought. We do not want a separate performance team, self-organizing, working on a not-current release, using a synthetic workload, and employing custom-made testing tools. Yeah, we’ve all done that. Some of us got really good at it. There’s a lot of work to do in that environment, lots of good problems to find and solve. But it’s frustrating after a while.

Let’s consider the end result in those cases. Performance really was an afterthought. Code was baked and delivered, and then a team would come along later to “test at scale” or do “load testing.” A necessary, scary bunch of tasks was delayed to the end, when it was perhaps too late to do anything about it. Kind of like the way testing was before TDD.

If we’re trying to improve performance, TDD becomes an instructive model. We want the performance specialists (no longer an independent team) to be integrated into the development team, we want them to work on the most current release, we want them to use real customer workloads, and we want the tools they use to be part of the development experience or easy (easy-ish) to use off-the-shelf tools.

Performance Driven Development means that at every step, performance characteristics are noted and then used to make qualitative decisions. This means that there is some light measurement at the unit level, at the component level, etc., and that measurements occur at boundaries where one service or component hands off to another, and that the entire end-to-end user experience is measured as well.

(Skeptics will note that there’s no mention of load, scale or reliability yet, so bear with me.)

The first goal is that the development team must build not just features but the capability to measure lightly. By measure, I mean timestamps or durations spent in calls and, where appropriate and relevant, payload size (how much data passed through, at what rate, etc.). By light, I mean really simple logging or measurement that occurs automatically and doesn’t require complex debug flags set in an .ini file or a special recompilation. The goal is that from release to release, as features change, it will be possible to note whether execution has slowed or sped up and, relatedly, whether the code is transferring more or less data.
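To make that concrete, here’s a minimal sketch (in Python, purely for illustration; the measure decorator and logger names are invented, not any particular product’s instrumentation) of what always-on light measurement could look like:

```python
# A minimal sketch of always-on "light measurement": every call gets a
# duration and, where it is cheap to compute, a payload size in the log.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("perf")

def measure(func):
    """Log call duration and rough payload size; no debug flags, no recompile."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        size = len(result) if hasattr(result, "__len__") else -1
        log.info("%s duration=%.6fs payload=%s", func.__name__, elapsed, size)
        return result
    return wrapper

@measure
def fetch_records(count):
    # Stand-in for real work; imagine a query or a transformation here.
    return list(range(count))
```

Comparing those log lines release over release is the whole point: slower or chattier code shows up as a diff, not a surprise.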

Unit tests would not just indicate { pass / fail }, but when run iteratively would report { min / avg / median / max / std } values for call time or response time. Of course we’re talking about atomic transactions, but then again, tasks that are lengthy should be measured as well. (Actually, lengthy tasks demand light measurement to understand what’s occurring underneath: the rate of a single-threaded transformation, or the depth of a queue and how it changes over time, etc.)
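As a sketch of what that could look like (the timed_runs helper is hypothetical; any xUnit framework could host the same idea):

```python
# Run a unit-level operation repeatedly and report timing statistics
# alongside the usual pass/fail signal from its assertions.
import statistics
import time

def timed_runs(fn, iterations=30):
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # assertions inside fn still raise on failure
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "avg": statistics.mean(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "std": statistics.stdev(durations),
    }

# e.g. stats = timed_runs(lambda: sorted(range(100_000)))
```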

I’ve also been talking about performance boundaries. These are the points where different systems interact: where data crosses an integration point, where a database hands off to an app server, where a mobile client transmits to a web server, or, in complex systems, where ownership may change. These boundaries need to be understood and documented, and crucially, light measurement must occur at those points (timestamps, payload, etc.).
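As a hedged illustration, boundary measurement might look like the following, where send_to_app_server and transport are hypothetical stand-ins for whatever hand-off (HTTP, queue, RPC) your system actually makes:

```python
# Instrument a hand-off point: record timestamp, duration, and payload
# sizes in both directions every time data crosses the boundary.
import json
import logging
import time

log = logging.getLogger("boundary")

def send_to_app_server(payload, transport):
    body = json.dumps(payload).encode("utf-8")
    start = time.time()
    response = transport(body)  # the actual crossing of the boundary
    elapsed = time.time() - start
    log.info("boundary=db->app ts=%.3f duration=%.4fs bytes_out=%d bytes_in=%d",
             start, elapsed, len(body), len(response))
    return json.loads(response)

# e.g. send_to_app_server({"q": "status"}, transport=lambda b: b)  # echo stub
```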

Finally, the end-to-end experience must be measured. This may align most closely with the majority of performance tools in the market (like LoadRunner and Rational Performance Tester), which record and play back HTTP/HTTPS traffic, simulating end-user behavior in a browser. There are many ways to measure the end-to-end user experience.

In the PDD context, the performance engineer is responsible for ensuring that light measurement exists at the unit, component and boundary levels. The performance engineer assists the team in building the tools and communal expertise to capture and collect such measurements. The performance engineer leads the charge to identify when units or components become slower or when the amount of data passing through them has increased. The performance engineer will also need to use tools like Selenium, JMeter and Locust to produce repeatable actions, of either the single-user or multiple-user variety, that can be measured.
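For instance, a repeatable, measurable user action in Locust can be as small as this (a sketch assuming a recent Locust release; the endpoint and class name are placeholders):

```python
# locustfile.py: a minimal repeatable action that Locust can run as one
# user or as many; Locust records response times for every request.
from locust import HttpUser, task, between

class BrowsingUser(HttpUser):
    wait_time = between(1, 3)  # seconds of think time between tasks

    @task
    def load_home_page(self):
        self.client.get("/")
```

Run it single-user to get a clean baseline, then scale up the user count to watch how the same measured action degrades.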

I know plenty of really good developers who, once you show them conclusive evidence that the routine they just finished polishing now adds 0.01 second to a common transaction, will go back to the drawing board to get that 0.01 second back and more. They’ll grumble at first, but they’ll be grateful afterwards. Building that type of light measurement can’t be done after the fact. It must be designed in from the beginning. And if I’m being idealistic about a super-developer discovering extra time in code, then we all realize that enhancing features intrinsically gets expensive. Adding more features generally makes things slower; that’s a given tradeoff.

And so what about load and reliability and scalability and all those other things? I’ve spent a lot of time working with complex and messy systems. Solving performance problems and building capacity planning models invariably involves defining and building synthetic workloads. I don’t know how many times I’ve been in situations where the tested synthetic workload is nothing at all like what occurs in real life. I have grown convinced that dedicating a team of smart people to devising a monstrous synthetic workload is misdirected. Yes, interesting problems will be discovered, but many of them will not be relevant to real life. Sure, this work creates interesting data sheets and documents, but rarely does it yield useful tools for solving problems in the wild.

(You could argue that the performance engineer quickly learns the tools to solve these problems in the wild, which actually is a benefit of having a team doing this work. However, you’re not improving the root cause, you’re actually training a SWAT team to solve problems you’ve made difficult to diagnose to begin with.)

So then how do you simulate real workloads in a test context? I honestly think that’s not the problem to solve anymore. I know this terrifies some people:

“What do you mean you can’t tell me if it will support 100 or 1000 users?!”

I know that some executives will think that without these user numbers the product won’t sell, that customers won’t put money down for something that will not support some imagined load target. (I’ve muttered about how silly it is to measure performance in terms of users elsewhere.)

Given that most of our workloads involve the cloud, and that even when they don’t, increasing virtual or physical hardware capacity is relatively inexpensive, the problem becomes:

“How does overall performance change as I go from 100 to 200 users, and what can monitoring tell me about what that change looks like so I can predict the growth from N to N+100 users?”
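As a back-of-the-envelope sketch of that kind of prediction (the data points below are invented, and real systems are rarely this linear, but the shape of the exercise is the point):

```python
# Fit measured (users, response time) points and extrapolate to N+100 users.
import numpy as np

users = np.array([100, 150, 200, 250])
resp_s = np.array([1.2, 1.5, 1.9, 2.4])  # mean response times from monitoring

slope, intercept = np.polyfit(users, resp_s, 1)  # crude linear model

def predict(n):
    return slope * n + intercept

print(f"predicted response time at 350 users: {predict(350):.2f}s")
```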

Yes you want your car to go fast, but if you mean to keep that car a long time, you learn how it behaves (the sounds, the responsiveness) as it moves from second gear to third gear, or fourth gear to fifth gear.

The problem we need to solve in the performance domain isn’t how to drive more load (although it can still be fun to knock things over); the problem to solve is:

“How do you build a complex system that reports meaningful measurement and monitoring information, and is also serviceable, so that as usage grows, light measurement provides a clear indication of how the system is behaving in real time under a real workload?”
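One hedged way to approach that last clause is to export the light measurements so they are visible in real time. This sketch uses the Prometheus Python client purely as an example; any metrics pipeline would do:

```python
# Expose a live timing metric over HTTP so operators can watch behavior
# under a real workload; prometheus_client is one off-the-shelf option.
import random
import time

from prometheus_client import Summary, start_http_server

REQUEST_TIME = Summary("request_seconds", "Time spent handling a request")

@REQUEST_TIME.time()  # records a count and a running sum of durations
def handle_request():
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/
    while True:
        handle_request()
```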

This is what I’m trying to get at with the concept of Performance Driven Development. (I won’t lay a claim to creating the phrase. It’s been out in the wild for a while. I’m just putting forth some definitions and concepts which I think it should stand for.)

Field notes: It isn’t always about virtualization, except when it is

Talking recently with some customers, we discussed the fallacy of always trying to solve a new problem with the same method that solved the preceding problem. Some might recall the adage, “If you only have a hammer, every problem looks like a nail.”


I am often asked to help comprehend complex performance problems our customers encounter. I don’t always have the right answer, and usually by the time the problem gets to me, a lot of good folks who are trained in solving problems have spent a lot of time trying to sort things out. I can generally be counted upon for a big-picture perspective, some non-product ideas and a few other names of folks to ask once I’ve proved to be of no help.

A recent problem appeared to have no clear solution. The problem was easy to repeat and thus demonstrable. Logs didn’t look out of the ordinary. Servers didn’t appear to be under load. Yet transactions that should be fast, say well under 10 seconds, were taking on the order of minutes. Some hands-on testing had determined that slowness increased proportionally with the number of users attempting to do work (a single user executing a task took 30-60 seconds, two users at the same time took 1 minute to 90 seconds, three users took 2-3 minutes, etc.).

So I asked whether the environment used virtualized infrastructure, and if so, could we take a peek at the settings.

Yes, the environment was virtualized. No, they hadn’t looked into that yet. But yes, they would. It would take a day or two to reach the folks who could answer those questions and explain to them why we were asking.

But we never did get to ask them those questions. Their virtualization folks took a peek at the environment and discovered that the entire configuration of five servers was sharing the processing power customarily allocated to a single server. All five servers were sharing 4 GHz of processing power. They increased the resource pool to 60 GHz and the problem evaporated.

I can’t take credit for solving that one. It was simply a matter of time before someone else would have stepped back and asked the same questions. However, I did write it up for the deployment wiki. And I got to mention it here.

Be even smarter with virtualization

It took a bit of unplanned procrastination, but we finally got to the second and third parts of our in-depth investigation of virtualization as it relates to IBM Rational products.


Part two is now published here as Be smart with virtualization: Part 2. Best practices with IBM Rational Software.

Part three lives on the deployment wiki here as Troubleshooting problems in virtualized environments.

Part two presents two further case studies and a recap of the principles explored in Part 1. We took a stab at presenting the tradeoffs among different virtualization configurations. Virtualization is becoming more prevalent because it’s a powerful way to manage resources and squeeze efficiency from hardware. Of course there are balances to strike, and Part two goes a bit deeper.

Part three moves to the deployment wiki and offers some specific situations we’ve solved in our labs and with customers. There are also screenshots of one of the main vendors’ tools, which can guide you in identifying your own settings.

Our changing audience

Really, who are you?
Not yet introducing the new jazz deployment wiki

In our corner of the world, some of us in the Jazz Jumpstart universe are wondering who will spill the beans and mention the new jazz deployment wiki first. I don’t think it will be me.

We’re all working on a new way for the Jazz ecosystem to present information, specifically deployment information. Not just “Insert tab A into slot B” types of material, but the more opinionated, specific stuff you’ve told us you want to hear. We have folks working on Monitoring, Integrating, Install and Upgrade, and other deployment topics. I own the Performance Troubleshooting section.

When the wiki actually rolls out (and I can talk about it), I’ll describe some of the structure and design questions we wrestled with. For now I want to talk about one of the reasons why we’re presenting information differently, and that’s because we think our audience has changed.

IT used to be simple

OK, so maybe IT was never actually that simple, but it was certainly a lot easier to figure out what to do. One of IBM Rational’s strengths is that we’ve built strong relationships with our customers over the years. Personally, a lot of the customers I know (and who I think know me) started out as ClearCase or ClearQuest admins and have evolved over time into Jazz/CLM admins. Back when, there was pretty much a direct relationship with our product admins, who in turn knew their end users and had ownership of their hardware environments.

This picture describes what I’m talking about (these images come from a slide deck we built in 2011 to talk about virtualization, parts of which live on elsewhere, but the pics are too good to abandon):

[Figure: the old IT model]

The relationship between Rational Support / Development and our customers remains strong and direct. Over the years, it’s the context of our customers’ product administrators that has shifted in many cases:

[Figure: the new IT model]

Consolidation, regulation, governance, compliance, etc., have all created additional IT domains which are often outside the customers’ product administration. There are cases where our relationship with our customers’ product administrators remains strong but we’ve lost sight of their context.

Here’s another way to look at the old model, specifically around hardware ownership:

[Figure: the old hardware ownership model]

Back in the day, our customers’ product administrators would request hardware, say a Solaris box (yes, I am talking about many years ago…), the hardware would arrive, and the Rational product admin would get root privileges and start the installation. Nowadays, the hardware might be a VM, and there might be all sorts of settings the admin can’t control, such as security, the database, or, as is pertinent to this example, the VMs.

[Figure: the new hardware ownership model]

 

This is a long-winded way to say that we’re well aware we have multiple audiences and need to remember that product administrators and IT administrators may no longer be the same people. Loving a product and managing how it’s used isn’t quite the same as it used to be. We’re trying to get better at getting useful information out there, which is one of the reasons for the new deployment wiki.

 

Virtualization demystified

Read about Rational’s perspective on virtualization over at IBM developerWorks

For the IBM Innovate 2011 conference, the Rational Performance Engineering team presented some of its research on virtualization. We had an accompanying slide deck too, and called it the Rational Virtualization Handbook.

It’s taken a bit of time, but we have finally fleshed out the slides and written a proper article.

Actually, the article has stretched into two parts, the first of which lives at Be smart with virtualization. Part 1. Best practices with IBM Rational software. Part 2 is in progress and will contain further examples and some troubleshooting suggestions. We also have a topic lined up that could make a third part, but there’s a lot of work ahead, so I can’t say for sure.

I’m tempted to repost large excerpts because I’m proud of the work the team did. Converting slideware into a real article took a bit longer than expected, and the article took a lot of work. I won’t give away the secrets here… You’ll have to check out IBM developerWorks yourself. However, let me kickstart things with a history sidebar:

A brief history of virtualization

Despite its emergence as a compelling, necessary technology in the past few years, server virtualization has actually been around for quite some time. In the 1970s, IBM introduced hypervisor technology in the System z and System i® product lines. Logical partitions (LPARs) became possible on System p® in 2000. The advent of virtual machines on System x and Intel-based x86 hardware was possible as early as 1999. In just the last few years, virtualization has become essential and inevitable in Microsoft Windows and Linux environments.

What products are supported in virtualized environments?

Very often we explain that asking whether a particular Rational product is supported with virtualization isn’t actually the right question. Yes, we’ve run on Power hardware and LPARs for several years now. Admittedly, KVM and VMware are newer to the scene. Some may recall how clock drift could really mess things up, but those problems seem to be behind us.

The question isn’t whether Rational products are supported on a particular flavor of virtualization: If we support a particular Windows OS or Linux OS, then we support that OS whether it’s physical or virtualized.

Virtualization is everywhere

Starting in 2010 at Innovate and other events, we routinely asked folks in the audience whether they were aware their organizations were using virtualization (the platform didn’t matter). In 2010 and 2011 we got a few hands, maybe two or three in a room of 20. Folks were asking us if virtualization was supported. Was it safe? Could they use it? What were our suggestions?

Two years later, in 2012, that ratio was reversed: Nearly every hand in the audience shot up. We got knowing looks from folks who had disaster stories of badly managed VMs. There were quite a few people who had figured out how to manage virtualization successfully. There were questions from folks looking for evidence and our suggestions to take back to their IT folks.

Well, finally, we have something a bit more detailed in print.