Performance Driven Development, footnotes and asides

Ever since I wrote up some thoughts on Performance Driven Development (http://ratlperfland.com/performance-driven-development/), I have been treating it like a real thing. It’s reassuring to see it climb towards the top in Google search results, but that may be the result of folks in my organization trying to figure out what I’m talking about.

I can’t lay claim to being the first to string together “performance,” “driven” and “development,” even if I did have a specific intent. Those three words have been grouped together before, sometimes even in the way that I intended. Most prior authors seem to coalesce around the idea that PDD is about developers creating performant products. However, there are two subtle but distinct camps: 1) Those who think that PDD is achieved through late-phase tooling and testing (in my opinion, this is noble, sometimes meaningful work, but antiquated and inefficient); and 2) Those who envision redefining a whole organization’s development goals by positioning performance design at the center of what they do. Yes, forget features and function: No one will use it if it isn’t fast, and you can’t fix slowness unless you’ve already built in the ability to measure.

In an article published in 2009, Rajinder Gandotra nicely expresses the latter point, anticipating my definition of PDD: “The need is to move to the next paradigm of software engineering that can be called as ‘performance driven development.’

Performance focus through a performance engineering governance office is suggested which would help govern performance right from the business requirements gathering stage till go-live stage in production.” (SETLabs Briefings, Vol. 7, No. 1, 2009, pp. 75-59. Ironically, the easily accessible PDF is neither dated nor useful as a citation: http://www.infosys.com/IT-services/application-services/Documents/performance-driven-development.pdf.)

A performance engineering governance office might not imply actually producing instrumented code, but it does suggest end-to-end performance requirements. Given that I’m thinking of PDD as part of the TDD and BDD family, the idea of an office suggests that performance governance occurs in parallel or from the outside, not within and integral to the team, as we might try to do with Agile.

An unsigned article (I gave up looking for an author) http://cargocultsoftwaredevelopment.blogspot.com/2012/05/stuff-that-works-performance-driven.html from May 7, 2012, relates PDD to BDD and TDD as I have done, and proposes iteratively exposing a product’s internal KPIs so that they are visible and the organization can make conscious decisions about what needs improvement or investigation. Sign me up. This goes back to the point that developers will tend to investigate outliers or strange patterns if they are concretely measured or made visible. There’s also a blog post out there (http://trelford.com/blog/post/PDD.aspx, dated the same day) with a nice Max Headroom graphic that summarizes the anonymous author’s points, including creating and updating a dashboard that exposes KPIs. There is a lot of similarity to what I have been proposing in this post. A whole organization thinking about performance can be pretty powerful, and if a team creates KPI dashboards, that could be the way to ensure that software performance remains foremost.

(Speaking of outliers, what’s the chance that these two items showing the same date are not somehow related?)

Others who have invoked Performance Driven Development seem content to think about applying it in isolation, as a way to improve an individual’s process, which in turn would (idealistically) lead to product improvements as a whole. For example, Michael J. A. Smith (http://research.microsoft.com/en-us/events/2007summerschool/michaelsmith.pdf) uses “Performance-Driven Development” as a title to describe a process of improving code through models and simplification. Honestly, this single-page PDF is more like a poster illustrating an idea than a specific design pattern for creating software where performance considerations are foremost. I admit that I may be missing its point entirely.

And there’s a tool out there, http://www.cubrid.org/wiki_ngrinder/entry/peformance-driven-development-by-ngrinder, whose descriptive materials take the phrase “Performance Driven Development” out for a walk. I am not clear on what ngrinder actually does or how it works.

I can’t lay claim to Performance Driven Development as an original thought, but I maintain it is a very compelling way to tackle performance in software. Stay tuned for specific examples that have worked for our organization.

 

Performance Driven Development

Say what you will about Test Driven Development (TDD), but it was an eye opener for some of us the first time we heard about it. Sure, practically it’s not always going to happen despite the best of intentions, but it did reframe how some developers went about their work and how some organizations ensured a modest, or even slightly increased, amount of testing earlier in the lifecycle. At an abstract level, thinking about TDD improved quality outcomes regardless of the tactical steps.

Recently, trying to describe the characteristics and skills needed for a new development team, I joked that we needed Performance Driven Development (PDD). I got some light agreement (mostly of the smiling emoticon variety), and at least one “Aha, that’s an awesome idea!” for each “That’s totally stupid. You clearly don’t understand either development or performance.” Yeah, OK. Sometimes good ideas seem stupid at first.

(Of course sometimes stupid ideas seem stupid at first but smart later, but then again, I probably shouldn’t point that out.)

Well, as a thought experiment, Performance Driven Development is a chance to express several qualities I want in the new organization we’re trying to build. We want performance engineering integrated into the product creation experience. Not a throw-it-over-the-wall afterthought. We do not want a separate performance team, self-organizing, working on a not-current release, using a synthetic workload, and employing custom-made testing tools. Yeah, we’ve all done that. Some of us got really good at it. There’s a lot of work to do in that environment, lots of good problems to find and solve. But it’s frustrating after a while.

Let’s consider the end result in those cases. Performance really was an afterthought. Code was baked and delivered, and then a team would come along later to “test at scale” or do “load testing.” A necessary, scary bunch of tasks was delayed to the end, when it was perhaps too late to do anything about it. Kind of like the way testing was before TDD.

If we’re trying to improve performance, TDD becomes an instructive model. We want the performance specialists (no longer an independent team) to be integrated into the development team, we want them to work on the most current release, we want them to use real customer workloads, and we want the tools they use to be part of the development experience or easy (easy-ish) to use off-the-shelf tools.

Performance Driven Development means that at every step, performance characteristics are noted and then used to make qualitative decisions. This means that there is some light measurement at the unit level, at the component level, etc., and that measurements occur at boundaries where one service or component hands off to another, and that the entire end-to-end user experience is measured as well.

(Skeptics will note that there’s no mention of load, scale or reliability yet, so bear with me.)

The first goal is that the development team must build not just features but the capability to lightly measure. By measure I mean timestamps or durations spent in calls, and where appropriate and relevant, payload size (how much data passed through, and/or at what rate, etc.). By light, I mean really simple logging or measurement which occurs automatically, and which doesn’t require complex debug flags being set in an .ini file or a special recompilation. The goal is that from release to release, as features change, it will be possible to note whether execution has slowed down or sped up and, relatedly, whether components are transferring more or less data.
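To make “light” a little more concrete, here is a minimal sketch in Python of what always-on measurement could look like: a decorator that logs duration and payload size for every call, with no debug flags or recompilation required. The function and logger names are hypothetical, not anything from our products.

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("perf")

def lightly_measured(func):
    # Log duration and result payload size for every call, always on.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        size = len(result) if hasattr(result, "__len__") else -1
        log.info("%s duration=%.4fs payload_size=%s", func.__name__, elapsed, size)
        return result
    return wrapper

@lightly_measured
def fetch_work_items(project_id):
    # Hypothetical stand-in for a real service call or database query.
    return ["work item"] * 250

From release to release, scanning these one-line log entries is enough to notice whether a call got slower or started returning more data.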

Unit tests would not just indicate { pass / fail }, but, when run iteratively, { min / avg / mean / max / std } values for call time or response time. Of course we’re talking about atomic transactions, but then again, tasks that are lengthy should be measured as well. (Actually, lengthy tasks demand light measurement to understand what’s occurring underneath: the rate of a single-threaded transformation, or the depth of a queue and how it changes over time, etc.)
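As a sketch of what that could look like, here is a Python unittest that runs a hypothetical create_defect operation repeatedly and prints timing statistics alongside the usual pass/fail result. The iteration count and the operation itself are assumptions for illustration.

import statistics
import time
import unittest

def create_defect(payload):
    # Hypothetical operation under test; a real one would hit a service or database.
    time.sleep(0.001)
    return {"id": 1, **payload}

class DefectTimingTest(unittest.TestCase):
    ITERATIONS = 50

    def test_create_defect_timing(self):
        samples = []
        for _ in range(self.ITERATIONS):
            start = time.perf_counter()
            result = create_defect({"summary": "slow save"})
            samples.append(time.perf_counter() - start)
            self.assertIn("id", result)  # the usual pass/fail assertion
        # Report timing statistics alongside pass/fail.
        print("create_defect: min=%.4fs avg=%.4fs max=%.4fs std=%.4fs" % (
            min(samples), statistics.mean(samples), max(samples),
            statistics.stdev(samples)))

if __name__ == "__main__":
    unittest.main()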

I’ve also been talking about performance boundaries. These are the points where different systems interact, where perhaps data crosses an integration point, where a database hands off to an app server, where a mobile client transmits to a web server, or in complex systems where ownership may change. These boundaries need to be understood and documented, and crucially, light measurement must occur at those points (timestamps, payload, etc.).
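One way to get boundary measurement without touching business logic is to wrap the hand-off point itself. Here is a rough sketch, assuming a Python WSGI stack, of middleware that logs a duration and an outgoing payload size at the web-server boundary; the boundary name and log format are made up for illustration.

import logging
import time

log = logging.getLogger("boundary")

class BoundaryTimingMiddleware:
    # Wraps any WSGI app and logs duration plus bytes handed across the boundary.
    def __init__(self, app, boundary_name):
        self.app = app
        self.boundary_name = boundary_name

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        bytes_out = 0
        for chunk in self.app(environ, start_response):
            bytes_out += len(chunk)
            yield chunk
        log.info("boundary=%s path=%s duration=%.4fs bytes_out=%d",
                 self.boundary_name, environ.get("PATH_INFO", ""),
                 time.perf_counter() - start, bytes_out)

The same idea applies at a database or message-queue boundary: a thin wrapper that stamps time and payload size on every hand-off.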

Finally, the end-to-end experience must be measured. This aligns most closely with the majority of performance tools in the market (like LoadRunner and Rational Performance Tester), which record and play back HTTP/HTTPS traffic, simulating end-user behavior in a browser. There are many ways to measure the end-to-end user experience.

In the PDD context, the performance engineer is responsible for ensuring that light measurement exists at the unit, component and boundary levels. The performance engineer assists the team in building the tools and communal expertise to capture and collect such measurements. The performance engineer leads the charge to identify when units or components become slower or the amount of data that passes through them has increased. The performance engineer will also need to use tools like Selenium, JMeter and Locust to produce repeatable actions that can be measured either of the single-user or multiple-user variety.
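For the multiple-user variety, a tool like Locust keeps the scripting inside the development team’s own language. A minimal sketch, with hypothetical endpoints and pacing:

from locust import HttpUser, between, task

class ProductUser(HttpUser):
    host = "https://test-server.example"      # hypothetical test target
    wait_time = between(30, 120)              # seconds of think time between actions

    @task(3)
    def view_dashboard(self):
        self.client.get("/dashboard")         # hypothetical endpoint

    @task(1)
    def run_query(self):
        self.client.get("/query?size=small")  # hypothetical endpoint

Run with a single user to get repeatable single-user timings, or with hundreds to watch how the measurements at the unit, component and boundary levels shift under load.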

I know plenty of really good developers who, once you show them conclusive evidence that the routine they just finished polishing now adds 0.01 second to a common transaction, will go back to the drawing board to get that 0.01 second back and more. They’ll grumble at first, but they’ll be grateful afterwards. Building that type of light measurement can’t be done after the fact. It must be designed in from the beginning. And even if I’m being idealistic about a super-developer discovering extra time in code, we all realize that enhancing features intrinsically gets expensive. Adding more features generally makes things slower; that’s a given tradeoff.

And so what about load and reliability and scalability and all those other things? I’ve spent a lot of time working with complex and messy systems. Solving performance problems and building capacity planning models invariably involves defining and building synthetic workloads. I don’t know how many times I’ve been in situations where the tested synthetic workload is nothing at all like what occurs in real life. I have grown convinced that dedicating a team of smart people to devising a monstrous synthetic workload is misdirected. Yes, interesting problems will be discovered, but many of them will not be relevant to real life. Sure this work creates interesting data sheets and documents, but rarely does it yield useful tools for solving problems in the wild.

(You could argue that the performance engineer quickly learns the tools to solve these problems in the wild, which actually is a benefit of having a team doing this work. However, you’re not improving the root cause, you’re actually training a SWAT team to solve problems you’ve made difficult to diagnose to begin with.)

So then how do you simulate real workloads in a test context? I honestly think that’s not the problem to solve anymore. I know this terrifies some people:

“What do you mean you can’t tell me if it will support 100 or 1000 users?!”

I know that some executives will think that without these user numbers the product won’t sell, that customers won’t put money down for something that will not support some imagined load target. (I’ve muttered about how silly it is to measure performance in terms of users elsewhere.)

Given that most of our workloads involve the cloud, and that even when they don’t, increasing virtual or physical hardware capacity is relatively inexpensive, the problem becomes:

“How does overall performance change as I go from 100 to 200 users, and what can monitoring tell me about what that change looks like so I can predict the growth from N to N+100 users?”

Yes you want your car to go fast, but if you mean to keep that car a long time, you learn how it behaves (the sounds, the responsiveness) as it moves from second gear to third gear, or fourth gear to fifth gear.

The problem we need to solve in the performance domain isn’t about how to drive more load (although it can still be fun to knock things over), the problem to solve is:

“How do you build a complex system which reports meaningful measurement and monitoring information, and is also serviceable, so that as usage grows, light measurement provides a clear indication of how the system is behaving in real time under a real workload?”

This is what I’m trying to get at with the concept of Performance Driven Development. (I won’t lay a claim to creating the phrase. It’s been out in the wild for a while. I’m just putting forth some definitions and concepts which I think it should stand for.)

Two new CLM 5.0 datasheets on the jazz.net Deployment wiki

A quick note about two recent performance datasheets and sizing guidelines published on the Jazz.net Deployment wiki:

Just posted is the Rational Engineering Lifecycle Manager (RELM) performance report 5.0 by Aanjan Hari. This article presents the results of the team’s performance testing for the RELM 5.0 release and deployment suggestions derived from the tests.


Also recently posted: Sizing and tuning guide for Rational DOORS Next Generation 5.0 by Balakrishna Kolla. This article offers guidance based upon the results of performance tests that the team ran on several hardware and software configurations.

 

Software sizing isn’t easy

I’m going to quote pretty much the entirety of an introduction I wrote to an article just posted at the jazz.net Deployment wiki on CLM Sizing (https://jazz.net/wiki/bin/view/Deployment/CLMSizingStrategy):

Whether new users or seasoned experts, customers using IBM Jazz products all want the same thing: They want to use the Jazz products without worrying that their deployment implementation will slow them down, and they want it to keep up with them as they add users and grow. A frequent question we hear, whether it’s from a new administrator setting up Collaborative Lifecycle Management (CLM) for the first time or an experienced administrator tuning their Systems and Software Engineering (SSE) toolset, is “How many users will my environment support?”

Back when Rational Team Concert (RTC) was in its infancy, we built a comprehensive performance test environment based on what we thought was a representative workload. It was in fact based upon the workload the RTC and Jazz teams themselves used to develop the product. We published what we learned in our first Sizing Guide. Later sizing guides include the Collaborative Lifecycle Management 2011 Sizing Guide and the Collaborative Lifecycle Management 2012 Sizing Report (Standard Topology E1). As features were added and the release grew, we started to hear about what folks were doing in the field. The Jazz products, RTC especially, are so flexible that customers were using them with wonderfully different workloads than we had anticipated.

Consequently, we stepped back from proclaiming a one-size-fits-all approach, and moved to presenting case studies and specific test reports about the user workload simulations and the loads we tested. We have published these reports on the jazz.net Deployment wiki at Performance datasheets. We have tried to make a distinction between performance reports and sizing guides. Performance reports document a specific test with defined hardware, datashape and workload, whereas sizing guides suggest patterns or categories of hardware, datashape and workload. Sizing guides are not specific; they are general descriptions of topologies and estimates of the workloads those topologies may support.

Throughout the many 4.0.x release cycles, we were still asked “How many users will my environment support?” Our reluctance to answer this apparently straightforward question frustrated customers new and old. Everyone thinks that as the Jazz experts we should know how to size our products. Finally, after some analysis and messing up countless whiteboards, we would like to present some sizing strategies and advice for the front-end applications in the Jazz platform: Rational Team Concert (RTC), Rational Requirements Composer (RRC)/Rational DOORS Next Generation (DNG) and Rational Quality Manager (RQM). These recommendations are based upon our product testing and analysis of customer deployments.


The article talks about how complex estimating software sizing can be. Besides the obligatory disclaimer, there’s a pointer to the CLM and SSE recommended topologies and a discussion of basic definitions. There’s also a table listing many of the non-product (or non-functional) factors which can wreak havoc with the ideal performance of a software deployment.

Most importantly, the article provides some user sizing basics for Rational Team Concert (RTC), Rational Requirements Composer (RRC)/Rational DOORS Next Generation (DNG) and Rational Quality Manager (RQM). Eventually we’ll talk a bit more about the strategies / concepts needed to determine whether you may need two CCMs or multiple application servers in your environment.

For now, I hope we’re taking a good step towards answering the perennial question, “How many users will my environment support?”, and explaining why it’s so hard to answer that question accurately.

As always, comments and questions are appreciated.

Field notes: Measuring capacity, users and rates


Small, Medium and Large

A frequent question concerns how we might characterize a company’s CLM deployment. Small, medium and large are great for tee-shirts, but not for enterprise software deployments. I admit we tried to use such sizing buckets at one point, but everyone said they were extra-large.

Sometimes we characterize a deployment’s size by the number of users that might interact with it. We try to distinguish between named and concurrent users.

Named users are all the folks permitted to use a product. These are registered users or users with licenses. This might be everyone in a division or all the folks on a project.

Concurrent users are the folks who actually are logged in and working at any given time. These are the folks who aren’t on vacation, but actually doing work like modifying requirements, checking in code, or toying with a report. Concurrent users are a subset of named users.

Generally we’ve seen the concurrent number of users hover around 15 to 25% of the named users. The percentage is closer to 15% in a global company whose users span time zones, and closer to 25% in a company where everyone is in one time zone.

As important as it is to know how many users might interact with a system, user numbers aren’t always an accurate way to measure a system’s capacity over time. “My deployment supports 3000 users” feels like a useful characterization, but it can be misleading because no two users are the same.

Because it invites a simplistic answer, I cringe whenever someone asks me of a particular application or deployment, “How many users does it support?” I know there’s often no easy way to characterize systems, and I confess I often ask countless questions and still end up providing simple numbers. (I’ve tried answering “It supports one user at a time,” but that didn’t go over so well.)

Magic Numbers

“How many users does it support?” is a Magic Number Question, encouraged by Product Managers and abetted by Marketing, because it’s simple and easy to print on the side of the box. Magic Numbers are especially frequent in web-application and website development. I’ve been on death marches on mature software development projects where the goal was simply to hit a Magic Number so we could beat a competing product’s support statement and affix gold starbursts to the packaging proclaiming “We support 500 users.” (It didn’t matter if performance sucked; it was only important to cram all those sardines into the can.)

Let’s look at the Magic Number’s flaws. No two users do the same amount of work. The “500 users” is a misleading aggregate: one company may have 500 lazy employees, whereas another might have caffeinated power-users running automated scripts and batch jobs. And even within the lazy company, no two users work at the same rate.

Ideally we’d use a unit of measure to describe work. A product might complete “500 Romitellis” in an hour, and we could translate end-user actions to this unit of measure. Creating a new defect might count as 1 Romitelli, executing a small query might be 3 Romitellis, and a large query could be 10 Romitellis. But even this model is flawed, as one user might enter a defect with the least amount of data, whereas another user might offer a novel-length description. The difference in data size might trigger extra server work. This method also doesn’t account for varied business logic, which might require more database activity for one project and less for another.
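Purely as an illustration of the idea (the unit and the costs are invented, as is the Romitelli itself), translating actions into work units is just a lookup and a sum:

# Hypothetical work-unit costs per end-user action ("Romitellis").
ROMITELLIS = {
    "create_defect": 1,
    "small_query": 3,
    "large_query": 10,
}

def workload_in_romitellis(actions):
    # Translate a list of end-user actions into an aggregate work-unit count.
    return sum(ROMITELLIS[action] for action in actions)

# An hour containing 200 defect entries and 30 large queries:
print(workload_in_romitellis(["create_defect"] * 200 + ["large_query"] * 30))  # 500

The flaw described above is exactly that the lookup table assigns one cost per action, regardless of data size or business logic.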

Rates

Just as a set number of users is interesting but insufficient, a simple counter to denote usage doesn’t helpfully describe a system’s capacity. We need a combination of the two, or a rate. Rate is a two-part measurement describing how much of an action occurred in how much time. But as essential as rates are, it’s important to realize how rates can be misunderstood.

Consider this statement:

“I drank six glasses of wine.”

On its own, that may not seem so remarkable. But you may get the wrong impression of me. Suppose I say:

“I drank six glasses of wine in one night.”

Now you might have reason to worry. My wife would definitely be upset. And suppose I say:

“I drank six glasses of wine in the month of August.”

That would give a completely different impression. This is where rate comes into play. The first rate is 6 per day; the second is 6 per month. The first is roughly 30 times greater than the second. One rate is more likely to lead to disease and family disharmony than the other.

Let’s move this discussion to product sizing. Consider this statement:

“The bank completes 200 transactions.”

It’s just as unhelpful as the side-of-the-box legend, “The bank supports 200 users.” For this statement to be valuable it needs to be stated in a rate:

“The bank completes 200 transactions in a day.”

This seems reasonable at first glance, as it suggests a certain level of capability. But now we can offer another:

“The bank completes 200 transactions in a second.”

And now the first rate comes into perspective. Realize that rates can have some degree of meaningful variance. When a system is usable and working, it rarely functions at a consistent rate. Daily peaks and valleys are normal. Some months are busier than others. We may have an average rate, but we probably also need to specify an extreme rate:

“The bank completes 200 transactions in an hour on payday just before closing.”

Or even a less-than-average rate:

“The bank completes 20 transactions in an hour on a rainy Tuesday morning in August.”

Imagine we’re testing an Internet website designed to sell a popular gift. The transaction rate in the middle of May should be different than the transaction rate in the weeks between Thanksgiving and Christmas. Therefore we should try to articulate capacity at both average transaction rate and peak transaction rate.

TPH

There’s no perfect solution for measuring user capacity. What we like to do is describe work in units of transactions-per-hour-per-user. This somewhat gets beyond differences in users by characterizing work in terms of transactions, and it also averages different types of users and data sizes to create a basic unit. Of course we could discuss all day what a transaction means; for now, take it to mean a self-contained action in a product (login, create an artifact, run a query, etc.). For many of our tests, we determined that an average rate was 15 transactions-per-hour-per-user (I abbreviate it 15 tph/u). A peak rate was three times as much, or 45 tph/u.

15 tph/u may not seem like very much activity. It’s approximately one transaction every four minutes. But we looked at customer logs, we looked at our own logs, and this is the number we came up with as an average transaction rate for some of our products. Imagine that within an hour you may complete 15 transactions; practically, you might complete more at the top of the hour, get distracted, and complete fewer as the hour goes on. Also, 15 is a convenient number for multiplying and relating to an hour. (Thank the Babylonians for putting 60 minutes into an hour.) Increasing the 15 tph/u rate four times means something happens once every minute.

Returning to the “How many users” question, it’s usually better to answer it in terms of transactions-per-hour: “We support 100 users working at an average rate of 15 transactions-per-hour-per-user, and at 75 transactions-per-hour-per-user during peak periods.” Such statements can make Marketing glaze over, but they’re generally more accurate. Smart companies can look at that statement and say, “In our super-busy company, that’s only 50 users working at 30 transactions-per-hour-per-user on average,” and so forth.
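The arithmetic behind those equivalences is trivial but worth making explicit. A minimal sketch, using the numbers from the statements above:

def total_tph(users, tph_per_user):
    # Aggregate transactions-per-hour for a population of users.
    return users * tph_per_user

def seconds_between_transactions(tph_per_user):
    # How often a single user acts, on average.
    return 3600 / tph_per_user

assert total_tph(100, 15) == total_tph(50, 30) == 1500  # the same average load
assert total_tph(100, 75) == 7500                        # the stated peak load
assert seconds_between_transactions(15) == 240           # one transaction every 4 minutes

Stating capacity as an aggregate rate lets a customer plug in their own user count and per-user rate and see whether the product has been shown to handle the result.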

When might you see us talking in these terms with such attention to rates? We’re working on it. We want to get it right.