• Testing in Production with real users in real data centers will be a necessity for any high-performing, large-scale software service.
  • Testers will leverage the Cloud to achieve unprecedented effectiveness and productivity.
  • Software development organizations will dramatically change the way they test software and how they organize to assure software quality, and the testing profession will change with them.

Is this really the future? Well, maybe. Any attempt to predict the future will almost certainly be wrong. What we can do is look at current trends – whether nascent or well on their way to being established practice – and make some educated guesses.
Here we will cover Testing in Production. The other predictions will be explored in subsequent editions of Testing Planet.

Testing in Production (aka TiP)

Software services such as Gmail, Facebook, and Bing have become an everyday part of the lives of millions of users. They are all considered software services because:

  • Users do not (or do not have to) install desktop applications to use them
  • The software provider controls when upgrades are deployed and features are exposed to users
  • The provider also has visibility into the data center running the service, with access to system data, diagnostics, and even user data (subject to privacy policies).

Figure 1. Services benefit from a virtuous cycle which enables responsiveness

It is these very features of the service that enable us to TiP. As Figure 1 shows, if software engineers can monitor production, they can detect problems as the first effects reach users, or even before. They can then create and test a remedy, and deploy it before the problem has significant impact. When we TiP, we are deploying the new and “dangerous” system under test (SUT) to production. The cycle in Figure 1 helps mitigate the risk of this approach by limiting the time users are potentially exposed to problems found in the system under test.
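To make the monitoring half of this cycle concrete, here is a minimal sketch of a production health monitor. The endpoint URL, polling interval, latency threshold, and print-based alert are all illustrative assumptions; a real service would feed much richer telemetry into its alerting and rollback systems.

```python
# Minimal sketch of the monitor-detect stage of the virtuous cycle.
# HEALTH_URL, the thresholds, and the print-based alert are illustrative
# assumptions, not part of any particular service.
import time
import urllib.request

HEALTH_URL = "https://example.com/health"  # hypothetical health endpoint
POLL_SECONDS = 30
MAX_LATENCY_SECONDS = 2.0

def check_once() -> bool:
    """Return True if the service answers HTTP 200 within the latency budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=MAX_LATENCY_SECONDS) as resp:
            ok = resp.status == 200
    except OSError:  # URLError/HTTPError and socket timeouts are OSErrors
        ok = False
    return ok and (time.monotonic() - start) <= MAX_LATENCY_SECONDS

def monitor() -> None:
    """Poll forever; in a real service a failure would page an engineer
    or trigger an automated rollback of the system under test."""
    while True:
        if not check_once():
            print("ALERT: health check failed")
        time.sleep(POLL_SECONDS)
```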

Figure 2. Users are weird - They do unanticipated things

But why TiP? Because our current approach of Big Up-Front Testing (BUFT) in a test lab can only ever approximate the true complexities of the operating environment. One of our skills as testers is to anticipate edge cases and understand environments, but in the big wide world users do things even we cannot anticipate (Figure 2), and data centers are hugely complex systems unto themselves, with interactions between servers, networks, power supplies, and cooling systems (Figure 3).

TiP, however, is not about throwing untested rubbish at users’ feet. We want to control risk while driving improved quality:

Figure 3. Data Centres are complex

  • The virtuous cycle of Figure 1 limits user impact by enabling fast response to problems.
  • Up-Front Testing (UFT) is still important – just not “Big” Up-Front Testing (BUFT). Do the right amount of up-front testing – but no more. While there are plenty of scenarios we can test well in a lab, we should not enter the realm of diminishing returns by trying to simulate all of production in the lab (Figure 4).
  • For some TiP methodologies we can reduce risk by reducing the exposure of the new code under test. This technique, called “Exposure Control”, limits risk by limiting the user base potentially impacted by the new code (a minimal sketch follows this list).
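As a concrete illustration, here is a minimal sketch of Exposure Control, assuming a simple hash-based bucketing scheme. The bucketing function, the starting percentage, and the two code-path stubs are hypothetical, not any particular product’s implementation.

```python
# Minimal sketch of Exposure Control: only a small, stable slice of users
# is routed to the new ("dangerous") code path. The bucketing scheme and
# the two code-path stubs are hypothetical.
import hashlib

def in_exposed_group(user_id: str, exposure_percent: float) -> bool:
    """Deterministically map a user to a bucket in [0.00, 100.00)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = (int(digest[:8], 16) % 10000) / 100.0
    return bucket < exposure_percent

def old_code_path(user_id: str) -> str:
    return f"old result for {user_id}"   # proven production code

def new_code_path(user_id: str) -> str:
    return f"new result for {user_id}"   # system under test

def handle_request(user_id: str) -> str:
    # Start at 1% exposure and ramp up as confidence grows.
    if in_exposed_group(user_id, exposure_percent=1.0):
        return new_code_path(user_id)
    return old_code_path(user_id)
```

Because the hash is deterministic, each user consistently sees the same version, so ramping from 1% to full exposure is just a matter of raising one number.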

Figure 4. Value spectrum from No Up-Front Testing (UFT) to Big Up-Front Testing (BUFT)

TiP Methodologies

As an emerging trend, TiP is still new, and the nomenclature and taxonomy are far from finalized. But in working with teams at Microsoft, as well as reviewing the publicly available literature on practices at other companies, we have identified 11 TiP methodologies (Table 1).

  • Ramped Deployment: Launching new software by first exposing it to a subset of users, then steadily increasing user exposure. The purpose is to deploy; assessment may be included. Users may be hand-picked or aware they are testing a new system.
  • Controlled Test Flight: Parallel deployment of new code alongside old, with random, unbiased assignment of unaware users to each. The purpose is to assess the quality of the new code before full deployment. May be part of a ramped deployment.
  • Experimentation for Design: Parallel deployment of a new user experience alongside the old one. The former is usually well tested prior to the experiment. Random, unbiased assignment of unaware users to each. The purpose is to assess the business impact of the new experience.
  • Dogfood/Beta: User-aware participation in using new code, often by invitation. Feedback may include telemetry, but is often manual and asynchronous.
  • Synthetic Test in Production: Functional test cases using synthetic data, usually at the API level, executed against in-production systems. “Write once, test anywhere” is preferred: the same test can run in a test environment and in production. Synthetic tests in production may make use of production monitors and diagnostics to assess pass/fail.
  • Load/Capacity Test in Production: Injecting synthetic load onto production systems, usually on top of existing real-user load, to assess system capacity. Requires careful (often automated) monitoring of the SUT and back-off mechanisms.
  • Outside-In Load/Performance Testing: Synthetic load injected at (or close to) the same point of origin as user load, from distributed sources. End-to-end performance, including one or more cycles from user to SUT and back to the user, is measured.
  • User Scenario Execution: End-to-end user scenarios executed against the live production system from (or close to) the same point of origin as user-originated scenarios. Results are then assessed for pass/fail. May also include manual testing.
  • Data Mining: Test cases search through real user data looking for specific scenarios. Those that fail their specified oracle are filed as bugs (sometimes in real time).
  • Destructive Testing: Injecting faults into production systems (services, servers, and networks) to validate service continuity in the event of a real fault.
  • Production Validation: Monitors in production check continuously (or on deployment) for file compatibility, connection health, certificate installation and validity, content freshness, etc.

Table 1. TiP Methodologies Defined

Examples of TiP Methodologies in Action

To bring these methodologies to life, let’s delve into some of them with examples.

Experimentation for Design and Controlled Test Flight are both variations of “Controlled Online Experimentation”, sometimes known as “A/B Testing”. Experimentation for Design is the better known of the two: changes to the user experience, such as different messaging, layout, or controls, are launched to a limited number of unsuspecting users, and measurements are collected from both the exposed users and the un-exposed (control) users. These measurements are then analyzed to determine whether the proposed change is an improvement. Both Bing and Google make extensive use of this methodology. Eric Schmidt, former Google CEO, reveals: “We do these 1% launches where we float something out and measure that. We can dice and slice in any way you can possibly fathom.”[1]
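A minimal sketch of such an experiment appears below. The deterministic hash-based assignment keeps each unaware user in the same variant across visits; the 50/50 split and the click-through metric are illustrative assumptions.

```python
# Minimal sketch of a controlled online experiment ("A/B test"): unaware
# users get a stable, unbiased variant assignment, and a per-variant metric
# is aggregated for later analysis.
import hashlib
from collections import defaultdict

counts = defaultdict(lambda: {"users": 0, "clicks": 0})

def variant(user_id: str, experiment: str) -> str:
    """Hash user and experiment together so assignment is stable per user
    and independent across experiments."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "B" if int(h[:8], 16) % 2 else "A"

def record_visit(user_id: str, clicked: bool, experiment: str = "new-layout") -> None:
    v = variant(user_id, experiment)
    counts[v]["users"] += 1
    counts[v]["clicks"] += int(clicked)

def click_through_rates() -> dict:
    """Compare variants; a real analysis would also test statistical significance."""
    return {v: c["clicks"] / max(c["users"], 1) for v, c in counts.items()}
```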

Controlled Test Flight is almost the same thing, but instead of a new user experience, the next “dangerous” version of the service is tested against the tried and true one already in production. Often both methodologies are executed at the same time, assessing both the user impact and the quality of the new change. For example, Facebook looks at not only user behavior (e.g., the percentage of users who engage with a Facebook feature), but also error logs, load, and memory when they roll out code in several stages[2]:

  1. internal release
  2. small external release
  3. full external release

Testing a release internally like this can also be considered part of the Dogfood TiP methodology.

Controlled Test Flight can also be enabled via a TiP technique called Shadowing, where new code is exposed to user traffic, but users are never exposed to the new code’s output. An example of this approach was Google’s first test of Google Talk. The presence status indicator presented a challenge for testing, as the expected scale was billions of packets per day. Without seeing it or knowing it, users of Orkut (a Google product) triggered presence status changes in back-end servers, where engineers could assess the health and quality of that system. This approach also utilized the TiP technique Exposure Control, as initially only 1% of Orkut page views triggered the presence status changes, which was then slowly ramped up[3]. A simplified sketch of Shadowing follows.
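In this sketch the user’s request is always answered by the proven old system, while a copy is forwarded to the new system in the background purely for engineers to observe. All of the function names are stand-ins, not Google’s implementation.

```python
# Minimal sketch of Shadowing: the user is served by the proven old system,
# while a copy of the request exercises the new system in the background.
import threading

def old_system(request: dict) -> str:
    return f"old answer to {request['q']}"

def new_system(request: dict) -> str:
    return f"new answer to {request['q']}"

def shadow(request: dict) -> None:
    try:
        result = new_system(request)
        # In practice the result is logged and diffed against the old
        # system's output; it is never shown to the user.
        print(f"shadow result: {result}")
    except Exception as exc:
        print(f"shadow failure (user unaffected): {exc}")

def handle(request: dict) -> str:
    # Fire-and-forget: a failure in the new code cannot hurt the user.
    threading.Thread(target=shadow, args=(request,), daemon=True).start()
    return old_system(request)  # the user only ever sees this
```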

As described, Destructive Testing, the deliberate killing of services and servers running your production software, might sound like a recipe for disaster. But the random and unexpected occurrence of such faults is a certainty in any service of substantial scale. In one year, Google expects to see 20 rack failures, three router failures, and thousands of server failures[4]. If these failures are sure to occur, it is the tester’s duty to assure the service can handle them when they do.

A good example of such testing is Netflix’s Simian Army. It started with their “Chaos Monkey”, a script deployed to randomly kill instances and services within their production architecture. “The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables.”[5] They then took the concept further with other jobs, each with its own destructive goal: Latency Monkey induces artificial delays, Conformity Monkey finds instances that don’t adhere to best practices and shuts them down, and Janitor Monkey searches for unused resources and disposes of them[6]. A simplified sketch of the Chaos Monkey idea follows.
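This sketch captures only the core idea of randomly terminating an instance on a schedule. The inventory and terminate functions are stand-ins for cloud-provider API calls, and the kill probability is an arbitrary assumption rather than anything Netflix publishes.

```python
# Minimal sketch of a Chaos-Monkey-style destructive test: randomly
# terminate one instance so the team can verify the service degrades
# gracefully.
import random

def list_instances():
    # Stand-in for an inventory call to the cloud provider.
    return ["web-1", "web-2", "worker-1", "worker-2"]

def terminate(instance):
    # Stand-in for the provider's terminate-instance call.
    print(f"terminating {instance}")

def unleash_monkey(kill_probability=0.5):
    """On each scheduled run, maybe kill one randomly chosen instance."""
    if random.random() < kill_probability:
        terminate(random.choice(list_instances()))
```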

Synthetic Tests in Production may be more familiar to the tester new to TiP. It would seem to be just running the tests we’ve always run, but against production systems. In production, however, we need to be careful to limit the impact on actual users, and the freedoms we enjoy in the test lab are more restricted. Proper test data handling is essential in TiP: real user data should not be modified, while synthetic data must be identified and handled in such a way as not to contaminate production data. Also, unlike the test lab, we cannot depend on “clean” starting points for the systems under test and their environments.

The Microsoft Exchange team faced the challenge of bringing their large, complex, business-class enterprise product to the cloud to run as a service, while continuing to support their enterprise “shrink-wrap” product. For the enterprise product they had 70,000 automated test cases running on a 5,000-machine test lab. Their solution was to:

  • Re-engineer their test automation, adding another level of abstraction to separate the tests from the machine and environment they run on
  • Create a TiP framework running on Microsoft Azure to run the tests

This way the same test can be run in the lab to test the enterprise edition and run in the cloud to test the hosted service version (a minimal sketch of this abstraction follows the list below). By leveraging the elasticity of the cloud, the team is able to:

  • Run tests continuously, not just at deployment.
  • Use parallelization to run thousands of tests per run.
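Here is a minimal sketch of such a test-environment abstraction, assuming a hypothetical send_mail operation. The interface, the two bindings, and the synthetic account names are illustrative, not the Exchange team’s actual framework.

```python
# Minimal sketch of "write once, test anywhere": the test drives an abstract
# environment, and a lab or production binding is chosen at run time.
from abc import ABC, abstractmethod

class Environment(ABC):
    @abstractmethod
    def send_mail(self, sender: str, recipient: str, body: str) -> bool: ...

class LabEnvironment(Environment):
    def send_mail(self, sender, recipient, body):
        return True  # would drive a test-lab deployment; stubbed here

class ProductionEnvironment(Environment):
    def send_mail(self, sender, recipient, body):
        # Would use clearly marked synthetic accounts so real user data
        # is never modified or contaminated.
        return True

def test_mail_roundtrip(env: Environment) -> None:
    assert env.send_mail("synthetic-a@test", "synthetic-b@test", "ping")

# The very same test runs against either environment:
test_mail_roundtrip(LabEnvironment())
test_mail_roundtrip(ProductionEnvironment())
```

Separating the test logic from its binding is what lets one suite serve both the shrink-wrap product in the lab and the hosted service in production.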

 
