Tuesday, 15 January 2013

The basics of infrastructure/app performance troubleshooting.


Hopefully this will be the first in a series of posts trying to demystify the black art of performance testing and analysis on Linux-based infrastructure.

There seems to be a lot of confusion around the processes involved, and how you get results. Note, I'm not saying this is the only way to do it, but these methods do get results!

I'm going to go over the basic steps involved first, and then dive into each step in detail in later posts.



  1. Using all available sources of information, create a mental model of how the system you're testing operates, and how the components interact with each other.
  2. Find a metric that shows the problem, and hence shows any improvements or regressions after changes.
  3. Develop performance tests (synthetic or realistic) that reliably demonstrate this test metric.
  4. Instrument the infrastructure to collect very high-resolution data for metrics across the network, OS, and applications.
  5. Use the information, the data collected, and your test results to form a hypothesis about the source of the problem or bottleneck.
  6. Make a change, to infrastructure or application, to test your hypothesis.
  7. Re-run the tests, noting differences in the performance test results and changes in the infrastructure metrics.
  8. Use these test results to adjust your understanding of the system, and where the problem is.
  9. Repeat steps 5-8
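
To make steps 3 and 7 a bit more concrete, here's a minimal Python sketch of a test harness: it times a workload repeatedly, boils each run down to a median and a 95th percentile, and reports the relative change between a baseline run and a run made after a change. The names run_test, summarise and compare are hypothetical, made up for this illustration, and the workload is whatever callable exercises your system. Treat it as a starting point under those assumptions rather than a finished tool.

```python
import statistics
import time


def run_test(workload, repeats=30):
    """Time `workload` repeatedly and return the per-run durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Reduce one test run to a few figures that can be compared between runs."""
    ordered = sorted(durations)
    n = len(ordered)
    # Nearest-rank 95th percentile, using integer arithmetic for ceil(0.95 * n).
    p95_index = (95 * n + 99) // 100 - 1
    return {
        "runs": n,
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


def compare(baseline, candidate):
    """Relative change per figure; a negative value means the candidate is faster."""
    return {
        key: (candidate[key] - baseline[key]) / baseline[key]
        for key in ("median", "p95")
    }
```

In use, you'd call summarise(run_test(my_workload)) before the change, again after it, and pass both summaries to compare. Keeping the raw durations around as well is worthwhile, since they're what you'll want when a summary figure looks suspicious.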

Overall, the most important thing to remember is to be scientific!

DO:
  • Be ruthless in making sure that the data supports the conclusions you are drawing.
  • Use a configuration management tool like Puppet to record all the changes you're making.
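
One way to hold yourself to the first point above is to check whether a before/after difference could plausibly be noise. The sketch below uses a permutation test, which is my choice of technique for illustration and only one of several reasonable options. The function name permutation_p_value and its parameters are hypothetical. It shuffles the pooled samples many times and counts how often a random split gives a difference in means at least as large as the one you observed.

```python
import random
import statistics


def permutation_p_value(before, after, rounds=2000, seed=0):
    """Estimate how often a difference in means at least this large would
    appear if the change had made no real difference."""
    rng = random.Random(seed)  # fixed seed so repeated analyses agree
    observed = abs(statistics.mean(after) - statistics.mean(before))
    pooled = list(before) + list(after)
    split = len(before)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[split:]) - statistics.mean(pooled[:split]))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)
```

A small value suggests the change really did move the metric; a large one means the data doesn't yet support that conclusion, and more test runs (or a different hypothesis) are needed.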

DON'T:
  • Leave "test changes" in the system if they haven't helped.
  • Test systems that are in production use, if you can possibly avoid it.




