In the previous post, I discussed how to start gathering the information you need to understand the system you're troubleshooting.
Now, let's look at how to select a metric for measuring the state of the problem you need to troubleshoot.
Describing the Problem
Unfortunately, the first contact with a new problem usually starts with a wonderfully unhelpful statement: "My XYZ site/widget/job is slow".
SysAdmins/Support people will recognise this as similar to "My computer doesn't work". It doesn't tell you much about the problem!
So the point of this exercise is to:
- Clarify the problem
- Verify the problem and record it in progress
- Narrow down exactly how to measure the problem
Clarify the problem
Let's take the very simple LAMP stack environment from my previous post as an example.
On closer questioning, the clients report that the website home page suddenly becomes very slow, that this happens at random intervals, and that the site quickly becomes unusable.
Verify the problem and record it in progress
In this case, it's time to break out tcpdump and Wireshark. I'll delve into the details of how to use them to track changes in response time in another post; in the meantime, this post on another blog seems relevant.
Interrogating our imaginary packet capture from in front of the load balancer, we see a large number of incoming HTTP requests. They consist of a few static files and a dynamic request. The response time for the static files seems a little changeable, but the response times for the dynamic requests are hugely variable. Eventually the site dies. Home page requests increase to approximately 20 per second at peak.
You could also verify this by looking at load balancer metrics (if it's smart enough), or by instrumenting the Apache servers to record request time and analysing the logs (only if you're sure the load balancer is not part of the problem!). I'll describe how to start instrumenting Apache in a later post.
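As a rough illustration of the log-analysis route, here is a minimal Python sketch. It assumes the Apache access log uses the common format with `%D` (request time in microseconds) appended as the last field. That log format, the sample lines and the function names are my own assumptions for illustration, not something taken from the environment above.

```python
import re
from collections import defaultdict

# Assumed log format: Apache "common" format with %D (request time in
# microseconds) appended as the final field, e.g.
#   LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<micros>\d+)$'
)

def parse_line(line):
    """Return (timestamp, path, seconds), or None if the line does not match."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    return m.group("ts"), m.group("path"), int(m.group("micros")) / 1_000_000

def summarise(lines, path="/"):
    """Group requests for `path` by second: {timestamp: (count, slowest seconds)}."""
    buckets = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[1] == path:
            buckets[parsed[0]].append(parsed[2])
    return {ts: (len(times), max(times)) for ts, times in buckets.items()}

# Hypothetical sample lines, made up for this example.
sample = [
    '10.0.0.1 - - [12/Mar/2012:10:15:01 +0000] "GET / HTTP/1.1" 200 5120 850000',
    '10.0.0.2 - - [12/Mar/2012:10:15:01 +0000] "GET / HTTP/1.1" 200 5120 12400000',
    '10.0.0.3 - - [12/Mar/2012:10:15:01 +0000] "GET /logo.png HTTP/1.1" 200 900 3000',
]
print(summarise(sample))
```

Because Apache's default timestamp has one-second resolution, grouping on it gives both the request rate and the slowest response for each second.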
Narrow down exactly how to measure the problem
The accepted standard for the time a user will wait before getting bored is about 10 seconds.
If we can sustain the load seen at peak (20 home page requests per second) and keep the full request time under 10 seconds, then our problem has improved.
If the response time is worse, or very irregular, then we're moving in the wrong direction.
We now have a metric to measure the problem by.
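To make that metric concrete, here is a small Python sketch of the check. It assumes you already have one `(second, response_time_in_seconds)` pair per home page request, taken from a capture or from logs. The function name, the input shape and the sample data are hypothetical.

```python
from collections import defaultdict

TARGET_RATE = 20      # home page requests per second seen at peak
MAX_RESPONSE = 10.0   # seconds a user is assumed to be willing to wait

def evaluate(samples, target_rate=TARGET_RATE, max_response=MAX_RESPONSE):
    """Check response times during the seconds where load reached target_rate.

    samples: iterable of (second, response_time_seconds) pairs.
    "ok" is False if the target load was never reached, because in that
    case nothing has been demonstrated either way.
    """
    per_second = defaultdict(list)
    for second, response_time in samples:
        per_second[second].append(response_time)

    peak_rate = max((len(times) for times in per_second.values()), default=0)
    at_load = [t for times in per_second.values()
               if len(times) >= target_rate for t in times]
    worst = max(at_load, default=None)
    return {
        "peak_rate": peak_rate,
        "worst_at_load": worst,
        "ok": worst is not None and worst < max_response,
    }

# Made-up data: 20 requests in one second, all answered in 2 seconds...
good = [(0, 2.0)] * 20
# ...versus the same load with one request taking 14 seconds.
bad = [(0, 2.0)] * 19 + [(0, 14.0)]

print(evaluate(good))
print(evaluate(bad))
```

Running the same check after each change gives a consistent way to tell whether you are moving in the right direction.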