Tuesday, 23 June 2015

Asynchronous Cloud bootstrapping with Terraform, Cloud-Init & Puppet


Asynchronous Cloud bootstrapping with Terraform, Cloud-Init & Puppet

How does your organisation deploy cloud environments? Is it completely automated? Is it completely "Infrastructure as Code"?

Working with OpenCredo clients, I've noticed that even if you are one of the few organisations that can boast 'Infrastructure as Code', perhaps it's only true of your VMs, and likely you have 'bootstrap problems'. What I mean by this, is that you require some cloud-infrastructure to already be in place before your VM automation can go to work.

I've seen a lot of AWS (Openstack, Google Compute too ) environments where objects like VPCs, subnets, routing etc. is all put in by hand - or maybe with some static scripts. A little beachhead manually deployed inside your cloud-provider, all before the VMs are spun up and provisioned automagically by your tools.

Wouldn't it be better if your Cloud networking and associated configuration was deployed at the same time, and dynamically linked to the VMs being provisioned?

Even better - what if scaling out everything across A-Zs, or changing machine types across your estate was just a configuration parameter? How about a second copy of everything for DR or staging?

Zero-to-cloud in one key press.

It's been possible to do this for a while. A few options include.
  • Amazon Cloud Formation - useful, but locks you in to AWS
  • Fog - very powerful, but tough to use, every Cloud provider interface has it's own idiosyncrasies. eg. AWS' eventual consistency!
  • Low level cloud cli tools eg. aws-cli, gcloud-cli - results tend to be very deployment-specific - hard to re-use the scripts for new environments.
  • Proprietary eg. Rightscale

Terraform

Terraform is the tool I've been waiting for for years. It's only been going a year or so, but has come along leaps and bounds in the last 6 months, and as ever, Hashicorp are very responsive on Github to feature requests, bugs and pull requests.

It has the ability to do complete cloud-infrastructure creation, and through it's DSL link the various components together programatically.

This definition of useful parts of cloud-infrastructure can be defined as a module, with parameterised inputs - and used again and again in different deployments, in combination with other modules.

Why is this powerful? A full worked example of Terraform using these capabilities would be a set of  blog posts by themselves, but using AWS as an example, let's look at an overview of what you could do:

Management module - Define a VPC, subnet, NAT box, appropriate routing, security groups,  and a Puppetmaster.

Front-end web-server module - Define a tier of 5 front-end webservers in their own subnet - link them dynamically to a public ELB.

Middleware web-server module - Define a tier of 5 messaging brokers and 5 custom middleware servers - link them to an ELB

DB layer module - Define a tier of 5 MongoDB servers

All modules -  Automatically create DNS records for all infrastructure. Consume security groups from the Management module. Parameterise the module input to create this across A-Zs on demand, including sizing and server numbers.

The Terraform code that calls these modules can pass details between them, so for instance an ELB in one module can be made aware of an instance ID created in another module.

Now we have modules that can be sized, re-used and linked-together as appropriate, we can define environments using them. The same provisioning code can be used in Test, Staging and Production - with suitable input parameters.

Cloud-Init

While Terraform can directly call hooks for provisioning tools, ( eg. Puppet or Chef. ) usually the VMs it creates are in private networks, not accessible from the internet, or the machine running Terraform.

The problem becomes, how to pass data to VMs for use in initial bootstrapping, where you can't contact the VMs themselves?

Cloud-Init uses Instance Metadata ( a free-form text field associated with a VM), retrieved over a private internal cloud API, to do first-boot configuration. Most cloud providers have this metadata concept, including Openstack and Google Compute.

Examples of templated information passed from Terraform to Cloud-Init via instance metadata:
eg.
What's my hostname ?
Find, format, mount secondary disks.
Where is my config management service?

It's important to note that Cloud-Init is not a configuration management tool - here it's merely a thin shim between Terraform and proper configuration management.

Puppet

Why Puppet? 

It's true there are a few other tools we could substitute here, but the important thing is to have a 'pull' based deployment model.

Why is this necessary?

As mentioned before, we cannot directly contact the VMs being spun up in private networks, a pull-based model allows the VMs themselves to pull their configuration when they're ready.

We could do master-less puppet, with each VM doing a Github checkout of puppet-code, and locally applying it, but this relies on external code repositories, and puts a full copy of secrets on each VM, both are security risks.

Self contained, Fire and Forget

Cloud deployments made using this stack of tools are truly self contained, fire-and-forget.

The only connectivity requirement the provisioning user/service needs is to the Cloud Provider API.

Run Terraform, wait for it to complete in a few minutes - then the asynchronous provisioning takes over as the VMs boot to bring your infrastructure into service.

Friday, 28 November 2014

Cloud Paranoia? - Cloud Foundry, Confidential data, Public clouds.


I've been working with Cloud Foundry (CF) & BOSH for close to two years now. One repeating theme I've seen is customers asking for more and more protection for their data and network traffic, especially when CF is running in third party Cloud Infrastructure environments outside of their direct control.

There are two security scenarios I want to focus on, but this applies to almost all CF components, in both Public and Private infrastructure


Scenario 1 - Sensitive CF Application traffic.


CFs primary function is to allow Developers to build, run and scale applications easily, and it does an excellent job at this.

But what if you want to handle sensitive information with the applications you run on CF?

Here's a simplified diagram of the network interactions inside CF, and their protection.

We can see that a malicious cloud infrastructure provider (or a disgruntled employee, or talented blackhat with an APT) could intercept all the application traffic as it passes through the CF router in plain text. You're relying on the security of the infrastructure provider to protect your traffic.

Convincing auditors and security teams that credit card information, bank details, etc. are adequately protected could be problematic.



Scenario 2 - Admin/Developer Security


CF by default uses the same channels for Admin and Developer access as it does serving Application traffic. It's susceptible to network sniffing at the CF router, or the UAA.

If UAA tokens or user credentials are obtained through this method, this gives an attacker access to the CF API to make whatever changes the captured credentials have permission for.


Perimeter Security vs Defense in Depth


CF uses traditional perimeter security models, with some exceptions for internal(-ish) login and UAA calls.

We can enhance security by not trusting the internal networks and encrypting everything.

There are two ways we can do this. The first is the way CF has been evolving so far, with application protocol level encryption. This is effective, but the implementation of SSL in different components and frameworks varies enormously, and maintaining SSL keychains for all components if a maintenance headache.

The second way is ....


IPSEC


IPSEC has been around in it's current form since the early 2000s, it's mature and is supported natively by the Linux kernel. It's most well known as the basis for remote-access VPNs, but a
lesser known use is encrypting traffic between computers on the same network.



How does IPSEC work with Cloud Foundry?


I've developed a BOSH release that enforces IPSEC communication between BOSH deployed systems. Deployed on VMs alongside CF components, it transparently encrypts all traffic in transit.

No CF code changes are necessary, just a co-located IPSEC release alongside CF, and a customised deployment manifest.

An IPSEC policy is added for each member that is participating in the IPSEC mesh. This makes it possible to be selective about what is protected, as well as keeping traffic to unrelated systems (DNS, log servers, etc.) unencrypted.

An IKE daemon is used to rotate the encryption keys on a user defined schedule. The exact encryption policies are configurable through the BOSH manifest.

Future Cloud Foundry security features by CloudCredo.


CloudCredo are continually working on  features to improve the security and usability of Cloud Foundry.

Monday, 4 February 2013

Production quality Play apps. - part 3 - Using JMX to monitor and tune Play applications


Play Framework applications use the JVM in the same way as any other Java application. In the previous article, I showed a start script for a Netty based play application that enabled JMX in a secure way, so we can monitor the status of the JVM.

Finding JMX connectivity details


Here's an example of a running Play process.


995      18203     1  1 03:49 ?        00:09:17 java -Dhttp.port=9000 -Dhttp.address=0.0.0.0 -Dcom.sun.management.jmxremote.port=7020 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/etc/play/play_app/jmxremote.password -Dcom.sun.management.jmxremote.access.file=/etc/play/play_app/jmxremote.access -Dcom.sun.management.jmxremote.authenticate=true -Djava.rmi.server.hostname=10.11.12.13 -XX:+UseConcMarkSweepGC -Xloggc:/opt/play_app/logs/gc.log -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -verbose:gc -Xms256M -Xmx256m -server -Dconfig.file=/etc/play/play_app/env-application.conf -Dlogger.file=/etc/play/play_app/env-logger.xml -cp /opt/play_app/lib/* play.core.server.NettyServer /opt/play_app/

From this, you can extract the JMX port (jmxremote.port=7020), and find the JMX username/password file (jmxremote.password) - All three bits of information are necessary to connect.

GC Logging


We can also check the GC log if necessary (Xloggc:/opt/play_app/logs/gc.log)  - The information isn't quite as good as using JMX directly, but is useful for a quick diagnosis of a problem.

Essentially, lines like this show normal GC activity.

2013-02-04T17:06:16.627+0000: 47827.125: [GC 383171K->349037K(1044352K), 0.0158340 secs]
2013-02-04T17:12:34.481+0000: 48204.979: [GC 383149K->349028K(1044352K), 0.0193200 secs]
2013-02-04T17:18:56.583+0000: 48587.081: [GC 383140K->349008K(1044352K), 0.0204980 secs]
2013-02-04T17:25:17.088+0000: 48967.586: [GC 383120K->349015K(1044352K), 0.0058240 secs]

Note the last field - GC time in seconds. This should be nice and low - certainly sub-second.

Out-of-memory problems would lead to a log with repeated Full GC lines - similar to:

2013-01-18T10:53:42.184+0000: 4886909.890: [Full GC 2088630K->2055966K(2088640K), 6.4534550 secs]
2013-01-18T10:53:51.453+0000: 4886916.393: [GC 2080393K(2088640K), 0.0550080 secs]
2013-01-18T10:53:51.576+0000: 4886919.258: [Full GC 2088639K->2053535K(2088640K), 6.2507810 secs]
2013-01-18T10:54:00.613+0000: 4886925.553: [GC 2077526K(2088640K), 0.0533610 secs]
2013-01-18T10:54:00.728+0000: 4886928.562: [Full GC 2088633K->2055761K(2088640K), 6.3442360 secs]
2013-01-18T10:54:09.999+0000: 4886934.939: [GC 2079647K(2088640K), 0.0538060 secs]
2013-01-18T10:54:10.115+0000: 4886937.753: [Full GC 2088638K->2057070K(2088640K), 6.1674520 secs]
2013-01-18T10:54:19.016+0000: 4886943.956: [GC 2081495K(2088640K), 0.0553910 secs]
2013-01-18T10:54:19.118+0000: 4886946.803: [Full GC 2088639K->2054971K(2088640K), 6.1280980 secs]
2013-01-18T10:54:28.024+0000: 4886952.964: [GC 2081821K(2088640K), 0.0521580 secs]
2013-01-18T10:54:28.106+0000: 4886956.119: [Full GC 2088639K->2060868K(2088640K), 6.3114750 secs]
2013-01-18T10:54:37.534+0000: 4886962.475: [GC 2081790K(2088640K), 0.0520260 secs]
2013-01-18T10:54:37.636+0000: 4886965.448: [Full GC 2088639K->2058876K(2088640K), 6.1640440 secs]
2013-01-18T10:54:46.712+0000: 4886971.653: [GC 2080842K(2088640K), 0.0571690 secs]
2013-01-18T10:54:46.807+0000: 4886974.582: [Full GC 2087100K->2059338K(2088640K), 6.2864710 secs]
2013-01-18T10:54:55.971+0000: 4886980.911: [GC 2083885K(2088640K), 0.0530780 secs]
2013-01-18T10:54:56.059+0000: 4886983.794: [Full GC 2088605K->2058919K(2088640K), 6.3044720 secs]
Note that Full GC is taking approx 6 seconds each time here, making the application unusable.

Using Jconsole with the JVM running Play.


Jconsole is an application shipped with the Oracle JVM (and others) that enables us to connect to the JVM remotely via JMX and view internal metrics. Here's a very brief introduction to what it can do.



The preferred way to run it, is to run jconsole on your personal machine, and connect remotely to the application JVM. Start it up, feed it:
the address of the machine running the application, the port it's running on
the username and password for JMX.







Here's Jconsole's overview screen. Important things to note here are CPU usage of the application, and the memory consumption. 

This sawtooth memory consumption graph is ideal - The base of the sawtooth should be flat - indicating memory usage after garbage collection is stable, and not increasing. Continuously increasing base memory usage could indicate a memory leak, code deficiencies, or simply insufficient memory assigned.




Here's the display of the memory and GC statistics. Compare the graph values to the Max values on the display below to show much of the heap has been used.

Keep an eye on the GC time values in the lower value.
The New Generation (ParNew in this case) should tick up gradually, but in a tiny fraction of realtime.
The Old Generation (CMS in this case), should tick up very occassionally - and in tens to hundreds of a second chunks.


Explaining how to do GC tuning and analysis of memory usage is worth a series of articles on it's own, and this is a subject I'll cover at a later date. Until then, here a good article to get you started.

Collecting JMX data

GC logging is a relatively blunt tool, though log analysis tools like GCViewer help. Similarly, you can't really leave jconsole connected 24/7 to see what your application is doing.

The answer is to collect selected JMX metrics with another application, and store them. There are many applications that can do this - some of my favourites are.


Having high-resolution JMX data at the time a problem occurred can be invaluable in solving problems, as you can trace what was happening immediately before the problem occurred.






Production quality Play apps. - part 2 - Puppet module


In the previous article, I went through a few common options for getting your Play code onto a target system in a controlled manner.

Now, here's the puppet module I use for configuring and running the Play application itself - you can download it from http://www.jmips.co.uk/play_framework/ -  This should be easy to get working with some basic Puppet experience.

Unpack play_framework_apps_with_puppet.tar.gz , and you should see two modules - "example_play_app", and "play_framework".

The "play_framework" module is a puppet defined type. The idea is you call it, as in  example_play_app/manifests/init.pp

  # this defined-type does all the heavy lifting to create a service
  # for play_framework stuff, we just need to feed it a name and uid/gid
  play_framework::play_service { "${module_name}":
    play_uid => '995',
  }

Example Play App module


By default, they play app will take it's name from the module name - so likely you'll want to rename it at some point. After you've renamed the module directory and class name, you need to choose a free UID to run your play application under. The name of the user created will be named after the application. Beware - avoid hypthens in puppet module names! It is possible to use hyphens, but it will complicate things.

After you've chosen a name and UID, take a look at the templates directory for the example play apps.

./templates/env-logger.xml.erb
./templates/start_script.erb
./templates/env-application.conf.erb
The first file is the logger configuration - you should configure your play application to look for this file, and configure it appropriately

The second is the script used to start the application - you should adjust this to your needs.

#!/usr/bin/env sh
exec java -Dhttp.port=9000 \
-Dhttp.address=0.0.0.0 \
-Dcom.sun.management.jmxremote.port=7020 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.password.file=/etc/play/<%= module_name %>/jmxremote.password \
-Dcom.sun.management.jmxremote.access.file=/etc/play/<%= module_name %>/jmxremote.access \
-Dcom.sun.management.jmxremote.authenticate=true \
-Djava.rmi.server.hostname=<%= ipaddress %> \
-XX:+UseConcMarkSweepGC \
-Xloggc:/opt/<%= module_name %>/logs/gc.log \
-XX:-OmitStackTraceInFastThrow \
-XX:+PrintGCDateStamps \
-verbose:gc \
-Xms256M -Xmx256m \
-server  \
-Dconfig.file=/etc/play/<%= module_name %>/env-application.conf \
-Dlogger.file=/etc/play/<%= module_name %>/env-logger.xml \
-cp "/opt/<%= module_name %>/lib/*" play.core.server.NettyServer /opt/<%= module_name %>/

In this application we're using Netty, configured to run on port 9000 and JMX is enabled with a username/password. Java heap size is set to 256Mb, CMS GC is enabled, GC logging is enabled. All of the file locations use the puppet module name.

The third file -  env-application.conf.erb  - contains application specific configuration. Use this to set configuration that varies between environments - ie where a database server is located.

Play Framework module

The defined type module may need some adjusting for your environment, the one I use is configured for Ubuntu, using Upstart to run the Play service.

Again, take a look in the templates directory for this module.
./templates/jmxremote.password.erb
./templates/play_upstart.conf.erb
./templates/jmxremote.access.erb
The JMX files relate to the usernames/passwords and permissions for JMX connectivity. See this document to set this up correctly. My templates use puppet variables to allow for multiple play apps/jmx passwords. Either set these variables in your site manifests via extlookup, or hardcode the JMX config into the modules (not recommended for good security!)

The Upstart file is used to start, stop and monitor the Play application.

description "<%= play_service_name %>"
start on filesystem
stop on runlevel [!2345]
respawn
respawn limit 10 5
umask 022
oom never
env PLAY_SERVICE_HOME="/opt/<%= play_service_name %>"
post-stop script
        rm ${PLAY_SERVICE_HOME}/RUNNING_PID || true
end script
exec start-stop-daemon --start -c <%= play_service_name %> -d ${PLAY_SERVICE_HOME} --exec /etc/play/<%= play_service_name %>/start

Upstart will ensure the application is running, monitor it via it's PID, and restart it if it crashes. You should be able to start/stop/restart the service with "service example_play_app <stop/start/restart>"



Hopefully this should be enough to get your play application started and running. Next time, I'll briefly cover using JMX to monitor and tune the JVM of your Play application.

Friday, 1 February 2013

Production quality Play apps. - part 1 - Automatic Deployment

I've been working with a lot of Play applications recently, and was asked to share how I built the automatic deployment, config management, integration into Linux, etc.

In this post I'll cover the automatic deployment options, and it will be solely focused on how to get binaries onto systems in a controlled and repeatable fashion.


Automatic Deployment 


This all assumes you have a working build of your Play application already.

I'm showing the raw commands to run to get it to work, but this all should be scripted/automated with your automation system of choice. Here's an example shell script, but Mcollective or Saltstack should work too.

Method 1 - Quick and dirty - Deploy straight from Jenkins


This is quite simple, and perfect for a fast deploy from a Build/Test system. I would recommend not using this for production deployments, as it's difficult to guarantee the exact version of what you're deploying!

This assumes you have a working Play app build in Jenkins, and successfully built binaries, usually with SBT's "dist" command.

On your target server to run the Play application.
wget -nv http://<jenkins_server>/job/<play_jenkins_job>/lastSuccessfulBuild/artifact/*zip*/archive.zip -O /tmp/play_deploy/archive-<play_app_name>.zip

This just retrieves the latest Jenkins artifact for the build job, and writes it to a known location on disk
 unzip -q -o /tmp/play_deploy/archive-<play_app_name>.zip  -d /tmp/play_deploy/
Here we unpack the Jenkins ZIP artifact, to reveal the actual Play ZIP file.
cd /tmp/play_deploy/ unzip -q /tmp/play_deploy/archive/dist/<play_zip_name>-*.zip
Here we've unpacked the SBT built ZIP distribution of the Play libs.
sudo /bin/rm -rf /opt/<play_service_name>/lib
Clean up the final destination for the Play app - assuming you've stopped it already!
sudo /bin/mv /tmp/play_deploy/<play_unzipped_dir>-*/* /opt/<play_service_name>/ "
The result of this should be /opt/<play_service_name>/lib filled with the libraries from your Play application. You will likely have to tweak the names slightly, as the outer ZIP wrapper is related to the name of the Jenkins job, and the inside ZIP name is related to the Play project name.

At this point, you're ready for Puppet to take over.

Method 2 - Deploy from a binaries file store.


It's quite common to use a build system to populate a filesystem based binary store. eg. /mnt/play_binaries/<play_project_name>/<version>/<project_name>.zip

Implementing this is outside the scope of this article, but I'd use the Copy Artifact plugin in Jenkins.

To deploy from this, use wget/rsync or similar to copy the file to your target system in the /tmp/play_deploy/ directory, then, as the previous example

cd /tmp/play_deploy/ 
unzip -q /tmp/play_deploy/archive/dist/<play_zip_name>-*.zip 
sudo /bin/rm -rf /opt/<play_service_name>/lib
sudo /bin/mv /tmp/play_deploy/<play_unzipped_dir>-*/* /opt/<play_service_name>/ "

Again, at this point you should have /opt/<play_service_name>/lib populated with the Play application libraries, and you're ready for Puppet to take over the next step.

Method 3 - Deploy from Maven.


This assumes you've published your binaries to Maven, using SBT or similar, as in this example.
Be aware that SBT 0.11 does not properly generate the Maven metadata, the result of which is asking for the "latest" version will not work, but asking for a specific version will. See here for a solution, particularly the sbt-aether-deploy plugin.

If you chose to use Maven for deployment, be sure you know the difference between Release and Snapshot versions, so you're able to guarantee the versions of your code, and all dependencies in each production release.

To deploy, you'll need a custom maven XML file like this one, on the target system. - call it <play_app_name>.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.fake</groupId>
  <artifactId>fake</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>fake</name>
  <url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<excludeTransitive>false</excludeTransitive>
<outputDirectory>/tmp/play_deploy/play_service_name/lib/</outputDirectory>
</properties>
<dependencies>
<dependency>
<groupId>com.mycompany</groupId>
<artifactId>my_play_artifact name</artifactId>
<version>version_to_deploy</version>
<type>jar</type>
</dependency>
</dependencies>
</project>


Assuming you've got your local Maven config set up correctly (ie. your ~/.m2/settings.xml as in this example ) so Maven on your target system can see your internal Maven repo, you can now use Maven to request your libraries.


cd /tmp/play_deploy/ 
mvn -U    dependency:copy-dependencies -f <play_app_name>.xml


This should result in the version you selected in the XML file being deployed to the path you specified in the XML file, along with all dependencies required to run it.

Now, as before, move it to the correct place.

sudo /bin/rm -rf /opt/<play_service_name>/lib
sudo /bin/mv /tmp/play_deploy/<play_service_name>-*/* /opt/<play_service_name>/ "

 Now the code is deployed, you're ready for Puppet to take over.

Method 4 - System Packaging


Systems Admins already have tools to automatically check/deploy/upgrade the software on the systems under their control. RPM and deb are the most popular formats for Linux

Converting Play applications (typically distributed in ZIP format) into these formats is a little more involved, but worth it for larger system estates. I'll describe the process, but the specifics are a little outside a blog post. Perhaps I'll cover it in the future.

Essentially, the idea is as part of your build process, to take the ZIP file, convert it into a RPM/deb, then upload it to your repository. Then it will be available to be installed, using native OS commands, as if you were installing a new version of Apache HTTPD.

This project aims to bring this support natively to SBT, but until then, you're going to have to script this yourself, with the help of some Jenkins plugins like this.


Next time I'll cover the Puppet module I use to automate the configuration and integration of the Play app into the Linux environment.


Friday, 25 January 2013

Performance Analysis - Part 3: Develop performance tests



Last week, we looked at how to choose a test metric, to measure the state of a problem. Just as important, this will allow us to measure the difference our changes are making to the system.

The conclusion was, if we can sustain the load seen at peak ( 20 home page requests per second ), and keep the full request time under 10 seconds, then our problem has now improved.If the response time is worse, or very irregular, then we're moving in the wrong direction.
Now, let's look at how to turn our idea for a test into reality.

Basic performance tests with Jmeter.


We're going to use Jmeter, because it's very easy to get started with. It's not suitable for everything - for example, I'd recommend against using it for microbenchmarking of application components - but it's powerful, flexible, scalable, and easy to use.

Getting started & Thread Groups.


Make sure you have a recent version of Java installed, then install Jmeter from here.
Now go and install the extra Jmeter plugins here. These externally managed plugins add a great set of additional tools to Jmeter, including some much needed better graphing.




Then, start Jmeter,you should see Jmeter's main window.










The first thing we need to add is a Thread Group, this controls a few global parameters about how the test will run. eg.


  • how many threads (simulated users, in this case).


  • how long the test takes to ramp up to full user load.
  • how many times the thread loop runs.
  • how long the thread runs for.


Right click on "Test Plan", choose "Add" --> "Threads" --> "Thread Group"

From our previous investigations, we obtained a few numbers. The first important one is:
"100 simultaneous active clients" - this is our number of active threads.

Also select "loop count" = forever, and set the schedule to 600 seconds (10 mins) - don't worry, it will ignore the Start/End time. Unless you know otherwise, always set a reasonable length of time for your tests. 1 minute is often not enough to fully warm-up and saturate a system with load.

Adding Requests Defaults


The first thing we need to add is some global HTTP configuration options.
Right click on the "Thread Group" you added previously, and choose "Add" --> "Config Element" -> "HTTP Cache Manager", repeat for "HTTP Cookie Manager", "HTTP Request Defaults" and "HTTP Header Manager".

In "HTTP Cache Manager", set:

  • "Clear cache each iteration". We want each test request to be uncached, to simulate lots of new users going to the site.


In "HTTP Cookie Manager", set

  • "Clear cookies each iteration". Again, we want each new simulated user to have never been to the site before, to simulate maximum load.


In "HTTP Request Defaults", set

  • Any headers that your users browser typically set. eg. "Accept-Encoding", "User-Agent", "Accept-Language". Basically anything to trick the test system into thinking a real browser is connecting. 


In "HTTP Request Defaults", set:

  • Web Server name or IP  -  your test system address. Don't use your "live" system, unless you have absolutely no other choice!
  • "Port number" -  usually 80
  • "Retrieve all embedded resources from HTML files" - true - Jmeter does not understand Javascript. This will parse any HTML it finds and retrieve sub-resources, but will not execute javascript to find any resources. If javascript drives web-server requests on your site, you'll need to add those requests manually.
  • "Embedded URLs must match" - a regex for your systems web URLs. This is here to prevent you from accidentally load testing any external web hosted files. Likely those external providers will not be happy with you if you do any unauthorised load tests.


Adding the Home Page Request


Now, we can add our home page request.

Going back to last week's post again, remember we needed to simulate a load of 20 home page requests per second, and keep the response time under 10 seconds.

Add "Sampler"--> HTTP Request":
name it "Home Page"
set the URL to "/" - it will inherit the default values we set earlier, including the server URL to use.





We wanted to simulate 20 home page requests per second, so we need to restrict the test to do that, otherwise it will loop as fast as it can.

Click on the "Home Page" item you just added, then add "Timer" --> "Constant Throughput Timer", then set:

  • Target throughput (per min) - 1200  (20 per second)
  • Calculate throughput - "All active threads". (this has some disadvantages, but the better solutions are outside the scope of a "getting started" guide!)




Adding analysis plugins.


One of Jmeter's great strengths is it's data-analysis and test-debugging tools, we're going to add the minimum required to our test.

Now add the following "Listeners"


  • "View Results Tree" - Used for a quick view on the data you're sending and receiving. Great for a quick sanity check of your test.
  • "Aggregate Report" - Aggregated stats about your test run.
  • "jp@gc Response Times over Time" - A view of how your response times change over time.


First test run!


If you haven't already done so, save your test.

Now we need to select a place to run our test from. Running a performance test from a slow laptop, connected via WiFi, to a contended DSL line, is not a good idea.

Choose somewhere with suitable CPU power, reliable network links, close to your target environment. The test must be repeatable, without having to worry about contention from other systems or processes.

Run the test.


Analysing test results.


You should keep detailed notes about each test run, noting the state of the environment, and any changes you've made to it. Version your tests, as well as keeping records of the result. The ability to look back in a spreadsheet, to look at results from a similar test weeks ago, is very valuable.

Here's a quick look at an example result.


  • Results Tree - each result should come up in green, showing a 2xx response from the server. By default, Jmeter treats non 2xx response codes as a failure. This view allows you to do a quick debug of your test requests and responses to ensure it's behaving as you expect. Once the test works correctly, disable the results tree, as it slows down the test.





  • Aggregate Report - statistics about each request type that was made. (Taken from an unrelated test).













  • Response Times over Time - self explanatory. (Taken from an unrelated test) 












Automation.


For the most reliable results, you shouldn't run the tests in interactive mode, with live displays and analysis. It skews the results slightly, as your computer is wasting CPU cycles processing the displays.

Instead, using command line mode, disable all the analysis tools, and use the Simple Data Writer to write a file to disk with the results, for later analysis. All the analysis plugins that I've used here accept data loaded back from one of these results files, so you can analyse and re-analyse results files at your leisure.

I'll cover the techniques to run Jmeter tests in an automated fashion, with automated analysis, in a future blog post.

Conclusion


We've now got a repeatable way to measure our chosen performance metric(s).
In our imaginary case, we're going to pretend that the test we ran showed that Home page load speed averaged 15 seconds across our 10 min test, and that the Response Time over Time started out at approximately 5 seconds, and increased to 15 seconds within a few minutes of test start.

In this example, this shows our test is part of the cause, and not just an effect. If we had run the test, and found that response times were acceptable, then our test would have just been measuring the effect.

Measuring only the effect is not a bad thing, as it gives us acceptance criteria for fixing the problem, but it means we would have to look for the cause later on, after we've instrumented the environment.


Next time, I'll cover how to approach instrumenting the environment, to collect the data we need to know what's happening to the environment.





Saturday, 19 January 2013

Performance Analysis - Part 2: Choosing a test metric


In the previous post, I discussed how to start the process of gathering all the information together that you need to understand the system you're troubleshooting.

Now, let's try and understand how to select a metric for measuring the state of the problem you need to troubleshoot.

Describing the Problem:

Unfortunately, the first contact with a new problem usually starts with the wonderfully unhelpful statement. "My XYZ site/widget/job is slow"....

SysAdmins/Support people will recognise this as similar to "My computer doesn't work". It doesn't tell you much about the problem!


So the point of this exercise is to:

  1. Clarify the problem
  2. Verify the problem and record it in progress
  3. Narrow down exactly how to measure the problem



Clarify the problem


If we take the very simple LAMP stack environment from my previous post as an example....


On closer questioning, the clients report that the web-site home page starts being very slow, all of a sudden, and that this happens at random intervals, and that the site quickly becomes unusable.

Verify the problem and record it in progress

In this case,  it's time to break out tcpdump and wireshark. I'll delve into the details about how to use them to track changes in response time in another post, this post on another blog seems relevant.

Interrogating our imaginary packet capture from in front of the loadbalancer, we see a large number of HTTP requests incoming. They consist of a few static files, and a dynamic request. The response time for the static files seem a little changeable, but the dynamic requests seem hugely variable in response times. Eventually the site dies. Home page requests increase to approximately 20 per second at peak.

You could also verify this by looking at load balancer metrics (if it's smart enough), or by instrumenting the Apache servers to record request time, and analysing the logs (only if you're sure the load-balancer is not part of the problem!). I'll describe how to start instrumenting Apache in a later post.

Narrow down exactly how to measure the problem

The accepted standard for the time a user will wait before getting bored, is about 10 seconds.

If we can sustain the load seen at peak ( 20 home page requests per second ), and keep the full request time under 10 seconds, then our problem has now improved.
If the response time is worse, or very irregular, then we're moving in the wrong direction.

We now have a metric to measure the problem by.

Cause or Effect?

The metric we have chosen may not represent the actual problem, it may just be an effect of the problem. At this stage, this doesn't matter, we're just after a metric that measures the effect.


Next time:

 In the next post I'll look at how to use Jmeter to try and replicate the problem, and test for improvements.