Test Driven Devops

Illustration: 500 - Internal server errorI like to apply Test Driven Development to my sysadmin work. For example, every time I add a new redirect to a web server configuration I want to make sure I haven’t broken anything else. Further, I want my SSL configurations proactively checked daily for any possible error. I use Ruby RSpec and write tests like these:

describe 'My app' do
  context 'www.myapp.com' do
    it { should be_up }
    it { should have_a_valid_cert }
  end

  it 'serves the www.myapp.com page without redirecting' do
    expect('http://www.myapp.com/about').to be_status 200
  end

  it 'only serves via www' do
    expect('http://myapp.com').to redirect_permanently_to 'http://www.myapp.com/'
  end

  it 'forces visitors to use https' do
    expect('myapp.com').to enforce_https_everywhere
  end
end

When I want to make a configuration change, I first write a test for the desired outcome. Naturally, it fails while the old tests pass. I then work on the config change, re-running all the tests as I go, and am finished when they all pass. I also run these automatically from a cron job to get pro-active notification of new problems.

The phrases such as have_a_valid_cert are custom RSpec matchers; they’re added into the RSpec environment by this open source library on Github.  I’m also working on an app to run these specs in the cloud.

See also

The Hidden Dangers of Beautiful Themes

A Tale of Seduction and Betrayal

Some of the best-designed and officially featured WordPress themes aren’t built to handle mid-volume traffic. Just one incoming link from a semi-popular page can take your server down.

A New Blog for a Web App

The featured “News” theme that crashed my server

Everything started out smoothly. Like thousands of developers do every day, I set up a new WordPress installation to support a new web app I’m getting online. I’ve got a Linode 1536 which is perfect for this ?. It has gigs of free disk space and 800MB of unused RAM just for cache. And these virtual servers are fast. Mine hosts about 15 Rails and WordPress apps and the system load never gets up to 1.0.

For me, the hardest (and most fun) part of setting up a new blog is choosing the theme. I didn’t want to waste time, so I looked only at WordPress’s one-page “featured” themes list and chose News — a conservative theme with personality.

I wrote a few posts to get started, posted a link to one on Reddit, and went to sleep.

I Woke Up but My Server Wasn’t There

The network traffic graph gives a dramatic view of the server crash

At around 9am, I was in for a shock: no web pages were loading and it took 2 minutes to simply ssh in. There were about a million Apache processes running and the system was out of memory. Checking on Reddit, I saw that the post was getting a good amount of traffic: it was at the top of the r/programming subreddit. It had a couple of hundred up-votes; a lot, but certainly not an apocalypse. So this was odd. I also saw that someone reposted it to the Hacker News. (Nice!) Except that the posts were only noting that the site was offline (Not nice.)

Discovering the Culprit: the Theme

A helpful Hacker News reader suggested “Caching is your friend”. That was my first thought as well. WordPress by default is just a PHP app, doing a lot of repetitious work with every request. But that didn’t feel right. This was a brand new blog, after all, and the requests were all for one simple page.

Check your web server configuration with ruby test cases

I’m releasing a new open source library, HTTP-Assertions, that I use behind the scenes at OregonLaws.org.  It helps me keep the server running smoothly and make sure that my changes to the config files haven’t introduced new bugs.

It introduces these new assert methods to the Rails testing environment:

  assert_200
  assert_forbidden
  assert_temp_redirect_to
  assert_perm_redirect_to

Here’s an example:

  assert_200 'https://www.oregonlaws.org/ors/161.360'

See the README on github for more details.