Notes on Brett Durrett at LeanLA Talk on Continuous Deployment

1/9/2012


Another awesome talk by the guys at LeanLA and IMVU!

Here's the blurb about the talk and the really knowledgeable speaker:

"Continuous Deployment takes continuous integration one step further, where every commit goes live to production servers. When this process is described it is frequently met with skepticism around site reliability and the ability to scale a business this way, but it works, it scales (with challenges) and it is embraced by the entire organization. IMVU is a leader in Continuous Deployment, with over 5 years of experience scaling this process to support a technical staff of 50 and a business of more than $40 million in annual revenue. Brett G. Durrett, Vice President of Engineering & Operations for IMVU explains the basic mechanics of Continuous Deployment and discusses the value it creates for the entire company. Specific topics that will be covered: Attendees will understand that releasing to customers 20+ times per day is possible and that it does scale from individual developers to large companies. In addition, they will understand how they can make Continuous Deployment successful at their company, from both a technology and cultural standpoint.

Brett G. Durrett has over 20 years experience leading development of software and systems ranging from large-scale Internet services to video games. He serves as VP of Engineering at IMVU where he leads the engineering and technical operations teams and was responsible for the operations infrastructure that successfully scaled from two machines to over 700 servers. Prior to IMVU, Brett served as the Director of Engineering, VP of Operations and General Manager for the virtual world at There.com. Brett was also co-founder and CEO of Asylum Entertainment, a game development company."

You can watch the talk (in two parts) and see the slides above. I'm pretty much sold on what Brett preaches and am thinking of how to implement continuous deployment in my current projects. He says that having little code and process in place puts you at an advantage, though I'm still wondering how to put in the right infrastructure to have all the tests and deployment run as smoothly and automatically as they do (and how much to prioritize this process infrastructure work around other initial start-up goals).

My notes on the talk are below. Overall, I learned a lot and very much enjoyed hearing Brett speak.

Their process:

develop a feature increment
verify on buildbot
commit code to live in production immediately for some set of customers
whole process takes 15 min, release about 50 times per day
no staging cluster
no QA review

Why would you do something like that?

most companies develop, release, then pray for customers
now, smart companies develop, release, learn, iterate
minimize total time through build, measure, learn cycle

Why continuous deployment is good:

release overhead reduces opportunity to iterate
way easier to find regressions/bugs in small batches of commits
fast response times for business opportunities
more turns at bat
book: Principles of Product Development Flow (reducing batch size, lean product development); reducing batch size reduces cycle time, reduces variability in flow, accelerates feedback, reduces risk, reduces overhead; large batches reduce efficiency, inherently lower motivation and energy, cause exponential cost and schedule growth, lead to even larger batches; the entire batch is limited by its worst component

Work process:

local tests pass, engineer commits code
lots and lots of tests run
all tests pass [if no, revert commits]
code deployed to % of servers
metrics good [if no, rollback]
code deployed to all servers
metrics still good [if no, rollback]
win
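
The steps above can be sketched as a small driver function. This is only an illustration of the flow, not IMVU's actual tooling (which the talk describes as Perl and rsync). The callback names `run_tests`, `deploy`, `metrics_ok` and `rollback`, and the 10% first-stage fraction, are assumptions made for the example.

```python
from typing import Callable


def continuous_deploy(
    run_tests: Callable[[], bool],
    deploy: Callable[[float], None],
    metrics_ok: Callable[[], bool],
    rollback: Callable[[], None],
    first_stage_fraction: float = 0.1,  # assumed; the talk only says "% of servers"
) -> str:
    """Walk one commit through tests, a partial deploy, and a full deploy."""
    if not run_tests():
        return "reverted"              # tests failed: the commit gets reverted
    deploy(first_stage_fraction)       # push to a fraction of the cluster
    if not metrics_ok():
        rollback()
        return "rolled back (partial)"
    deploy(1.0)                        # push to every server
    if not metrics_ok():
        rollback()
        return "rolled back (full)"
    return "win"
```

Passing the four steps in as callbacks keeps the flow itself testable without a real cluster.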

the amount of time you need to run a test in production depends on the volume of people going through the funnel

all work done on trunk (no work on branches)

avoids merge conflicts
all code gets validated in production immediately, rather than waiting to be tested later
the bottom of the test view shows the actual PHP test files and their status (time to complete, running status, etc.)
a tag includes multiple PHP test files
tests run before checkin on local sandbox
push for being test-driven but let people work how they want to work
each person responsible for writing tests for their own code
local sandbox test suite running through a web browser
checkboxes: stop after last test, pause after failure, run tests in random order, only run selected tests
want to make testing as unburdensome as possible

great slide in presentation with sample output of "RunTests" test view which allows filtering tags, turning test on/off, seeing tests that pass, fail, run, skip, wait, etc.

use selenium

continuous integration: they use buildbot, others use hudson, jenkins, atlassian bamboo

build servers

good screenshot in slides of buildbot view
each box represents a server
splitting all the tests across multiple servers turns an 8-hour build into an 8-minute build
each server running many tests; they have 40K tests running through test suite
having good tests allows new people to start working and new experiments to happen quickly
unit tests of code
user workflow tests of site UI
if code fails in build server, email goes out and immediately the engineer's supposed to revert the code so others can continue to use build server
saves and emails output of the test failure
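
Splitting a test suite across build servers can be sketched as a scheduling problem. The talk did not say how IMVU assigns tests to servers, so the greedy "longest test first, onto the least-loaded server" rule below is an assumption, as are the function names and the sample durations.

```python
import heapq


def shard_tests(durations: dict[str, float], n_servers: int) -> list[list[str]]:
    """Assign tests to servers, longest first, always to the least-loaded server."""
    shards: list[list[str]] = [[] for _ in range(n_servers)]
    # Heap of (total seconds assigned so far, server index).
    loads = [(0.0, i) for i in range(n_servers)]
    heapq.heapify(loads)
    by_length = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in by_length:
        load, idx = heapq.heappop(loads)
        shards[idx].append(name)
        heapq.heappush(loads, (load + seconds, idx))
    return shards


def wall_clock(durations: dict[str, float], shards: list[list[str]]) -> float:
    """The whole build takes as long as its slowest shard."""
    return max(sum(durations[t] for t in shard) for shard in shards)


# Hypothetical test durations in seconds.
durations = {"a": 8, "b": 4, "c": 4, "d": 2, "e": 2}
shards = shard_tests(durations, n_servers=2)
print(wall_clock(durations, shards))  # 20 seconds of tests finish in 10
```

Because the build is only as fast as its slowest shard, a single very slow test caps the speedup no matter how many servers are added.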

Deployment:

code rolled out to cluster
a bunch of perl and rsync code
symlinks on site
keep multiple copies of code
process of rolling forward and backward is just changing symlink
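
The symlink approach above can be sketched in a few lines. This is a minimal sketch assuming a POSIX filesystem; the function name `point_current_at`, the `.tmp` suffix and the directory layout are hypothetical, and IMVU's own scripts were described as Perl and rsync.

```python
import os
import tempfile


def point_current_at(release_dir: str, link_path: str) -> None:
    """Atomically repoint the live symlink at a release directory."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)            # clear a leftover from an aborted run
    os.symlink(release_dir, tmp_link)
    # Renaming over the existing symlink is atomic on POSIX, so the web
    # server never sees a missing link.
    os.replace(tmp_link, link_path)


# Demo in a scratch directory: roll forward, then roll back.
root = tempfile.mkdtemp()
old_release = os.path.join(root, "release-1")
new_release = os.path.join(root, "release-2")
os.mkdir(old_release)
os.mkdir(new_release)
current = os.path.join(root, "current")

point_current_at(old_release, current)
point_current_at(new_release, current)   # roll forward
point_current_at(old_release, current)   # roll back
```

Keeping several releases on disk is what makes the rollback cheap: nothing is copied, only the link changes.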

hard part: cluster immune system

monitors metrics
system performance (web services, disk space, DNS, cron, API availability)
business performance (various critical actions/functions, graphs, revenue, registrations)
use nagios for system and business metrics
if metrics bad, do rollback on cluster (changes symlinks back to previous release, blocks further commits, sends email)
server push status web page to diagnose rollback and which metrics killed the push
one unfortunate thing in the system is false positives due to real variability in business
once metrics good, goes out to entire cluster
most wait periods: a couple minutes
something it's not very good at: catching very small changes that hurt
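
A rough way to picture the metrics check is a comparison of post-deploy numbers against a baseline with some tolerance. The talk did not give the real thresholds or logic, so everything here (the 20% tolerance, the metric names, the `failing_metrics` function) is an assumption for illustration.

```python
def failing_metrics(baseline, current, tolerance=0.2, lower_is_better=frozenset()):
    """Return the names of metrics that moved the wrong way by more than `tolerance`."""
    failed = []
    for name, base in baseline.items():
        now = current[name]
        if name in lower_is_better:
            bad = now > base * (1 + tolerance)   # e.g. error rate went up
        else:
            bad = now < base * (1 - tolerance)   # e.g. registrations dropped
        if bad:
            failed.append(name)
    return failed


# Hypothetical numbers: registrations dipped slightly, error rate jumped.
baseline = {"registrations_per_min": 100, "error_rate": 0.01}
after_push = {"registrations_per_min": 95, "error_rate": 0.05}
print(failing_metrics(baseline, after_push, lower_is_better={"error_rate"}))
```

The tolerance is where both weaknesses in the notes come from: set it tight and normal business variability triggers false positives, set it loose and very small harmful changes slip through.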

deployments of deployment system:

was manual for a while, hacked together
only recently got good test coverage of deployment system (some not even in repository)
don't change deployment system that often

aesthetic tests? they don't

everyone emails changes to the change list (basically everyone in company) with before and after state and people can catch problems

they have one monolithic code base

don't have anything that ensures they have test code coverage automatically

Getting Started (story):

there were no customers
he came in for operational role
engineers wrote code and SSH'd in to cluster
no auditing, no monitoring
would see PHP syntax errors on homepage
only 30 customers at that time so didn't matter
set the culture of getting stuff out there
wrote a nagios check for "are we rendering HTML out to the customer?"
if you're writing new code, it should have some coverage (functional easiest at first)
commit to making forward progress
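
A check like "are we rendering HTML out to the customer?" could look roughly like the sketch below. It follows the standard Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL, but the specific tests (looking for `<html` tags and the text "parse error") and the function names are assumptions, not IMVU's actual check.

```python
import sys
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_html(status: int, body: str) -> tuple[int, str]:
    """Decide whether a homepage response looks like a rendered HTML page."""
    if status != 200:
        return CRITICAL, f"CRITICAL - HTTP status {status}"
    lowered = body.lower()
    if "parse error" in lowered:
        # Assumed marker for a PHP syntax error leaking onto the page.
        return CRITICAL, "CRITICAL - PHP parse error in page"
    if "<html" not in lowered or "</html>" not in lowered:
        return CRITICAL, "CRITICAL - response is not a complete HTML page"
    return OK, "OK - homepage rendered HTML"


def main(url: str) -> None:
    """Fetch the page, print one status line, and exit with the Nagios code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", "replace")
            code, message = check_html(resp.status, body)
    except OSError as exc:  # connection failures and HTTP errors
        code, message = CRITICAL, f"CRITICAL - {exc}"
    print(message)
    sys.exit(code)
```

A wrapper script would call `main()` with the homepage URL; Nagios reads the exit code and the printed line.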

new product advice:

start w/ sandbox
just push
ideal time for failures

established product:

start w/ production
automate deploys. first automate the push. then automate QA.
build confidence

new code must have test coverage.

if new code breaks something old, must write test to catch that

expect some hurdles:

you will have cluster outages
you will spend engineering time on deployment system
have culture where failures are looked at as opportunities
how do we get excited about never letting this happen again
if have blame-searching culture, will have more challenges

scaling:

buildbot would go red, and everyone would be blocked
when build time 20-30 minutes, bad news
problem with intermittent tests

solutions:

build isolation [considered, but not the solution they ended up with; they didn't need to build it because they could get away with faster test runs: buying hardware and virtualization, sorting tests by speed, and dependency injection (instead of calling the real DB, the test just supplies the data that would be returned). They also built a "hypothesis builder", which is like build isolation: you tag code to run on the hypothesis builder, which is separate from the main buildbot and doesn't block anyone if it fails]
added a test metrics system that keeps track of success rate and speed (a lot of builds were blocked on slow tests)
got build times down to 8 minutes
when builds were over 25 minutes, it was a huge cultural issue
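
The dependency injection idea mentioned above (hand the code the data a database would have returned, instead of calling the real DB) can be sketched as follows. The class and function names are hypothetical, and IMVU's code is PHP, so this only shows the shape of the technique.

```python
class FakeUserStore:
    """Stands in for the real database in tests; returns canned rows."""

    def __init__(self, rows):
        self._rows = rows

    def get_user(self, user_id):
        return self._rows[user_id]


def greeting(user_id, store):
    """Business logic receives its data source instead of opening a DB connection."""
    user = store.get_user(user_id)
    return f"Welcome back, {user['name']}!"


# In a test, no database is touched, so the test runs in microseconds.
store = FakeUserStore({7: {"name": "Ada"}})
print(greeting(7, store))
```

In production the same `greeting` function would be given an object with the same `get_user` method that actually queries the database.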

flaky tests / intermittent tests have huge costs:

disable or ignore the test
third-party providers
running tests around time and time spans is much more challenging than normal tests (DST, leap years, etc.)
state dependency across tests (overnight, keep running tests in random orders until they become red, and then in morning you see which tests are intermittent and can investigate)
they run about 40K tests now
even with 5 9's of reliability, you get many failures
move from having to fix intermittent tests the moment they fail to fixing them on a schedule
if buildbot gets a test that runs green once and then red another time, it will mark it as intermittent, start an issue in bug tracker, and allow the build to go through
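
The "five 9's" point is easy to check with arithmetic. Assuming each of 40K tests independently passes 99.999% of the time on good code (the independence is my assumption), the chance that at least one fails spuriously in a single build is 1 - 0.99999^40000, which is roughly a third. The second function is a guess at the green-then-red rule described above, not buildbot's real logic.

```python
def chance_of_spurious_red(n_tests: int, per_test_reliability: float) -> float:
    """Probability that at least one reliable-but-not-perfect test fails in a build."""
    return 1.0 - per_test_reliability ** n_tests


def classify_runs(outcomes: list[bool]) -> str:
    """Classify repeated runs of one test against the same code."""
    if all(outcomes):
        return "pass"
    if not any(outcomes):
        return "fail"            # consistently red: a real failure, block the build
    return "intermittent"        # mixed results: file a bug, let the build through


print(round(chance_of_spurious_red(40_000, 0.99999), 2))  # about 0.33
print(classify_runs([True, False]))
```

So even very reliable tests would turn about one build in three red at this scale, which is why treating intermittent failures as scheduled work rather than emergencies matters.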

trickier bits:

catching issues that fail slow (SQL selects from growing tables)
critical areas cause hard lock-ups (MySQL, memcached)
lack of test coverage of older code: not an issue if you start with test coverage
outsourcing (different hours, culture, branching, slower integration)

changing schema requires sign off from tech lead (checking indexes, scalability of changes)

added query killer (issues kill statements on long queries; better to have code die than DB to be overloaded and take down everybody)
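
A query killer can be sketched as a function that reads rows shaped like MySQL's `SHOW PROCESSLIST` output (columns `Id`, `Command`, `Time`, `Info`) and emits `KILL QUERY` statements for anything running too long. The 30-second threshold and the function name are assumptions; the talk did not say how IMVU's version decides what to kill.

```python
def kill_statements(processlist, max_seconds=30):
    """Build KILL QUERY statements for queries running longer than max_seconds."""
    statements = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # skip sleeping connections and other non-query threads
        if row["Time"] > max_seconds:
            # KILL QUERY ends the statement but leaves the connection open.
            statements.append(f"KILL QUERY {row['Id']}")
    return statements


# Hypothetical snapshot of the process list.
rows = [
    {"Id": 1, "Command": "Sleep", "Time": 500, "Info": None},
    {"Id": 2, "Command": "Query", "Time": 45, "Info": "SELECT ..."},
    {"Id": 3, "Command": "Query", "Time": 2, "Info": "SELECT ..."},
]
print(kill_statements(rows))
```

A real version would run on a timer, execute the statements against the server, and log what it killed so engineers can find the offending code.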

schema changes on large tables (they use mysql):

create a new table
do copy on read
have background process later migrate the rest of the data
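
The three steps above can be illustrated with plain dictionaries standing in for the old and new MySQL tables. This is a toy sketch of the copy-on-read idea; the class name, the `convert` callback and the batch size are all hypothetical, and a real migration also has to handle writes and concurrency.

```python
class MigratingTable:
    """Reads go to the new table, copying rows over from the old one on demand."""

    def __init__(self, old, new, convert):
        self.old = old            # stand-in for the existing large table
        self.new = new            # stand-in for the table with the new schema
        self.convert = convert    # transforms an old row into the new shape

    def read(self, key):
        # Copy on read: migrate the row the first time anyone asks for it.
        if key not in self.new and key in self.old:
            self.new[key] = self.convert(self.old[key])
        return self.new.get(key)

    def migrate_batch(self, limit):
        """Background step: move up to `limit` rows nobody has read yet."""
        moved = 0
        for key in list(self.old):
            if moved >= limit:
                break
            if key not in self.new:
                self.new[key] = self.convert(self.old[key])
                moved += 1
        return moved


old = {1: "a", 2: "b", 3: "c"}
new = {}
table = MigratingTable(old, new, convert=str.upper)
table.read(2)             # hot row is migrated immediately
table.migrate_batch(10)   # the rest is moved later in the background
```

Active rows move as soon as they are touched, and the background process can work through the cold rows in small batches without one long blocking change on the large table.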

memcache changes require second set of eyes (hard to test on local sandbox)

hard to work with outsourcers who build over several days (impossible to integrate)

build system itself is critical business function; keep metrics on build system (web dashboard of build process)

integration with A/B testing inside the code (nice slide with pseudocode)

name the experiment
specify initial rollout % or amount of users
specify customer branches with percentage weightings of what % should see enhanced versus non-enhanced (e.g., 50% A/B split)
helper function that returns which branch a certain customer should see (enhanced or not) and if not yet assigned then to permanently assign [so customer always gets same experience]
simple if statement that splits between if user should see test feature or not
web page listing all experiments (available to everyone in company)
to user % (QA and admin only, 0%, 10%, etc.)
closed on status (they have a page that lists experiments that were closed but the code still exists; this allows easy housekeeping to remove unused code after a while)
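
The slide's pseudocode is not reproduced in these notes, so the following is my own sketch of the pieces listed above: a named experiment, a rollout percentage, weighted branches, and a helper that assigns a customer once and keeps returning the same branch. The class, the hashing scheme for rollout, and the in-memory `assignments` dict (a real system would persist it) are all assumptions.

```python
import hashlib
import random


class Experiment:
    def __init__(self, name, rollout_percent, branches, rng=None):
        self.name = name
        self.rollout_percent = rollout_percent   # % of customers in the experiment
        self.branches = branches                 # e.g. {"enhanced": 50, "control": 50}
        self.assignments = {}                    # customer_id -> branch
        self.rng = rng or random.Random()

    def _bucket(self, customer_id):
        """Stable 0-99 bucket per customer, so rollout membership never flips."""
        key = f"{self.name}:{customer_id}".encode()
        return int(hashlib.sha256(key).hexdigest(), 16) % 100

    def branch_for(self, customer_id):
        """Return the customer's branch, assigning it permanently on first call."""
        if self._bucket(customer_id) >= self.rollout_percent:
            return None                          # not in the experiment yet
        if customer_id not in self.assignments:
            names = list(self.branches)
            weights = [self.branches[n] for n in names]
            self.assignments[customer_id] = self.rng.choices(names, weights=weights)[0]
        return self.assignments[customer_id]


def checkout_page(experiment, customer_id):
    # The simple if statement that splits the two experiences.
    if experiment.branch_for(customer_id) == "enhanced":
        return "new checkout"
    return "old checkout"
```

Usage would be something like `checkout_page(Experiment("new_checkout", 10, {"enhanced": 50, "control": 50}), customer_id)`, where the experiment name and percentages are made up for the example.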

per-experiment dashboard to see user groups (male, female, etc.), #s, results (highlighted by desired/undesired colors) and p-values

sprints:

planned sprint schedule usually not met (outstanding issues, incomplete features, tech review, refactoring)
when releases happen every 15 minutes, "planned sprint ends" can be arbitrary
changed to just say that the sprint ends when the work is done (but still understand overage reasons)

IMVU culture:

first day on job, engineer pushes out to live customers immediately
makes people feel empowered
hack-week: you can build anything and company provides food and drink
if you're convinced something's important for customers, you can just build it and you're allowed to release it to 1% of customers without approval
