Here's the blurb about the talk and the really knowledgeable speaker:
"Continuous Deployment takes continuous integration one step further, where every commit goes live to production servers. When this process is described it is frequently met with skepticism around site reliability and the ability to scale a business this way, but it works, it scales (with challenges) and it is embraced by the entire organization. IMVU is a leader in Continuous Deployment, with over 5 years of experience scaling this process to support a technical staff of 50 and a business of more than $40 million in annual revenue. Brett G. Durrett, Vice President of Engineering & Operations for IMVU explains the basic mechanics of Continuous Deployment and discusses the value it creates for the entire company. Specific topics that will be covered: Attendees will understand that releasing to customers 20+ times per day is possible and that it does scale from individual developers to large companies. In addition, they will understand how they can make Continuous Deployment successful at their company, from both a technology and cultural standpoint.
Brett G. Durrett has over 20 years experience leading development of software and systems ranging from large-scale Internet services to video games. He serves as VP of Engineering at IMVU where he leads the engineering and technical operations teams and was responsible for the operations infrastructure that successfully scaled from two machines to over 700 servers. Prior to IMVU, Brett served as the Director of Engineering, VP of Operations and General Manager for the virtual world at There.com. Brett was also co-founder and CEO of Asylum Entertainment, a game development company."
You can watch the talk (in two parts) and see the slides above. I'm pretty much sold on what Brett preaches and am thinking of how to implement continuous deployment in my current projects. He says that having little code and process in place puts you at an advantage, though I'm still wondering how to put in the right infrastructure to have all the tests and deployment run as smoothly and automatically as they do (and how much to prioritize this process infrastructure work around other initial start-up goals).
My notes on the talk are below. Overall, I learned a lot and very much enjoyed hearing Brett speak.
Their process:
- develop a feature increment
- verify on buildbot
- commit code to live in production immediately for some set of customers
- whole process takes 15 min, release about 50 times per day
- no staging cluster
- no QA review
- most companies develop, release, then pray for customers
- now, smart companies develop, release, learn, iterate
- minimize total time through build, measure, learn cycle
- release overhead reduces opportunity to iterate
- way easier to find regressions/bugs in small batches of commits
- fast response times for business opportunities
- more turns at bat
- book: Principles of Product Development Flow (reducing batch size, lean product development); reducing batch size reduces cycle time, reduces variability in flow, accelerates feedback, reduces risk, reduces overhead; large batches reduce efficiency, inherently lower motivation and energy, cause exponential cost and schedule growth, lead to even larger batches; the entire batch is limited by its worst component
Work process:
- local tests pass, engineer commits code
- lots and lots of tests run
- all tests pass [if no, revert commits]
- code deployed to % of servers
- metrics good [if no, rollback]
- code deployed to all servers
- metrics still good [if no, rollback]
- win
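To make the gates above concrete for myself, here is a minimal Python sketch of the flow. It is my own illustration, not IMVU's code: the function name `deploy_commit`, the callback parameters, and the 10% first slice are all assumptions, and the real checks (test suite, metrics, rollback) are passed in as plain callables.

```python
from typing import Callable

def deploy_commit(
    tests_pass: Callable[[], bool],
    deploy: Callable[[int], None],
    metrics_good: Callable[[], bool],
    rollback: Callable[[], None],
    first_slice_percent: int = 10,   # assumed value; the talk only says "% of servers"
) -> str:
    """Walk one commit through the gates and return its final state."""
    if not tests_pass():
        return "reverted"            # failing tests: the commit is reverted
    for percent in (first_slice_percent, 100):
        deploy(percent)              # first a slice of servers, then all of them
        if not metrics_good():
            rollback()               # bad metrics at either stage: roll back
            return "rolled_back"
    return "live"                    # "win"
```

Calling `deploy_commit(lambda: True, print, lambda: True, lambda: None)` prints the two rollout stages and returns `"live"`.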
amount of time you need to run test depends on volume of people going through funnel
all work done on trunk (no work on branches)
- avoids merge conflicts
- all code gets validated in production immediately, so it is tested now rather than later
- at bottom sees actual PHP test files and their status (time to complete, running status, etc.)
- a tag includes multiple PHP test files
- tests run before checkin on local sandbox
- push for being test-driven but let people work how they want to work
- each person responsible for writing tests for their own code
- local sandbox test suite running through a web browser
- checkboxes: stop after last test, pause after failure, run tests in random order, only run selected tests
- want to make testing as unburdensome as possible
use selenium
continuous integration: they use buildbot; others use hudson, jenkins, atlassian bamboo
build servers
- good screenshot in slides of buildbot view
- each box represents a server
- splitting all the tests across multiple servers turns an 8-hour build into an 8-minute build
- each server running many tests; they have 40K tests running through test suite
- having good tests allows new people to start working and new experiments to happen quickly
- unit tests of code
- user workflow tests of site UI
- if code fails in build server, email goes out and immediately the engineer's supposed to revert the code so others can continue to use build server
- saves and emails output of the test failure
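The talk didn't say how tests get divided between build servers, so the following is only a sketch of one common approach: sort tests by measured duration and always hand the next one to the least-loaded server. The names `shard_tests` and `wall_clock` and the timings are made up for illustration.

```python
import heapq

def shard_tests(durations: dict[str, float], servers: int) -> list[list[str]]:
    """Greedy split: longest tests first, each to the least-loaded server."""
    heap = [(0.0, i) for i in range(servers)]          # (total seconds, server index)
    shards: list[list[str]] = [[] for _ in range(servers)]
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)                   # least-loaded server so far
        shards[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    return shards

def wall_clock(durations: dict[str, float], shards: list[list[str]]) -> float:
    """The build takes as long as its slowest server."""
    return max(sum(durations[t] for t in shard) for shard in shards)
```

With hypothetical timings `{"a": 4, "b": 3, "c": 2, "d": 1}` and two servers, the 10 seconds of serial work becomes a 5-second build.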
Deployment:
- code rolled out to cluster
- a bunch of perl and rsync code
- symlinks on site
- keep multiple copies of code
- process of rolling forward and backward is just changing symlink
- monitors metrics
- system performance (web services, disk space, DNS, cron, API availability)
- business performance (various critical actions/functions, graphs, revenue, registrations)
- use nagios for system and business metrics
- if metrics bad, do rollback on cluster (changes symlinks back to previous release, blocks further commits, sends email)
- server push status web page to diagnose rollback and which metrics killed the push
- one unfortunate thing in the system is false positives due to real variability in business
- once metrics good, goes out to entire cluster
- most wait periods: a couple minutes
- something it's not very good at: catching very small changes that hurt
- was manual for a while, hacked together
- only recently got good test coverage of deployment system (some not even in repository)
- don't change deployment system that often
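The symlink trick is simple enough to sketch. IMVU's version is perl and rsync; this is just my Python illustration of the idea, assuming a POSIX filesystem, with made-up release directory names and a hypothetical helper `point_current_at`.

```python
import os
import tempfile

def point_current_at(release_dir: str, current_link: str) -> None:
    """Repoint the 'current' symlink at a release directory in one step."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)              # clear a leftover from an interrupted push
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)   # rename is atomic on POSIX filesystems

def demo() -> list[str]:
    """Deploy release-1, roll forward to release-2, then roll back."""
    root = tempfile.mkdtemp()
    releases = []
    for name in ("release-1", "release-2"):
        path = os.path.join(root, name)
        os.mkdir(path)                   # several copies of the code kept side by side
        releases.append(path)
    current = os.path.join(root, "current")
    seen = []
    for target in (releases[0], releases[1], releases[0]):
        point_current_at(target, current)
        seen.append(os.path.basename(os.readlink(current)))
    return seen
```

Because both releases stay on disk, rolling back is the same operation as rolling forward, just with the older directory as the target.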
everyone emails changes to the change list (basically everyone in company) with before and after state and people can catch problems
they have one monolithic code base
don't have anything that ensures they have test code coverage automatically
Getting Started (story):
- there were no customers
- he came in for operational role
- engineers wrote code and SSH'd in to cluster
- no auditing, no monitoring
- would see PHP syntax errors on homepage
- only 30 customers at that time so didn't matter
- set the culture of getting stuff out there
- wrote a nagios check for "are we rendering HTML out to the customer?"
- if you're writing new code, it should have some coverage (functional easiest at first)
- commit to making forward progress
- start w/ sandbox
- just push
- ideal time for failures
- start w/ production
- automate deploys. first automate the push. then automate QA.
- build confidence
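The "are we rendering HTML?" check is a nice first step because it is so small. Here is a rough Python sketch of what such a check could look like. It is not the check Brett wrote; the only thing I'm relying on is the Nagios plugin convention that exit code 0 means OK and 2 means CRITICAL. The function names and the "contains `<html` and `</html>`" rule are my assumptions.

```python
import urllib.request

OK, CRITICAL = 0, 2   # Nagios plugin exit codes

def check_html(body: str) -> tuple[int, str]:
    """Treat the page as healthy only if it looks like a complete HTML document."""
    text = body.lower()
    if "<html" in text and "</html>" in text:
        return OK, "OK - page renders HTML"
    return CRITICAL, "CRITICAL - response does not look like a full HTML page"

def main(url: str) -> int:
    """Fetch the page, print one status line, and return the exit code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:               # covers URLError, HTTPError and timeouts
        print(f"CRITICAL - {exc}")
        return CRITICAL
    code, message = check_html(body)
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(url))` so Nagios sees the status code.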
if new code breaks something old, must write test to catch that
expect some hurdles:
- you will have cluster outages
- you will spend engineering time on deployment system
- have culture where failures are looked at as opportunities
- how do we get excited about never letting this happen again
- if have blame-searching culture, will have more challenges
- buildbot would go red, and everyone would be blocked
- when build time 20-30 minutes, bad news
- problem with intermittent tests
- build isolation [but not the solution; they didn't need to build this because they could get away with faster test runs, buying hardware and virtualization, sorting tests by speed, and dependency injection (instead of calling the real DB, just supplying the data it would return); they also built a hypothesis builder, which is like build isolation: you tag code to run on the hypothesis builder, which is separate from the main buildbot and doesn't block anyone if it fails]
- added a test metrics system that keeps track of success rate and speed (a lot of builds were blocked on slow tests)
- got build times down to 8 minutes
- when builds were over 25 minutes, it was a huge cultural issue
- disable or ignore the test
- third-party providers
- running tests around time and time spans is much more challenging than normal tests (DST, leap years, etc.)
- state dependency across tests (overnight, keep running tests in random orders until they become red, and then in morning you see which tests are intermittent and can investigate)
- they run about 40K tests now
- even with 5 9's of reliability, you get many failures
- move them from having to fix them when they happen to fixing them on a schedule
- if buildbot gets a test that runs green once and then red another time, it will mark it as intermittent, start an issue in bug tracker, and allow the build to go through
- catching issues that fail slow (SQL selects from growing tables)
- critical areas cause hard lock-ups (MySQL, memcached)
- lack of test coverage of older code: not an issue if you start with test coverage
- outsourcing (different hours, culture, branching, slower integration)
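The "5 9's" point is worth working through. Under my own assumptions that each of the 40K tests passes 99.999% of the time when nothing is actually wrong, and that spurious failures are independent, the chance that a build goes red for no real reason comes out at about one in three:

```python
def chance_of_false_red(tests: int, per_test_reliability: float) -> float:
    """Probability that at least one healthy test fails spuriously in a build,
    assuming the failures are independent of each other."""
    return 1.0 - per_test_reliability ** tests

p = chance_of_false_red(40_000, 0.99999)
print(f"{p:.0%} of builds would go red spuriously")
```

That is why moving intermittent failures onto a schedule, rather than letting each one block the build, matters so much at this scale.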
added query killer (issues kill statements on long queries; better to have code die than DB to be overloaded and take down everybody)
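I don't know how IMVU's query killer is built, but the core decision could be sketched like this in Python. It takes rows shaped like the output of MySQL's `SHOW PROCESSLIST` (the `Id`, `Command` and `Time` columns) and returns `KILL QUERY` statements for anything that has run too long. The function name and the 30-second threshold are assumptions of mine.

```python
def kill_statements(processlist: list[dict], max_seconds: int = 30) -> list[str]:
    """Build KILL QUERY statements for queries running longer than max_seconds.

    Each row is a dict with the Id, Command and Time columns of SHOW PROCESSLIST.
    Idle connections (Command == "Sleep") are left alone.
    """
    return [
        f"KILL QUERY {int(row['Id'])}"
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]
```

`KILL QUERY` ends the statement but keeps the connection, so the calling code sees an error instead of the whole database slowing down.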
schema changes on large tables (they use mysql):
- create a new table
- do copy on read
- have background process later migrate the rest of the data
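Here is how I picture the three steps, as a small Python sketch. They use MySQL; I'm using the standard library's `sqlite3` purely so the example runs on its own. The table names `users_old` and `users_new`, the added `email` column, and both function names are hypothetical.

```python
import sqlite3

def read_user(db: sqlite3.Connection, user_id: int):
    """Read from the new table; on a miss, copy the row over from the old one."""
    row = db.execute(
        "SELECT id, name, email FROM users_new WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        old = db.execute(
            "SELECT id, name FROM users_old WHERE id = ?", (user_id,)
        ).fetchone()
        if old is None:
            return None                                  # unknown in both tables
        db.execute(
            "INSERT INTO users_new (id, name, email) VALUES (?, ?, NULL)", old
        )                                                # copy on read
        row = (old[0], old[1], None)
    return row

def migrate_batch(db: sqlite3.Connection, limit: int = 1000) -> None:
    """Background step: copy rows that no read has touched yet, a batch at a time."""
    db.execute(
        "INSERT INTO users_new (id, name, email) "
        "SELECT id, name, NULL FROM users_old "
        "WHERE id NOT IN (SELECT id FROM users_new) LIMIT ?",
        (limit,),
    )
```

The large table is never altered in place: hot rows move as they are read, and the background job picks up the rest in small batches.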
hard to work with outsourcers who build over several days (impossible to integrate)
build system itself is critical business function; keep metrics on build system (web dashboard of build process)
integration with A/B testing inside the code (nice slide with pseudocode)
- name the experiment
- specify initial rollout % or amount of users
- specify customer branches with percentage weightings of what % should see enhanced versus non-enhanced (e.g., 50% A/B split)
- helper function that returns which branch a certain customer should see (enhanced or not) and, if the customer is not yet assigned, permanently assigns one [so customer always gets same experience]
- simple if statement that splits between if user should see test feature or not
- web page listing all experiments (available to everyone in company)
- to user % (QA and admin only, 0%, 10%, etc.)
- closed on status (they have a page that lists experiments that were closed but the code still exists; this allows easy housekeeping to remove unused code after a while)
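I didn't copy down the pseudocode from the slide, so this Python sketch is my own reconstruction of the pieces listed above, not IMVU's code. The class name `Experiment`, the hash-based bucketing, and the in-memory `assignments` dict (standing in for a persistent store) are all assumptions.

```python
import hashlib

class Experiment:
    def __init__(self, name: str, rollout_percent: int, branches: dict[str, int]):
        self.name = name
        self.rollout_percent = rollout_percent   # share of customers enrolled at all
        self.branches = branches                 # e.g. {"enhanced": 50, "control": 50}
        self.assignments: dict[int, str] = {}    # stand-in for a persistent store

    def _bucket(self, customer_id: int, salt: str, modulus: int) -> int:
        """Stable pseudo-random number in [0, modulus) for this customer."""
        key = f"{self.name}:{salt}:{customer_id}".encode()
        return int(hashlib.sha256(key).hexdigest(), 16) % modulus

    def branch_for(self, customer_id: int):
        """Return the customer's branch, assigning it once and remembering it."""
        if customer_id in self.assignments:
            return self.assignments[customer_id]
        if self._bucket(customer_id, "rollout", 100) >= self.rollout_percent:
            return None                          # not enrolled (yet)
        point = self._bucket(customer_id, "branch", sum(self.branches.values()))
        for branch, weight in self.branches.items():
            if point < weight:
                break                            # point falls inside this branch's share
            point -= weight
        self.assignments[customer_id] = branch
        return branch
```

The call site is then the simple if statement from the talk: `if exp.branch_for(customer_id) == "enhanced":` show the new feature, otherwise show the old one.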
sprints:
- planned sprint schedule usually not met (outstanding issues, incomplete features, tech review, refactoring)
- when releases happen every 15 minutes, "planned sprint ends" can be arbitrary
- changed to just say that the sprint ends when the work is done (but still understand overage reasons)
- first day on job, engineer pushes out to live customers immediately
- makes people feel empowered
- hack-week: you can build anything and company provides food and drink
- if you're convinced something's important for customers, just build it; you're allowed to release it to 1% of customers without approval
