Max Mednik

Readings and musings

Notes on Brett Durrett at LeanLA Talk on Continuous Deployment

1/9/2012

Continuous Deployment at Lean LA
View more presentations from Brett Durrett
Another awesome talk by the guys at LeanLA and IMVU!

Here's the blurb about the talk and the really knowledgeable speaker:

"Continuous Deployment takes continuous integration one step further, where every commit goes live to production servers. When this process is described it is frequently met with skepticism around site reliability and the ability to scale a business this way, but it works, it scales (with challenges) and it is embraced by the entire organization. IMVU is a leader in Continuous Deployment, with over 5 years of experience scaling this process to support a technical staff of 50 and a business of more than $40 million in annual revenue. Brett G. Durrett, Vice President of Engineering & Operations for IMVU explains the basic mechanics of Continuous Deployment and discusses the value it creates for the entire company. Specific topics that will be covered: Attendees will understand that releasing to customers 20+ times per day is possible and that it does scale from individual developers to large companies. In addition, they will understand how they can make Continuous Deployment successful at their company, from both a technology and cultural standpoint.

Brett G. Durrett has over 20 years experience leading development of software and systems ranging from large-scale Internet services to video games. He serves as VP of Engineering at IMVU where he leads the engineering and technical operations teams and was responsible for the operations infrastructure that successfully scaled from two machines to over 700 servers. Prior to IMVU, Brett served as the Director of Engineering, VP of Operations and General Manager for the virtual world at There.com. Brett was also co-founder and CEO of Asylum Entertainment, a game development company."

You can watch the talk (in two parts) and see the slides above. I'm pretty much sold on what Brett preaches and am thinking of how to implement continuous deployment in my current projects. He says that having little code and process in place puts you at an advantage, though I'm still wondering how to put in the right infrastructure to have all the tests and deployment run as smoothly and automatically as they do (and how much to prioritize this process infrastructure work around other initial start-up goals).

My notes on the talk are below. Overall, I learned a lot and very much enjoyed hearing Brett speak.

Their process:
  • develop a feature increment
  • verify on buildbot
  • commit code to live in production immediately for some set of customers
  • whole process takes 15 min, release about 50 times per day
  • no staging cluster
  • no QA review
Why would you do something like that?
  • most companies develop, release, then pray for customers
  • now, smart companies develop, release, learn, iterate
  • minimize total time through build, measure, learn cycle
Why continuous deployment is good:
  1. release overhead reduces opportunity to iterate
  2. way easier to find regressions/bugs in small batches of commits
  3. fast response times for business opportunities
  4. more turns at bat
  5. book: Principles of Product Development Flow (reducing batch size, lean product development); reducing batch size reduces cycle time, reduces variability in flow, accelerates feedback, reduces risk, reduces overhead; large batches reduce efficiency, inherently lower motivation and energy, cause exponential cost and schedule growth, lead to even larger batches; the entire batch is limited by its worst component

Work process:
  1. local tests pass, engineer commits code
  2. lots and lots of tests run
  3. all tests pass [if no, revert commits]
  4. code deployed to % of servers
  5. metrics good [if no, rollback]
  6. code deployed to all servers
  7. metrics still good [if no, rollback]
  8. win
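
To make the flow above concrete for myself, here is a rough Python sketch of the eight steps. This is my own reading of the process, not IMVU's code (theirs is Perl and rsync). Every function name is made up, and the 10% first stage is an assumption, since the talk only said "% of servers".

```python
from typing import Callable


def deploy_commit(
    run_tests: Callable[[], bool],
    deploy: Callable[[int], None],
    metrics_ok: Callable[[], bool],
    rollback: Callable[[], None],
    revert_commit: Callable[[], None],
    canary_percent: int = 10,  # assumption: the talk did not give a number
) -> str:
    """Walk one commit through the steps and return the outcome."""
    # Steps 2-3: run the full test suite; a failure means the commit is reverted.
    if not run_tests():
        revert_commit()
        return "reverted"
    # Steps 4-7: deploy to part of the cluster, then all of it,
    # checking metrics after each stage.
    for percent in (canary_percent, 100):
        deploy(percent)
        if not metrics_ok():
            rollback()
            return "rolled back at %d%%" % percent
    # Step 8
    return "win"


# Usage with stand-in callables that just record what happened.
log = []
outcome = deploy_commit(
    run_tests=lambda: True,
    deploy=log.append,
    metrics_ok=lambda: True,
    rollback=lambda: log.append("rollback"),
    revert_commit=lambda: log.append("revert"),
)
```

Passing the steps in as callables keeps the sketch independent of any particular test runner or deployment tool.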

amount of time you need to run test depends on volume of people going through funnel

all work done on trunk (no work on branches)
  • avoids merge conflicts
  • all code gets validated in production immediately
  • at bottom sees actual PHP test files and their status (time to complete, running status, etc.)
  • a tag includes multiple PHP test files
  • tests run before checkin on local sandbox
  • push for being test-driven but let people work how they want to work
  • each person responsible for writing tests for their own code
  • local sandbox test suite running through a web browser
  • checkboxes: stop after last test, pause after failure, run tests in random order, only run selected tests
  • want to make testing as unburdensome as possible
great slide in presentation with sample output of "RunTests" test view which allows filtering tags, turning test on/off, seeing tests that pass, fail, run, skip, wait, etc.


use selenium

continuous integration: they use buildbot; others use hudson, jenkins, atlassian bamboo


build servers
  • good screenshot in slides of buildbot view
  • each box represents a server
  • split all the tests up between multiple servers, which turns an 8 hour build into an 8 minute build
  • each server running many tests; they have 40K tests running through test suite
  • having good tests allows new people to start working and new experiments to happen quickly
  • unit tests of code
  • user workflow tests of site UI
  • if code fails in build server, email goes out and immediately the engineer's supposed to revert the code so others can continue to use build server
  • saves and emails output of the test failure
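
Brett didn't describe how the tests get divided between build servers, so the following is only one plausible approach, sketched in Python: sort the tests from slowest to fastest and keep handing the next one to whichever server has the least work so far. The test file names and durations are invented for illustration.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], num_servers: int) -> List[List[str]]:
    """Greedy split: slowest tests first, each to the least-loaded server."""
    if num_servers < 1:
        raise ValueError("need at least one build server")
    # Heap of (total seconds assigned so far, server index).
    heap = [(0.0, index) for index in range(num_servers)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(num_servers)]
    # Sort by duration descending, then by name so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards


# Hypothetical test files with run times in seconds.
example = {
    "CheckoutTest.php": 8,
    "LoginTest.php": 5,
    "ProfileTest.php": 4,
    "SearchTest.php": 3,
    "AvatarTest.php": 2,
}
shards = shard_tests(example, num_servers=2)
# shards == [["CheckoutTest.php", "SearchTest.php"],
#            ["LoginTest.php", "ProfileTest.php", "AvatarTest.php"]]
```

Both servers end up with 11 seconds of tests here. A split like this depends on knowing how long each test takes, which is the kind of data a test metrics system can supply.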

Deployment:
  • code rolled out to cluster
  • a bunch of perl and rsync code
  • symlinks on site
  • keep multiple copies of code
  • process of rolling forward and backward is just changing symlink
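
The symlink trick is simple enough to sketch. This is my own Python illustration rather than their Perl, and the release-1, release-2 and current names are hypothetical. It builds the new link under a temporary name and renames it over the old one, which on POSIX systems replaces the link in a single step.

```python
import os
import tempfile


def point_current_at(release_dir: str, current_link: str) -> None:
    """Repoint current_link at release_dir (works for roll forward and rollback)."""
    tmp_link = current_link + ".tmp"
    # Clear out a leftover temporary link from an interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # Renaming over the existing link swaps it without a moment where it is missing.
    os.replace(tmp_link, current_link)


def demo() -> list:
    """Roll forward to release-2, then back to release-1, in a scratch directory."""
    root = tempfile.mkdtemp()
    old = os.path.join(root, "release-1")
    new = os.path.join(root, "release-2")
    os.mkdir(old)
    os.mkdir(new)
    current = os.path.join(root, "current")
    seen = []
    for target in (old, new, old):
        point_current_at(target, current)
        seen.append(os.path.basename(os.readlink(current)))
    return seen
```

Because every release stays on disk, a rollback is the same operation as a deploy, just pointed at the previous directory.
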
hard part: cluster immune system
  • monitors metrics
  • system performance (web services, disk space, DNS, cron, API availability)
  • business performance (various critical actions/functions, graphs, revenue, registrations)
  • use nagios for system and business metrics
  • if metrics bad, do rollback on cluster (changes symlinks back to previous release, blocks further commits, sends email)
  • server push status web page to diagnose rollback and which metrics killed the push
  • one unfortunate thing in the system is false positives due to real variability in business
  • once metrics good, goes out to entire cluster
  • most wait periods: a couple minutes
  • something it's not very good at: catching very small changes that hurt
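
Here is how I picture the metric check at the heart of the immune system, as a small Python sketch. The metric names, the numbers and the 20% threshold are all my assumptions; IMVU runs its checks through nagios and I don't know the actual rules they use.

```python
from typing import Dict, List


def failing_metrics(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.2,  # assumption: tolerate up to a 20% drop
) -> List[str]:
    """Return the metrics that fell more than max_drop below their baseline."""
    failed = []
    for name, expected in baseline.items():
        observed = current.get(name, 0.0)  # a missing metric counts as zero
        if expected > 0 and (expected - observed) / expected > max_drop:
            failed.append(name)
    return sorted(failed)


def immune_response(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.2,
) -> Dict[str, object]:
    """Decide whether to keep rolling out or roll back and block commits."""
    killed_by = failing_metrics(baseline, current, max_drop)
    if killed_by:
        return {"action": "rollback", "block_commits": True, "killed_by": killed_by}
    return {"action": "continue", "block_commits": False, "killed_by": []}


# Hypothetical per-minute business metrics before and after a push.
before = {"registrations": 50.0, "revenue": 120.0, "logins": 900.0}
after = {"registrations": 48.0, "revenue": 60.0, "logins": 880.0}
decision = immune_response(before, after)
# decision["killed_by"] == ["revenue"]
```

The threshold shows both weaknesses from the notes: set it tight and normal business variability triggers false rollbacks, set it loose and small harmful changes slip through.
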
deployments of deployment system:
  • was manual for a while, hacked together
  • only recently got good test coverage of deployment system (some not even in repository)
  • don't change deployment system that often
aesthetic tests? they don't

everyone emails changes to the change list (basically everyone in company) with before and after state and people can catch problems

they have one monolithic code base

don't have anything that ensures they have test code coverage automatically


Getting Started (story):
  • there were no customers
  • he came in for operational role
  • engineers wrote code and SSH'd in to cluster
  • no auditing, no monitoring
  • would see PHP syntax errors on homepage
  • only 30 customers at that time so didn't matter
  • set the culture of getting stuff out there
  • wrote a nagios check for "are we rendering HTML out to the customer?"
  • if you're writing new code, it should have some coverage (functional easiest at first)
  • commit to making forward progress
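
The "are we rendering HTML?" check is a nice first step, so here is a minimal Python sketch of what such a check could look like. It is not the check Brett wrote. It follows the usual nagios plugin convention of returning 0 for OK and 2 for CRITICAL, and the test for HTML (looking for opening and closing html tags) is my own simplification.

```python
import urllib.request
from typing import Tuple

OK = 0        # nagios plugin exit code for a healthy check
CRITICAL = 2  # nagios plugin exit code for a failed check


def check_html(body: str) -> Tuple[int, str]:
    """Classify a response body as rendered HTML or not."""
    lowered = body.lower()
    if "<html" in lowered and "</html>" in lowered:
        return OK, "OK - page contains a complete html element"
    return CRITICAL, "CRITICAL - response does not look like rendered HTML"


def check_url(url: str) -> int:
    """Fetch url, print a one-line status, and return the nagios exit code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        print("CRITICAL - could not fetch %s: %s" % (url, exc))
        return CRITICAL
    code, message = check_html(body)
    print(message)
    return code


# A bare PHP error message has no html tags, so it is flagged.
healthy = check_html("<html><body>Welcome</body></html>")
broken = check_html("Parse error: syntax error, unexpected T_STRING")
```

A real plugin would end with sys.exit(check_url(...)) so that nagios sees the exit code.
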
new product advice:
  • start w/ sandbox
  • just push
  • ideal time for failures
established product:
  • start w/ production
  • automate deploys. first automate the push. then automate QA.
  • build confidence
new code must have test coverage.

if new code breaks something old, must write test to catch that

expect some hurdles:
  • you will have cluster outages
  • you will spend engineering time on deployment system
  • have culture where failures are looked at as opportunities
  • how do we get excited about never letting this happen again
  • if have blame-searching culture, will have more challenges
scaling:
  • buildbot would go red, and everyone would be blocked
  • when build time 20-30 minutes, bad news
  • problem with intermittent tests
solutions:
  • build isolation [not the solution they ended up with; they didn't need to build this because they could get away with faster test runs: buying hardware and virtualization, sorting tests by speed, and dependency injection (instead of calling the real DB, just supplying the data that would be returned). They also built a hypothesis builder, which is like build isolation: you tag code to run on the hypothesis builder, which does not run on the main buildbot and doesn't block anyone if it fails]
  • added a test metrics system that keeps track of success rate and speed (a lot of builds were blocked on slow tests)
  • got build times down to 8 minutes
  • when builds were over 25 minutes, it was a huge cultural issue
flaky tests / intermittent tests have huge costs:
  • disable or ignore the test
  • third-party providers
  • running tests around time and time spans is much more challenging than normal tests (DST, leap years, etc.)
  • state dependency across tests (overnight, keep running tests in random orders until they become red, and then in morning you see which tests are intermittent and can investigate)
  • they run about 40K tests now
  • even with 5 9's of reliability, you get many failures
  • move them from having to fix them when they happen to fixing them on a schedule
  • if buildbot gets a test that runs green once and then red another time, it will mark it as intermittent, start an issue in bug tracker, and allow the build to go through
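
The five 9's point is worth working out. Assuming each of the 40K tests fails spuriously and independently 1 time in 100,000 (my simplification), a full run still has roughly a one in three chance of going red for no real reason. The Python sketch below does that arithmetic and also shows the green-once, red-once rule as I understood it; the function names are mine.

```python
from typing import List


def chance_of_false_red(num_tests: int, per_test_reliability: float) -> float:
    """Probability that at least one healthy test fails in a single run."""
    return 1.0 - per_test_reliability ** num_tests


def classify(results: List[bool]) -> str:
    """Label a test from its pass/fail results on the same code."""
    if not results:
        raise ValueError("need at least one result")
    if all(results):
        return "green"
    if not any(results):
        return "red"
    # Mixed results on identical code: file a bug, but let the build through.
    return "intermittent"


probability = chance_of_false_red(40_000, 0.99999)
# probability is about 0.33
```

With dozens of builds a day, that works out to several spurious red builds daily, which is why fixing flaky tests on a schedule beats stopping the line for each one.
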
trickier bits:
  • catching issues that fail slow (SQL selects from growing tables)
  • critical areas cause hard lock-ups (MySQL, memcached)
  • lack of test coverage of older code: not an issue if you start with test coverage
  • outsourcing (different hours, culture, branching, slower integration)
changing schema requires sign off from tech lead (checking indexes, scalability of changes)

added query killer (issues kill statements on long queries; better to have code die than DB to be overloaded and take down everybody)
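
A query killer can be sketched without a database by working on rows shaped like MySQL's SHOW PROCESSLIST output (Id, Command, Time, Info). The Python below only builds the KILL QUERY statements; the 30 second limit and the sample rows are assumptions of mine, and I don't know what limit IMVU uses.

```python
from typing import Dict, List


def kill_statements(process_list: List[Dict], max_seconds: int = 30) -> List[str]:
    """Build KILL QUERY statements for queries running longer than max_seconds."""
    statements = []
    for row in process_list:
        # Only running queries count; idle connections show Command == "Sleep".
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append("KILL QUERY %d" % row["Id"])
    return statements


# Hypothetical snapshot of the process list.
snapshot = [
    {"Id": 101, "Command": "Query", "Time": 2, "Info": "SELECT 1"},
    {"Id": 102, "Command": "Query", "Time": 95, "Info": "SELECT * FROM big_table"},
    {"Id": 103, "Command": "Sleep", "Time": 400, "Info": None},
]
to_kill = kill_statements(snapshot)
# to_kill == ["KILL QUERY 102"]
```

KILL QUERY stops the statement but leaves the connection open, so the calling code gets an error instead of the whole database slowing down.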

schema changes on large tables (they use mysql):
  • create a new table
  • do copy on read
  • have background process later migrate the rest of the data
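
To check my understanding of copy on read, here is a toy Python version where two dictionaries stand in for the old and new MySQL tables. The class and method names are invented, and a real migration would also convert each row to the new schema as it is copied.

```python
from typing import Dict, Optional


class CopyOnReadTable:
    """Serve reads from the new table, pulling rows over from the old one on demand."""

    def __init__(self, old_rows: Dict[int, str]) -> None:
        self.old = old_rows             # stand-in for the existing large table
        self.new: Dict[int, str] = {}   # stand-in for the newly created table

    def read(self, key: int) -> Optional[str]:
        if key in self.new:
            return self.new[key]
        if key in self.old:
            # Copy on read: the first access migrates the row.
            self.new[key] = self.old[key]
            return self.new[key]
        return None

    def write(self, key: int, row: str) -> None:
        # All writes go to the new table only.
        self.new[key] = row

    def migrate_batch(self, limit: int) -> int:
        """Background step: copy up to limit rows that nobody has read yet."""
        copied = 0
        for key in list(self.old):
            if copied >= limit:
                break
            if key not in self.new:
                self.new[key] = self.old[key]
                copied += 1
        return copied


table = CopyOnReadTable({1: "alice", 2: "bob", 3: "carol"})
first_read = table.read(2)       # migrates row 2 only
moved = table.migrate_batch(10)  # background pass copies rows 1 and 3
```

Rows that are read often move first, and the background pass picks up whatever is left, so the large table never has to be altered in one blocking step.
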
memcache changes require second set of eyes (hard to test on local sandbox)

hard to work with outsourcers who build over several days (impossible to integrate)

build system itself is critical business function; keep metrics on build system (web dashboard of build process)

integration with A/B testing inside the code (nice slide with pseudocode)
  • name the experiment
  • specify initial rollout % or amount of users
  • specify customer branches with percentage weightings of what % should see enhanced versus non-enhanced (e.g., 50% A/B split)
  • helper function that returns which branch a certain customer should see (enhanced or not) and if not yet assigned then to permanently assign [so customer always gets same experience]
  • simple if statement that splits between if user should see test feature or not
  • web page listing all experiments (available to everyone in company)
  • to user % (QA and admin only, 0%, 10%, etc.)
  • closed on status (they have a page that lists experiments that were closed but the code still exists; this allows easy housekeeping to remove unused code after a while)
per-experiment dashboard to see user groups (male, female, etc.), #s, results (highlighted by desired/undesired colors) and p-values
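
I didn't copy down the pseudocode from the slide, so this Python sketch is reconstructed from my notes rather than from IMVU's code. The experiment name, the in-memory assignment store and the "not_in_experiment" label are all assumptions; in practice the assignment would be stored in a database so it survives restarts.

```python
import random
from typing import Dict, List, Optional, Tuple


class Experiment:
    """Named experiment with a rollout percentage and weighted branches."""

    def __init__(
        self,
        name: str,
        rollout_percent: float,
        branches: List[Tuple[str, float]],
        rng: Optional[random.Random] = None,
    ) -> None:
        self.name = name
        self.rollout_percent = rollout_percent
        self.branches = branches
        self.assignments: Dict[int, str] = {}  # stand-in for a persistent store
        self.rng = rng or random.Random()

    def branch_for(self, customer_id: int) -> str:
        """Return the customer's branch, assigning it permanently on first call."""
        if customer_id not in self.assignments:
            self.assignments[customer_id] = self._pick()
        return self.assignments[customer_id]

    def _pick(self) -> str:
        # Customers outside the rollout percentage never see the experiment.
        if self.rng.random() * 100 >= self.rollout_percent:
            return "not_in_experiment"
        names = [name for name, _ in self.branches]
        weights = [weight for _, weight in self.branches]
        return self.rng.choices(names, weights=weights, k=1)[0]


# A 50/50 split shown to 10% of customers.
experiment = Experiment(
    "new_checkout_button", 10, [("enhanced", 50), ("control", 50)],
    rng=random.Random(1),
)
if experiment.branch_for(42) == "enhanced":
    page = "page with the test feature"
else:
    page = "regular page"
```

Storing the first assignment is what guarantees a customer always gets the same experience, even if the rollout percentage is changed later.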

sprints:
  • planned sprint schedule usually not met (outstanding issues, incomplete features, tech review, refactoring)
  • when releases happen every 15 minutes, "planned sprint ends" can be arbitrary
  • changed to just say that the sprint ends when the work is done (but still understand overage reasons)
IMVU culture:
  • first day on job, engineer pushes out to live customers immediately
  • makes people feel empowered
  • hack-week: you can build anything and company provides food and drink
  • if you're convinced something's important for customers, just build it; you're allowed to release it to 1% of customers without approval