Max Mednik

                            Notes on Brett Durrett at LeanLA Talk on Continuous Deployment 01/09/2012
                            Continuous Deployment at Lean LA
                            Another awesome talk by the guys at LeanLA and IMVU!

                            Here's the blurb about the talk and the really knowledgeable speaker:

                            "Continuous Deployment takes continuous integration one-step further, where every commit goes live to production servers. When this process is described it is frequently met with skepticism around site reliability and the ability to scale a business this way, but it works, it scales (with challenges) and it is embraced by the entire organization. IMVU is a leader in Continuous Deployment, with over 5 years of experience scaling this process to support a technical staff of 50 and a business of more that $40 million in annual revenue. Brett G. Durrett, Vice President of Engineering & Operations for IMVU explains the basic mechanics of Continuous Deployment and discusses the value it creates for the entire company. Specific topics that will be covered: Attendees will understand that releasing to customers 20+ times per day is possible and that it does scale from individual developers to large companies. In addition, they will understand how they can make Continuous Deployment successful at their company, from both a technology and cultural standpoint.

                            Brett G. Durrett has over 20 years experience leading development of software and systems ranging from large-scale Internet services to video games. He serves as VP of Engineering at IMVU where he leads the engineering and technical operations teams and was responsible for the operations infrastructure that successfully scaled from two machines to over 700 servers. Prior to IMVU, Brett served as the Director of Engineering, VP of Operations and General Manager for the virtual world at There.com. Brett was also co-founder and CEO of Asylum Entertainment, a game development company."

                            You can watch the talk (in two parts) and see the slides above. I'm pretty much sold on what Brett preaches and am thinking of how to implement continuous deployment in my current projects. He says that having little code and process in place puts you at an advantage, though I'm still wondering how to put in the right infrastructure to have all the tests and deployment run as smoothly and automatically as they do (and how much to prioritize this process infrastructure work around other initial start-up goals).

                            My notes on the talk are below. Overall, I learned a lot and very much enjoyed hearing Brett speak.

                            Their process:
                            • develop a feature increment
                            • verify on buildbot
                            • commit code, which goes live in production immediately for some set of customers
                            • whole process takes 15 min, release about 50 times per day
                            • no staging cluster
                            • no QA review
                            Why would you do something like that?
                            • most companies develop, release, then pray for customers
                            • now, smart companies develop, release, learn, iterate
                            • minimize total time through build, measure, learn cycle
                            Why continuous deployment is good:
                            1. release overhead reduces opportunity to iterate
                            2. way easier to find regressions/bugs in small batches of commits
                            3. fast response times for business opportunities
                            4. more turns at bat
                            5. book: Principles of Product Development Flow (reducing batch size, lean product development); reducing batch size reduces cycle time, reduces variability in flow, accelerates feedback, reduces risk, reduces overhead; large batches reduce efficiency, inherently lower motivation and energy, cause exponential cost and schedule growth, lead to even larger batches; the entire batch is limited by its worst component

                            Work process:
                            1. local tests pass, engineer commits code
                            2. lots and lots of tests run
                            3. all tests pass [if no, revert commits]
                            4. code deployed to % of servers
                            5. metrics good [if no, rollback]
                            6. code deployed to all servers
                            7. metrics still good [if no, rollback]
                            8. win
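To make the work process above concrete for myself, here is a minimal Python sketch of the same steps. It is my own illustration, not IMVU's code: the function name, the callback parameters, and the 10% canary size are all assumptions (the talk only says "% of servers").

```python
from typing import Callable


def deploy_commit(
    run_tests: Callable[[], bool],
    deploy: Callable[[int], None],
    metrics_good: Callable[[], bool],
    rollback: Callable[[], None],
    revert_commit: Callable[[], None],
    canary_percent: int = 10,  # assumed value; not from the talk
) -> str:
    """Walk one commit through the steps listed above and report the outcome."""
    # Steps 2-3: run the full test suite; revert the commit if anything fails.
    if not run_tests():
        revert_commit()
        return "reverted"

    # Steps 4-5: deploy to a percentage of servers, then check metrics.
    deploy(canary_percent)
    if not metrics_good():
        rollback()
        return "rolled back (partial)"

    # Steps 6-7: deploy to all servers, then check metrics again.
    deploy(100)
    if not metrics_good():
        rollback()
        return "rolled back (full)"

    # Step 8.
    return "win"
```

Each callback stands in for a real system (test runner, deploy script, monitoring), which keeps the control flow itself easy to test.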

                            the amount of time you need to run a test depends on the volume of people going through the funnel

                            all work done on trunk (no work on branches)
                            • avoids merge conflicts
                            • all code gets validated in production immediately, so it is tested now
                            • at the bottom of the test view, you see the actual PHP test files and their status (time to complete, running status, etc.)
                            • a tag includes multiple PHP test files
                            • tests run before checkin on local sandbox
                            • push for being test-driven but let people work how they want to work
                            • each person responsible for writing tests for their own code
                            • local sandbox test suite running through a web browser
                            • checkboxes: stop after last test, pause after failure, run tests in random order, only run selected tests
                            • want to make testing as unburdensome as possible
                            great slide in presentation with sample output of "RunTests" test view which allows filtering tags, turning test on/off, seeing tests that pass, fail, run, skip, wait, etc.


                            use selenium

                            continuous integration: they use buildbot; others use hudson, jenkins, atlassian bamboo


                            build servers
                            • good screenshot in slides of buildbot view
                            • each box represents a server
                            • splitting all the tests up between multiple servers turns an 8-hour build into an 8-minute build
                            • each server running many tests; they have 40K tests running through test suite
                            • having good tests allows new people to start working and new experiments to happen quickly
                            • unit tests of code
                            • user workflow tests of site UI
                            • if code fails in build server, email goes out and immediately the engineer's supposed to revert the code so others can continue to use build server
                            • saves and emails output of the test failure
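Brett didn't say how the tests get divided between build servers, so the following is only one plausible approach: a greedy "longest test first" split that keeps each server's total runtime roughly equal. The function name and the idea of using recorded per-test durations are my assumptions.

```python
import heapq
from typing import Dict, List, Tuple


def shard_tests(durations: Dict[str, float], num_servers: int) -> List[List[str]]:
    """Assign tests to servers, always giving the next-longest test
    to the server with the least total work so far."""
    # Min-heap of (total seconds assigned, server index).
    heap: List[Tuple[float, int]] = [(0.0, i) for i in range(num_servers)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(num_servers)]

    # Longest tests first, so the short ones can fill in the gaps.
    by_length = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in by_length:
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards
```

With durations of 8, 5, 4 and 3 seconds on two servers, this gives loads of 11 and 9 seconds instead of 20 seconds in sequence.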

                            Deployment:
                            • code rolled out to cluster
                            • a bunch of perl and rsync code
                            • symlinks on site
                            • keep multiple copies of code
                            • process of rolling forward and backward is just changing symlink
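IMVU's deployment is "a bunch of perl and rsync code" that I haven't seen, so this is just a Python sketch of the symlink idea on a POSIX system. The directory layout (a `releases` directory plus a `current` symlink) and the function name are assumptions of mine.

```python
import os


def activate_release(releases_dir: str, release: str, current_link: str) -> None:
    """Point the `current_link` symlink at releases_dir/release.

    Rolling forward and rolling back are the same operation:
    only the release name changes.
    """
    target = os.path.join(releases_dir, release)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)

    # Build the new symlink under a temporary name, then rename it over
    # the live one. The rename is atomic on POSIX, so the site never sees
    # a moment with no `current` link.
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, current_link)
```

Because several copies of the code stay on disk, a rollback is just `activate_release(releases_dir, previous_release, current_link)`.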
                            hard part: cluster immune system
                            • monitors metrics
                            • system performance (web services, disk space, DNS, cron, API availability)
                            • business performance (various critical actions/functions, graphs, revenue, registrations)
                            • use nagios for system and business metrics
                            • if metrics bad, do rollback on cluster (changes symlinks back to previous release, blocks further commits, sends email)
                            • server push status web page to diagnose rollback and which metrics killed the push
                            • one unfortunate thing in the system is false positives due to real variability in business
                            • once metrics good, goes out to entire cluster
                            • most wait periods: a couple minutes
                            • something it's not very good at: catching very small changes that hurt
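Here is a rough sketch of the kind of check a cluster immune system might make, comparing metrics after a push against a baseline. It is not how IMVU's Nagios setup works in detail (that wasn't shown); the function name, the metric names and the 20% threshold are assumptions, and it only handles metrics where higher is better.

```python
from typing import Dict, List


def failing_metrics(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.2,  # assumed threshold: a 20% drop triggers rollback
) -> List[str]:
    """Return the names of metrics that fell more than `max_drop`
    (as a fraction) below their baseline value."""
    bad = []
    for name, base in baseline.items():
        value = current.get(name, 0.0)  # a missing metric counts as zero
        if base > 0 and (base - value) / base > max_drop:
            bad.append(name)
    return bad
```

A non-empty result would trigger the rollback and name the metrics that killed the push. The threshold also shows both weaknesses mentioned above: set it tight and normal business variability causes false positives, set it loose and a change that hurts by 5% slips through.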
                            deployments of deployment system:
                            • was manual for a while, hacked together
                            • only recently got good test coverage of deployment system (some not even in repository)
                            • don't change deployment system that often
                            aesthetic tests? they don't

                            everyone emails changes to the change list (basically everyone in company) with before and after state and people can catch problems

                            they have one monolithic code base

                            don't have anything that ensures they have test code coverage automatically


                            Getting Started (story):
                            • there were no customers
                            • he came in for operational role
                            • engineers wrote code and SSH'd in to cluster
                            • no auditing, no monitoring
                            • would see PHP syntax errors on homepage
                            • only 30 customers at that time so didn't matter
                            • set the culture of getting stuff out there
                            • wrote a nagios check for "are we rendering HTML out to the customer?"
                            • if you're writing new code, it should have some coverage (functional easiest at first)
                            • commit to making forward progress
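The "are we rendering HTML?" check is simple enough to sketch. This is my guess at the idea in Python, not the original check: the function names and the test for `<html` and `</html>` are assumptions. The exit codes (0 for OK, 2 for CRITICAL) follow the usual Nagios plugin convention.

```python
import urllib.request
from typing import Tuple

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_html(body: str) -> Tuple[int, str]:
    """Decide whether a response body looks like a fully rendered page."""
    lowered = body.lower()
    if "<html" in lowered and "</html>" in lowered:
        return OK, "OK - page contains a complete <html> element"
    return CRITICAL, "CRITICAL - response does not look like rendered HTML"


def check_url(url: str) -> int:
    """Fetch `url`, print a one-line status, and return the exit code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # covers URLError, HTTPError and timeouts
        print(f"CRITICAL - could not fetch {url}: {exc}")
        return CRITICAL
    code, message = check_html(body)
    print(message)
    return code
```

A small wrapper script would call `sys.exit(check_url(...))` with the homepage URL so that Nagios can read the exit code. A page that dies halfway with a PHP parse error never reaches `</html>`, so it would come back CRITICAL.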
                            new product advice:
                            • start w/ sandbox
                            • just push
                            • ideal time for failures
                            established product:
                            • start w/ production
                            • automate deploys. first automate the push. then automate QA.
                            • build confidence
                            new code must have test coverage.

                            if new code breaks something old, must write test to catch that

                            expect some hurdles:
                            • you will have cluster outages
                            • you will spend engineering time on deployment system
                            • have culture where failures are looked at as opportunities
                            • how do we get excited about never letting this happen again
                            • if have blame-searching culture, will have more challenges
                            scaling:
                            • buildbot would go red, and everyone would be blocked
                            • when build time 20-30 minutes, bad news
                            • problem with intermittent tests
                            solutions:
                            • build isolation [not the solution they ended up with; they didn't need to build it because they could get away with faster test runs, buying hardware and virtualization, sorting tests by speed, and dependency injection (instead of calling the real DB, just supplying the data it would return). They also built a hypothesis builder, which is like build isolation: you tag code to run on the hypothesis builder, which is separate from the main buildbot and doesn't block anyone if it fails]
                            • added a test metrics system that keeps track of success rate and speed (a lot of builds were blocked on slow tests)
                            • got build times down to 8 minutes
                            • when builds were over 25 minutes, it was a huge cultural issue
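The dependency injection point is the one I could most easily apply myself, so here is a small sketch. Everything in it (`UserStore`, `FakeUserStore`, `greeting`) is hypothetical; IMVU's code is PHP and looks nothing like this. The point is only that the code under test receives its data source as a parameter, so a test can hand it canned data instead of hitting a real database.

```python
from typing import Dict, Optional, Protocol


class UserStore(Protocol):
    """Anything that can look up a user's name by id."""

    def get_name(self, user_id: int) -> Optional[str]: ...


class FakeUserStore:
    """Test double: returns the data the real DB would have returned."""

    def __init__(self, rows: Dict[int, str]) -> None:
        self._rows = rows

    def get_name(self, user_id: int) -> Optional[str]:
        return self._rows.get(user_id)


def greeting(store: UserStore, user_id: int) -> str:
    """Code under test: it never knows whether `store` is real or fake."""
    name = store.get_name(user_id)
    return f"Hello, {name}!" if name else "Hello, guest!"
```

In production the same `greeting` function would be given a store backed by the real database; in tests the fake keeps the run fast and independent of DB state.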
                            flaky tests / intermittent tests have huge costs:
                            • disable or ignore the test
                            • third-party providers
                            • running tests around time and time spans is much more challenging than normal tests (DST, leap years, etc.)
                            • state dependency across tests (overnight, keep running tests in random orders until they become red, and then in morning you see which tests are intermittent and can investigate)
                            • they run about 40K tests now
                            • even with five 9s of reliability (99.999%), you get many failures
                            • move them from having to fix them when they happen to fixing them on a schedule
                            • if buildbot gets a test that runs green once and then red another time, it will mark it as intermittent, start an issue in bug tracker, and allow the build to go through
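To see why five 9s still hurts at this scale, here is the arithmetic as I understand it. I'm assuming the 99.999% applies to each test independently on each run, which the talk didn't spell out; the function name is mine.

```python
def p_spurious_red(num_tests: int, per_test_reliability: float) -> float:
    """Chance that at least one correct test fails spuriously in a single run,
    assuming every test fails independently of the others."""
    return 1.0 - per_test_reliability ** num_tests


# 40K tests at five 9s each (figures from the notes above).
p = p_spurious_red(40_000, 0.99999)
print(f"chance of a spurious red build per run: {p:.1%}")
```

Under those assumptions roughly one run in three goes red for no real reason, which is a lot when you release about 50 times per day. That is why it makes sense to let buildbot mark such tests as intermittent and fix them on a schedule.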
                            trickier bits:
                            • catching issues that fail slow (SQL selects from growing tables)
                            • critical areas cause hard lock-ups (MySQL, memcached)
                            • lack of test coverage of older code: not an issue if you start with test coverage
                            • outsourcing (different hours, culture, branching, slower integration)
                            changing schema requires sign off from tech lead (checking indexes, scalability of changes)

                            added query killer (issues kill statements on long queries; better to have code die than DB to be overloaded and take down everybody)
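I don't know how IMVU's query killer is built, but the core decision can be sketched in a few lines. This assumes rows shaped like the output of MySQL's `SHOW PROCESSLIST` (columns `Id`, `Command`, `Time`) and an arbitrary 30-second limit; both the function name and the limit are my own choices.

```python
from typing import Iterable, List, Mapping


def kill_statements(
    processlist: Iterable[Mapping[str, object]],
    max_seconds: int = 30,  # assumed limit; not from the talk
) -> List[str]:
    """Build a KILL QUERY statement for every query running longer
    than `max_seconds`. Idle connections (Command = Sleep) are left alone."""
    statements = []
    for row in processlist:
        if row["Command"] == "Query" and int(row["Time"]) > max_seconds:
            statements.append(f"KILL QUERY {int(row['Id'])}")
    return statements
```

A background job would run this against the live process list every few seconds and execute the statements it returns.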

                            schema changes on large tables (they use mysql):
                            • create a new table
                            • do copy on read
                            • have background process later migrate the rest of the data
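Here is a toy model of the copy-on-read approach, using plain dictionaries in place of the old and new MySQL tables. The class and method names are hypothetical, writes are left out, and a real migration would also convert each row to the new schema where this sketch just copies it.

```python
from typing import Dict, Optional


class MigratingTable:
    """Reads come from the new table when possible; rows still in the
    old table are copied across the first time they are read."""

    def __init__(self, old: Dict[int, dict], new: Dict[int, dict]) -> None:
        self.old = old
        self.new = new

    def read(self, key: int) -> Optional[dict]:
        if key not in self.new:
            row = self.old.get(key)
            if row is None:
                return None
            self.new[key] = dict(row)  # copy on read
        return self.new[key]

    def migrate_batch(self, limit: int) -> int:
        """Background step: copy up to `limit` rows nobody has read yet.
        Returns how many rows were copied."""
        pending = [k for k in self.old if k not in self.new][:limit]
        for key in pending:
            self.new[key] = dict(self.old[key])
        return len(pending)
```

Hot rows move over as customers touch them, and the background process mops up the rest in small batches, so the large table never has to be locked for one big change.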
                            memcache changes require second set of eyes (hard to test on local sandbox)

                            hard to work with outsourcers who build over several days (impossible to integrate)

                            build system itself is critical business function; keep metrics on build system (web dashboard of build process)

                            integration with A/B testing inside the code (nice slide with pseudocode)
                            • name the experiment
                            • specify initial rollout % or amount of users
                            • specify customer branches with percentage weightings of what % should see enhanced versus non-enhanced (e.g., 50% A/B split)
                            • helper function that returns which branch a certain customer should see (enhanced or not) and if not yet assigned then to permanently assign [so customer always gets same experience]
                            • simple if statement that splits between if user should see test feature or not
                            • web page listing all experiments (available to everyone in company)
                            • the list shows each experiment's rollout to user % (QA and admin only, 0%, 10%, etc.)
                            • closed on status (they have a page that lists experiments that were closed but the code still exists; this allows easy housekeeping to remove unused code after a while)
                            per-experiment dashboard to see user groups (male, female, etc.), #s, results (highlighted by desired/undesired colors) and p-values
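I didn't copy down the pseudocode from the slide, so this is my own reconstruction of the experiment helper in Python. The class name, the hash-based bucketing, and the in-memory `assignments` dictionary (standing in for whatever persistent storage they use) are all assumptions.

```python
import hashlib
from typing import Dict


class Experiment:
    def __init__(self, name: str, rollout_percent: float,
                 branches: Dict[str, float]) -> None:
        self.name = name
        self.rollout_percent = rollout_percent  # share of customers included
        self.branches = branches                # branch name -> weight in percent
        self.assignments: Dict[int, str] = {}   # stand-in for a persistent store

    def _bucket(self, customer_id: int, salt: str) -> float:
        """Map a customer to a stable number in [0, 100)."""
        key = f"{self.name}:{salt}:{customer_id}".encode()
        digest = hashlib.sha256(key).hexdigest()
        return int(digest[:8], 16) / 0x100000000 * 100.0

    def branch_for(self, customer_id: int) -> str:
        """Return the customer's branch, assigning it permanently on first call."""
        if customer_id in self.assignments:
            return self.assignments[customer_id]
        if self._bucket(customer_id, "rollout") >= self.rollout_percent:
            # Outside the rollout: not recorded, so widening the rollout
            # later can still bring this customer in.
            return "control"
        point = self._bucket(customer_id, "branch")
        chosen, cumulative = "control", 0.0
        for branch, weight in self.branches.items():
            cumulative += weight
            if point < cumulative:
                chosen = branch
                break
        self.assignments[customer_id] = chosen
        return chosen
```

The feature code then reduces to the simple if statement from the talk: `if experiment.branch_for(customer_id) == "enhanced":` show the new feature, otherwise show the old one.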

                            sprints:
                            • planned sprint schedule usually not met (outstanding issues, incomplete features, tech review, refactoring)
                            • when releases happen every 15 minutes, "planned sprint ends" can be arbitrary
                            • changed to just say that the sprint ends when the work is done (but still understand overage reasons)
                            IMVU culture:
                            • first day on job, engineer pushes out to live customers immediately
                            • makes people feel empowered
                            • hack-week: you can build anything and company provides food and drink
                            • if you're convinced something's important for customers, just build it; you're allowed to release it to 1% of customers without approval
                             



                              About Max Mednik

                              Max is an avid entrepreneur and student of life. He is a graduate of Stanford and founder of Ridacto and AMA Capital. He is a member of the business school class of 2012 at UCLA Anderson. He lives in Los Angeles with his family and spends his free time enjoying his many hobbies and interests.
