

PyCon 2016 report

I had the opportunity to spend last week in Portland for PyCon 2016. I'd like to share some of my thoughts and some pointers to good talks I was able to attend. The full schedule can be found here and all the videos are here.

Monday

Brandon Rhodes' Welcome to PyCon was one of the best introductions to a conference I've ever seen. Unfortunately I can't find a link to a recording... What I liked about it was that he made everyone feel very welcome to PyCon and to Portland. He explained some of the simple (but important!) practical details like where to find the conference rooms, how to take transit, etc. He noted that for the first time, they have live transcriptions of the talks being done and put up on screens beside the speaker slides for the hearing impaired.

He also emphasized the importance of keeping questions short during Q&A after the regular sessions. "Please form your question in the form of a question." I've been to way too many Q&A sessions where the person asking the question took the opportunity to go off on a long, unrelated tangent. For the most part, this advice was followed at PyCon: I didn't see very many long-winded questions or statements during Q&A sessions.

Machete-mode Debugging

(abstract; video)

Ned Batchelder gave this great talk about using python's language features to debug problematic code. He ran through several examples of tricky problems that could come up, and how to use things like monkey patching and the debug trace hook to find out where the problem is. One piece of advice I liked was when he said that it doesn't matter how ugly the code is, since it's only going to last 10 minutes. The point is to get the information you need out of the system the easiest way possible, and then you can undo your changes.

Refactoring Python

(abstract; video)

I found this session pretty interesting. We certainly have lots of code that needs refactoring!

Security with object-capabilities

(abstract; video; slides)

I found this interesting, but a little too theoretical. Object capabilities are a completely orthogonal approach to access control lists for modeling security and permissions. It was hard for me to see how we could apply this to the systems we're building.

Awaken your home

(abstract; video)

A really cool intro to the Home Assistant project, which integrates all kinds of IoT type things in your home. E.g. Nest, Sonos, IFTTT, OpenWrt, light bulbs, switches, automatic sprinkler systems. I'm definitely going to give this a try once I free up my Raspberry Pi.

Finding closure with closures

(abstract; video)

A very entertaining session about closures in Python. Does Python even have closures? (yes!)

Life cycle of a Python class

(abstract; video)

Lots of good information about how classes work in Python, including some details about meta-classes. I think I understand meta-classes better after having attended this session. I still don't get descriptors though!

(I hope Mike learns soon that __new__ is pronounced "dunder new" and not "under under new"!)

Deep learning

(abstract; video)

Very good presentation about getting started with deep learning. There are lots of great libraries and pre-trained neural networks out there to get started with!

Building protocol libraries the right way

(abstract; video)

I really enjoyed this talk. Cory Benfield describes the importance of keeping a clean separation between your protocol parsing code, and your IO. It not only makes things more testable, but makes code more reusable. Nearly every HTTP library in the Python ecosystem needs to re-implement its own HTTP parsing code, since all the existing code is tightly coupled to the network IO calls.
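As a toy illustration of that separation (my own sketch, not code from the talk), here's a line-oriented protocol parser that performs no IO at all. Any transport, or a plain unit test, can feed it bytes:

class LineProtocol:
    """Parses a newline-delimited protocol without touching sockets."""

    def __init__(self):
        self._buffer = b""

    def receive_data(self, data):
        """Feed in raw bytes; get back a list of complete lines."""
        self._buffer += data
        events = []
        while b"\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\n", 1)
            events.append(line)
        return events

# The IO layer stays separate and swappable: blocking sockets, asyncio,
# or test data all drive the same parser.
proto = LineProtocol()
assert proto.receive_data(b"HELLO wor") == []
assert proto.receive_data(b"ld\nBYE\n") == [b"HELLO world", b"BYE"]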

Tuesday

Guido's Keynote

(video)

Some interesting notes in here about the history of Python, and a look at what's coming in 3.6.

Click

(abstract; video)

An intro to the click module for creating beautiful command line interfaces.

I like that click helps you to build testable CLIs.

HTTP/2 and asynchronous APIs

(abstract; video)

A good introduction to what HTTP/2 can do, and why it's such an improvement over HTTP/1.x.

Remote calls != local calls

(abstract; video)

Really good talk about failing gracefully. He covered some familiar topics like adding timeouts and retries to things that can fail, but also introduced to me the concept of circuit breakers. The idea with a circuit breaker is to prevent talking to services you know are down. For example, if you have failed to get a response from service X the past 5 times due to timeouts or errors, then open the circuit breaker for a set amount of time. Future calls to service X from your application will be intercepted, and will fail early. This can avoid hammering a service while it's in an error state, and works well in combination with timeouts and retries of course.
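A minimal sketch of the circuit breaker pattern in Python (my own illustration; the names and thresholds are invented, not from the talk) could look like this:

import time

class CircuitBreaker:
    """Fail fast after repeated errors, then allow a retry after a
    cooldown. A sketch of the pattern only, not production code."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                # Circuit is open: fail early instead of hammering the service
                raise RuntimeError("circuit open: failing fast")
            # Cooldown has elapsed: close the circuit and try again
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result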

I was thinking quite a bit about Ben's redo module during this talk. It's a great module for handling retries!

Diving into the wreck

(abstract; video)

A look into diagnosing performance problems in applications. Some neat tools and techniques introduced here, but I felt he blamed the DB a little too much :)

Wednesday

Magic Wormhole

(abstract; video; slides)

I didn't end up going to this talk, but I did have a chance to chat with Brian before. magic-wormhole is a tool to safely transfer files from one computer to another. Think scp, but without needing ssh keys set up already, or direct network flows. Very neat tool!

Computational Physics

(abstract; video)

How to do planetary orbit simulations in Python. A pretty interesting talk; he introduced me to Feynman, and to some of the important characteristics of the simulation methods he presented.

Small batch artisanal bots

(abstract; video)

Hilarious talk about building bots with Python. Definitely worth watching, although unfortunately it's only a partial recording.

Gilectomy

(abstract; video)

The infamous GIL is gone! And your Python programs only run 25x slower!

Larry describes why the GIL was introduced, what it does, and what's involved with removing it. He's actually got a fork of Python with the GIL removed, but performance suffers quite a bit when run without the GIL.

Lars' Keynote

(video)

If you watch only one video from PyCon, watch this. It's just incredible.

MozLando Survival Guide

MozLando is coming!

I thought I would share a few tips I've learned over the years of how to make the most of these company gatherings. These summits or workweeks are always full of awesomeness, but they can also be confusing and overwhelming.

#1 Seek out people

It's great to have a (short!) list of people you'd like to see in person. Maybe somebody you've only met on IRC, Vidyo, or Bugzilla?

Having a list of people you want to say "thank you" in person to is a great way to approach this. Who doesn't like to hear a sincere "thank you" from someone they work with?

#2 Take advantage of increased bandwidth

I don't know about you, but I can find it pretty challenging at times to get my ideas across in IRC or on an etherpad. It's so much easier in person, with a pad of paper or whiteboard in front of you. You can share ideas with people, and have a latency/lag-free conversation! No more fighting AV issues!

#3 Don't burn yourself out

A week of full days of meetings, code sprints, and blue sky dreaming can be really draining. Don't feel bad if you need to take a breather. Go for a walk or a jog. Take a nap. Read a book. You'll come back refreshed, and ready to engage again.

That's it!

I look forward to seeing you all next week!

Firefox builds on the Taskcluster Index

RIP FTP?

You may have heard rumblings that FTP is going away...


Over the past few quarters we've been working to migrate our infrastructure off of the ageing "FTP" [1] system to Amazon S3.

We've maintained some backwards compatibility for the time being [2], so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably archive.mozilla.org, since we don't support the ftp protocol any more!

Our long term plan is to make the builds available via the Taskcluster Index, and stop uploading builds to archive.mozilla.org.

How do I find my builds???


This is a pretty big change, but we really think this will make it easier to find the builds you're looking for.

The Taskcluster Index allows us to attach multiple "routes" to a build job. Think of a route as a kind of hierarchical tag, or directory. Unlike regular directories, a build can be tagged with multiple routes, for example, according to the revision or buildid used.

A great tool for exploring the Taskcluster Index is the Indexed Artifact Browser.

Here are some recent examples of nightly Firefox builds:

The latest win64 nightly Firefox build is available via the

gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt route

This same build (as of this writing) is also available via its revision:

gecko.v2.mozilla-central.nightly.revision.47b49b0d32360fab04b11ff9120970979c426911.firefox.win64-opt

Or the date:

gecko.v2.mozilla-central.nightly.2015.11.27.latest.firefox.win64-opt

The artifact browser is simply an interface on top of the index API. Using this API, you can also fetch files directly using wget, curl, python requests, etc.:

https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe [3]
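For example, a minimal download with python requests might look like this (the route and installer name are the examples from this post, and the output filename is arbitrary):

import requests

route = "gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt"
artifact = "public/build/firefox-45.0a1.en-US.win64.installer.exe"
url = "https://index.taskcluster.net/v1/task/%s/artifacts/%s" % (route, artifact)

# Stream the artifact to disk rather than buffering it all in memory
response = requests.get(url, stream=True)
response.raise_for_status()
with open("firefox-installer.exe", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)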

Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I encourage you to explore the gecko.v2 namespace, and see if it makes things easier for you to find what you're looking for! [4]

Can't find what you want in the index? Please let us know!

MozFest 2015

I had the privilege of attending MozFest last week. Overall it was a really great experience. I met lots of really wonderful people, and learned about so many really interesting and inspiring projects.

My biggest takeaway from MozFest was how important it is to provide good APIs and data for your systems. You can't predict how somebody else will be able to make use of your data to create something new and wonderful. But if you're not making your data available in a convenient way, nobody can make use of it at all!

It was a really good reminder for me. We generate a lot of data in Release Engineering, but it's not always exposed in a way that's convenient for other people to make use of.

The rest of this post is a summary of various sessions I attended.

Friday

Friday night started with a Science Fair. Lots of really interesting stuff here. Some of the projects that stood out for me were:

  • naturebytes - a DIY wildlife camera based on the raspberry pi, with an added bonus of aiding conservation efforts.

  • histropedia - really cool visualizations of time lines, based on data in Wikipedia and Wikidata. This was the first time I'd heard of Wikidata, and the possibilities were very exciting to me! More on this later, as I attended a whole session on Wikidata.

  • Several projects related to the Internet-of-Things (IOT)

Saturday

On Saturday, the festival started with some keynotes. Mark Surman spoke about how MozFest was a bit chaotic, but this was by design. Just as the web is an open platform you can use to build your own ideas, MozFest should be an open platform where you can meet, brainstorm, and work on your ideas. This means it can seem a bit disorganized, but that's a good thing :) You get what you want out of it.

I attended several good sessions on Saturday as well:

  • Ending online tracking. We discussed various methods currently used to track users, such as cookies and fingerprinting, and what can be done to combat these. I learned, or re-learned, about a few interesting Firefox extensions as a result:

    • privacybadger. Similar to Firefox's tracking protection, except it doesn't rely on a central blacklist. Instead, it tries to automatically identify third party domains that are setting cookies, etc. across multiple websites. Once identified, these third party domains are blocked.

    • https everywhere. Makes it easier to use HTTPS by default everywhere.

  • Intro to D3JS. d3js is a JS data visualization library. It's quite powerful, but something I learned is that you're expected to do quite a bit of work up-front to make sure it's showing you the things you want. It's not great as a data exploration library, where you're not sure exactly what the data means, and want to look at it from different points of view. The nvd3 library may be more suitable for first time users.

  • 6 kitchen cases for IOT. We discussed the proposed IOT design manifesto briefly, and then split up into small groups to try to design a product, using the principles outlined in the manifesto. Our group was tasked with designing some product that would help connect hospitals with amateur chefs in their local area, to provide meals for patients at the hospital. We ended up designing a "smart cutting board" with a built-in display, that would show you your recipes as you prepared them, but also collect data on the frequency of your meal preparations, and what types of foods you were preparing.

    Going through the exercise of evaluating the product with each of the design principles was fun. You could be pretty evil going into this and try and collect all customer data :)

Sunday

  • How to fight an internet shutdown - we role played how we would react if the internet was suddenly shut down during some political protests. What kind of communications would be effective? What kind of preparation can you have done ahead of time for such an event?

    This session was run by Deji from accessnow. It was really eye opening to see how internet shutdowns happen fairly regularly around the world.

  • Data is beautiful: an introduction to Wikidata. Wikidata is like Wikipedia, but for data. An open database of...stuff. Anybody can edit and query the database. One of the really interesting features of Wikidata is that localization is kind of built-in as part of the design. Each item in the database is assigned an id (prefixed by "Q"). E.g. Q42 is Douglas Adams. The description for each item is simply a table of locale -> localized description. There's no inherent bias towards English, or any other language. The beauty of this is that you can reference the same piece of data from multiple languages, only having to focus on localizing the various descriptions. You can imagine different translations of the same Wikipedia page right now being slightly inconsistent due to each one having to be updated separately. If they could instead reference the data in Wikidata, then there's only one place to update the data, and all the other places that reference that data would automatically benefit from it.

    The query language is quite powerful as well. A simple demonstration was "list all the works of art in the same room in the Louvre as the Mona Lisa."

    It really got me thinking about how powerful open data is. How can we in Release Engineering publish our data so others can build new, interesting and useful tools on top of it?

  • Local web. Various options for purely local web / networks were discussed. There are some interesting mesh network options available; Commotion was demoed. These kinds of distributions give you file exchange, messaging, email, etc. on a local network that's not necessarily connected to the internet.

RelEng 2015 (Part 2)

This is the second part of my report on the RelEng 2015 conference that I attended a few weeks ago. Don't forget to read part 1!

I've put up links to the talks whenever I've been able to find them.

Defect Prediction

Davide Giacomo Cavezza gave a presentation about his research into defect prediction in rapidly evolving software. The question he was trying to answer was whether it's possible to predict if any given commit is defective. He compared static models against dynamic models that adapt over the lifetime of a project. The models look at various metrics about the commit such as:

  • Number of files the commit is changing

  • Entropy of the changes

  • The amount of time since the files being changed were last modified

  • Experience of the developer

His simulations showed that a dynamic model yields superior results to a static model, but that even a dynamic model doesn't give sufficiently accurate results.
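As a rough sketch of what the static variant might look like, here's a tiny classifier over those commit metrics (the features, data values, and library choice are all mine, invented for illustration):

from sklearn.linear_model import LogisticRegression

# Each row is one commit: [files changed, change entropy, hours since
# the files were last modified, developer experience]. The label is 1
# if the commit was later linked to a defect. All values invented.
X = [[12, 0.9, 2, 1],
     [1, 0.1, 400, 9],
     [7, 0.6, 24, 3],
     [2, 0.2, 90, 7]]
y = [1, 0, 1, 0]

model = LogisticRegression().fit(X, y)

# Score an incoming commit; a high probability could drive a
# "this commit looks risky" warning from automation.
risk = model.predict_proba([[9, 0.8, 6, 2]])[0][1]
print("defect risk: %.0f%%" % (risk * 100))

A dynamic model would simply refit periodically on the most recent commits as the project evolves.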

I wonder about adopting something like this at Mozilla. Would a warning that a given commit looks particularly risky be a useful signal from automation?

Continuous Deployment and Schema Evolution

Michael de Jong spoke about a technique for coping with SQL schema evolution in continuously deployed applications. There are a few major issues with changing SQL schemas in production including:

  • schema changes typically block other reads/writes from occurring

  • application logic needs to be synchronized with the schema change. If the schema change takes non-trivial time, which version of the application should be running in the meanwhile?

Michael's solution was to essentially create forks of all tables whenever the schema changes, and to identify each by a unique version. Applications need to specify which version of the schema they're working with. There's a special DB layer that manages the double writes to both versions of the schema that are live at the same time.
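A toy version of that double-write layer (hypothetical names; the real tooling is considerably more involved) might look like:

import sqlite3

class VersionedTable:
    """Writes every insert to all live forks of a table so that old and
    new application versions can run side by side during a migration."""

    def __init__(self, live_versions):
        self.live_versions = live_versions  # schema version -> physical table

    def insert(self, db, rows_by_version):
        # rows_by_version maps schema version -> row shaped for that fork
        for version, table in self.live_versions.items():
            row = rows_by_version[version]
            cols = ", ".join(row)
            params = ", ".join(":" + c for c in row)
            db.execute("INSERT INTO %s (%s) VALUES (%s)" % (table, cols, params), row)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_v1 (name TEXT)")
db.execute("CREATE TABLE users_v2 (name TEXT, email TEXT)")

users = VersionedTable({"v1": "users_v1", "v2": "users_v2"})
users.insert(db, {"v1": {"name": "alice"},
                  "v2": {"name": "alice", "email": "alice@example.com"}})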

Securing a Deployment Pipeline

Paul Rimba spoke about securing deployment pipelines.

One of the areas of focus was on the "build server". He started with the observation that a naive implementation of a Jenkins infrastructure has the master and worker on the same machine, and so the workers have the opportunity to corrupt the state of the master, or of other job types' workspaces. His approach to hardening this part of the pipeline was to isolate each job type into its own docker worker. He also mentioned using a microservices architecture to isolate each task into its own discrete component - easier to secure and reason about.

We've been moving this direction at Mozilla for our build infrastructure as well, so it's good to see our decision corroborated!

Makefile analysis

Shurui Zhou presented some data about extracting configuration data from build systems. Her goal was to determine if all possible configurations of a system were valid.

The major problem here is make (of course!). Expressions such as $(shell uname -r) make static analysis impossible.

We had a very good discussion about build systems following this, and how various organizations have moved themselves away from using make. Somebody described these projects as requiring a large amount of "activation energy". I thought this was a very apt description: lots of upfront cost for little visible change.

However, people unanimously agreed that make was not optimal, and that the benefits of moving to a more modern system were worth the effort.

Yak Shaving

Benjamin Smedberg wrote about shaving yaks a few weeks ago. This is an experience all programmers (and probably most other professions!) encounter regularly, and yet it still baffles and frustrates us!

Just last week I got tangled up in a herd of yaks.

I started the week wanting to deploy the new bundleclone extension that :gps has been working on. If you're not familiar with bundleclone, it's a combination of server and client side extensions to mercurial that allow the server to advertise an externally hosted static bundle to clients requesting full clones. hg.mozilla.org advertises bundles hosted on S3 for certain repositories. This can significantly reduce load on the mercurial servers, something that has caused us problems in the past!

With the initial puppet patch written and sanity-checked, I wanted to verify it worked on other platforms before deploying it more widely.

Linux...looks ok. I was able to make some minor tweaks to existing puppet modules to make testing a bit easier as well. Always good to try and leave things in better shape for the next person!

OSX...looks ok...er, wait! All our test jobs are re-cloning tools and mozharness for every job! What's going on? I'm worried that we could be paying for unnecessary S3 bandwidth if we don't fix this.


I find bug 1103700 where we started deploying and relying on machine-wide caches of mozharness and tools. Unfortunately this wasn't done on OSX yet. Ok, time to file a new bug and start a new puppet branch to get these shared repositories working on OSX. The patch is pretty straightforward, and I start testing it out.

Because these tasks run at boot as part of runner, I need to be watching logs on the test machines very early on in the bootup process. Hmmm, the machine seems really sluggish when I'm trying to interact with it right after startup. A quick glance at top shows something fishy:

(screenshot: fseventsd near the top of top's output)

There's a system daemon called fseventsd taking a substantial amount of system resources! I wonder if this is causing any impact to build times. After filing a new bug, I do a few web searches to see if this service can be disabled. Most of the articles I find suggest writing a no_log file into the volume's .fseventsd directory to disable fseventsd from writing out event logs for the volume. Time for another puppet branch!

And that's the current state of this project! I'm currently testing the no_log fix to see if it impacts build times, or if it causes other problems on the machines. I'm hoping to deploy the new runner tasks this week, and switch our buildbot-configs and mozharness configs to make use of them.

And finally, the first yak I started shaving could be the last one I finish shaving; I'm hoping to deploy the bundleclone extension this week as well.

RelEng 2015 (part 1)

Last week, I had the opportunity to attend RelEng 2015 - the 3rd International Workshop of Release Engineering. This was a fantastic conference, and I came away with lots of new ideas for things to try here at Mozilla.

I'd like to share some of my thoughts and notes I took about some of the sessions. As of yet, the speakers' slides aren't collected or linked to from the conference website. Hopefully they'll get them up soon! The program and abstracts are available here.

For your sake (and mine!) I've split up my notes into a few separate posts. This post covers the introduction and keynote.

tl;dr

"Continuous deployment" of web applications is basically a solved problem today. What remains is for organizations to adopt best practices. Mobile/desktop applications remain a challenge.

Cisco relies heavily on collecting and analyzing metrics to inform their approach to software development. Statistically speaking, quality is the best driver of customer satisfaction. There are many aspects to product quality, but new lines of code introduced per release gives a good predictor of how many new bugs will be introduced. It's always challenging to find enough resources to focus on software quality; being able to correlate quality to customer satisfaction (and therefore market share, $$$) is one technique for getting organizational support for shipping high quality software. Common release criteria such as bugs found during testing, or bug fix rate, are used to inform stakeholders as to the quality of the release.

Introductory Session

Bram Adams and Foutse Khomh kicked things off with an overview of "continuous deployment" over the last 5 years. Back in 2009 we were already talking about systems where pushing to version control would trigger tens of thousands of tests, and do canary deployments up to 50 times a day.

Today we see companies like Facebook demonstrating that continuous deployment of web applications is basically a solved problem. Many organizations are still trying to implement these techniques. Mobile [and desktop!] applications still present a challenge.

Keynote

Pete Rotella from Cisco discussed how he and his team measured and predicted release quality for various projects at Cisco. His team is quite focused on data and analytics.

Cisco has relatively long release cycles compared to what we at Mozilla are used to now. They release 2-3 times per year, with each release representing approximately 500kloc of new code. Their customers really like predictable release cycles, and also don't like releases that are too frequent. Many of their customers have their own testing / validation cycles for releases, and so are only willing to update for something they deem critical.

Pete described how he thought software projects had four degrees of freedom in which to operate, and how quality ends up being the one sacrificed most often in order to compensate for constraints in the others:

  • resources (people / money): It's generally hard to hire more people or find room in the budget to meet the increasing demands of customers. You also run into the mythical man month problem by trying to throw more people at a problem.

  • schedule (time): Having standard release cycles means organizations don't usually have a lot of room to push out the schedule so that features can be completed properly.

    I feel that at Mozilla, the rapid release cycle has helped us out to some extent here. The theory is that if your feature isn't ready for the current version, it can wait for the next release which is only 6 weeks behind. However, I do worry that we have too many features trying to get finished off in aurora or even in beta.

  • content (features): Another way to get more room to operate is to cut features. However, it's generally hard to cut content or features, because those are what customers are most interested in.

  • quality: Pete believes this is where most organizations steal resources from, to make up for people/schedule/content constraints. It's a poor long-term play, and despite "quality is our top priority" being the Official Party Line, most organizations don't invest enough here. What's working against quality?

    • plethora of releases: lots of projects / products / special requests for releases. Attempts to reduce the # of releases have failed on most occasions.

    • monetization of quality is difficult. Pete suggests tying the cost of a poor quality release to this. How many customers will we lose with a buggy release?

    • having RelEng and QA embedded in Engineering teams is a problem; they should be independent organizations so that their recommendations can have more weight.

    • "control point exceptions" are common. e.g. VP overrides recommendations of QA / RelEng and ships the release.

Why should we focus on quality? Pete's metrics show that it's the strongest driver of customer satisfaction. Your product's customer satisfaction needs to be more than 4.3/5 to get more than marginal market share.

How can RelEng improve metrics?

  • simple dashboards

  • actionable metrics - people need to know how to move the needle

  • passive - use existing data. everybody's stretched thin, so requiring other teams to add more metadata for your metrics isn't going to work.

  • standardized quality metrics across the company

  • informing engineering teams about risk

  • correlation with customer experience.

Interestingly, handling the backlog of bugs has minimal impact on customer satisfaction. In addition, there's substantial risk introduced whenever bugs are fixed late in a release cycle. There's an exponential relationship between new lines of code added and # of defects introduced, and therefore customer satisfaction.

Another good indicator of customer satisfaction is the number of "Customer found defects" - i.e. the number of bugs found and reported by their customers vs. bugs found internally.

Pete's data shows that if they can find more than 80% of the bugs in a release prior to it being shipped, then the remaining bugs are very unlikely to impact customers. He uses lines of code added for previous releases, and historical bug counts per version to estimate number of bugs introduced in the current version given the new lines of code added. This 80% figure represents one of their "Release Criteria". If less than 80% of predicted bugs have been found, then the release is considered risky.
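As a back-of-the-envelope illustration of that criterion (every number below is made up, not from Pete's talk):

# Predict defects from new lines of code using a historical
# bugs-per-kloc rate, then apply the 80% release criterion.
historical_bugs_per_kloc = 1.4
new_kloc = 500
bugs_found_in_testing = 520

predicted_bugs = historical_bugs_per_kloc * new_kloc  # 700.0
found_ratio = bugs_found_in_testing / predicted_bugs  # ~0.74

if found_ratio < 0.80:
    print("risky release: only %.0f%% of predicted bugs found" % (found_ratio * 100))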

Another "Release Criteria" Pete discussed was the weekly rate of fixing bugs. Data shows that good quality releases have the weekly bug fix rate drop to 43% of the maximum rate at the end of the testing cycle. This data demonstrates that changes late in the cycle have a negative impact on software quality. You really want to be fixing fewer and fewer bugs as you get closer to release.

I really enjoyed Pete's talk! It left me with a lot to think about, and plenty of ideas we might apply at Mozilla.

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates

Balrog


All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server, aus3.mozilla.org, to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!

Funsize

When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but they're not yet published to the regular nightly update channel.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp.mozilla.org is going away...


...in its current incarnation at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind ftp.mozilla.org will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the Taskcluster Index in addition to doing parallel uploads to ftp.mozilla.org. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks, partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.

Infrastructure

In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for Windows Firefox builds. We'll be working on making these virtualized instances more performant and able to handle large-scale automation before we roll them out into production.

Puppet on Windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different toolchain, Active Directory and Group Policy Objects, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and the community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.
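To illustrate the content-addressable idea, here's a small sketch that produces a record shaped loosely like a tooltool manifest entry (the exact manifest format may differ; the path is just an example):

import hashlib
import json
import os

def manifest_entry(path, algorithm="sha512"):
    # The digest of the contents, not the filename, is what identifies
    # a file in a content-addressable store.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return {
        "filename": os.path.basename(path),
        "size": os.path.getsize(path),
        "algorithm": algorithm,
        "digest": h.hexdigest(),
    }

print(json.dumps(manifest_entry("clang.tar.bz2"), indent=2))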

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.

Testing

Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to prioritize the tests that have been catching the most regressions. Other tests are run less frequently.

Diving into python logging

Python has a very rich logging system. It's very easy to add structured or unstructured log output to your python code, and have it written to a file, or output to the console, or sent to syslog, or to customize the output format.

We're in the middle of re-examining how logging works in mozharness to make it easier to factor-out code and have fewer mixins.

Here are a few tips and tricks that have really helped me with python logging:

There can be only more than one

Well, there can be only one logger with a given name. There is a special "root" logger with no name. Multiple getLogger(name) calls with the same name will return the same logger object. This is an important property because it means you don't need to explicitly pass logger objects around in your code. You can retrieve them by name if you wish. The logging module maintains a global registry of logging objects.
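You can see the registry behavior directly:

import logging

a = logging.getLogger("app.db")
b = logging.getLogger("app.db")
assert a is b  # getLogger() returned the same registered object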

You can have multiple loggers active, each specific to its own module or even class or instance.

Each logger has a name, typically the name of the module it's being used from. A common pattern you see in python modules is this:

# in module foo.py
import logging
log = logging.getLogger(__name__)

This works because inside foo.py, __name__ is equal to "foo". So inside this module the log object is specific to this module.

Loggers are hierarchical

The names of the loggers form their own namespace, with "." separating levels. This means that if you have loggers called foo.bar, and foo.baz, you can do things on logger foo that will impact both of the children. In particular, you can set the logging level of foo to show or ignore debug messages for both submodules.

# Let's enable all the debug logging for all the foo modules
import logging
logging.getLogger('foo').setLevel(logging.DEBUG)

Log messages are like events that flow up through the hierarchy

Let's say we have a module foo.bar:

import logging
log = logging.getLogger(__name__)  # __name__ is "foo.bar" here

def make_widget():
    log.debug("made a widget!")

When we call make_widget(), the code generates a debug log message. Each logger in the hierarchy has a chance to output something for the message, ignore it, or pass the message along to its parent.

The default configuration for loggers is to have their levels unset (or set to NOTSET). This means the logger will just pass the message on up to its parent. Rinse & repeat until you get up to the root logger.

So if the foo.bar logger hasn't specified a level, the message will continue up to the foo logger. If the foo logger hasn't specified a level, the message will continue up to the root logger.

This is why you typically configure the logging output on the root logger; it typically gets ALL THE MESSAGES!!! Because this is so common, there's a dedicated method for configuring the root logger: logging.basicConfig()

This also allows us to use mixed levels of log output depending on where the messages are coming from:

import logging

# Enable debug logging for all the foo modules
logging.getLogger("foo").setLevel(logging.DEBUG)

# Configure the root logger to log only INFO calls, and output to the console
# (the default)
logging.basicConfig(level=logging.INFO)

# This will output the debug message
logging.getLogger("foo.bar").debug("ohai!")

If you comment out the setLevel(logging.DEBUG) call, you won't see the message at all.

exc_info is teh awesome

All the built-in logging calls support a keyword called exc_info which, if it isn't false, causes the current exception information to be logged in addition to the log message. e.g.:

import logging
logging.basicConfig(level=logging.INFO)

log = logging.getLogger(__name__)

try:
    assert False
except AssertionError:
    log.info("surprise! got an exception!", exc_info=True)

There's a special case for this, log.exception(), which is equivalent to log.error(..., exc_info=True).

Python 3.2 introduced a new keyword, stack_info, which will include the current call stack in the log output. Very handy to figure out how you got to a certain point in the code, even if no exceptions have occurred!
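For example (requires Python 3.2 or later):

import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def inner():
    # stack_info=True appends the current call stack to the log output
    log.info("how did we get here?", stack_info=True)

def outer():
    inner()

outer()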

"No handlers found..."

You've probably come across this message, especially when working with 3rd party modules. What this means is that you don't have any logging handlers configured, and something is trying to log a message. The message has gone all the way up the logging hierarchy and fallen off the...top of the chain (maybe I need a better metaphor).

import logging
log = logging.getLogger()
log.error("no log for you!")

outputs:

No handlers could be found for logger "root"

There are two things that can be done here:

  1. Configure logging in your module with basicConfig() or similar

  2. Library authors should add a NullHandler at the root of their module to prevent this, as sketched below. See the cookbook and this blog post for more details.
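The cookbook recipe for library authors boils down to one call:

# in your library's top-level __init__.py
import logging

# Discard records that reach the top of the library's logger hierarchy
# unless the application has configured handlers of its own.
logging.getLogger(__name__).addHandler(logging.NullHandler())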

Want more?

I really recommend that you read the logging documentation and cookbook, which have a lot more great information (and are also very well written!). There's a lot more you can do, with custom log handlers, different output formats, outputting to many locations at once, etc. Have fun!

Upcoming hotness from RelEng

To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.

Balrog

First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/

Funsize

Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

We're aiming to have nightly builds use funsize for partial update generation this quarter.

I'd really like to see us get away from the model where the "nightly build" job is responsible not only for the builds, but also for generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

S3 uploads from automation

This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

No reboots

Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

Release build promotion

Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

... and much more!

This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!