A tale of python profiling and optimizing

chris

2013-11-07 19:05

The Release Engineering infrastructure at Mozilla relies heavily on buildbot for much of our infrastructure. For various reasons we're running an older version based on buildbot 0.8.2, which we maintain in our own mercurial repository. We have many different masters with all sorts of different configurations.

To make sure that we don't break things too often, we wrote a tools called test-masters.sh that creates local versions of each unique master configuration and then runs a configuration check on it. Currently there are 20 unique master configurations to test, and it would take 4 minutes to run test-masters.sh on all of them on my machine. Recently sfink landed some changes which would test all the masters in parallel, which brought the time down from a previously interminable 11 minutes.

Four minutes is a long time to wait! What's taking so much time?

My go-to tool for profiling python code is cProfile. I ended up writing a small script to do the equivalent of 'buildbot checkconfig':

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

cProfile.run("loadMaster(sys.argv[1])", "master.prof")

Running this against my buildbot master's master.cfg file will produce a master.prof file I can load to look at the profile.

>>> import pstats
>>> pstats.Stats("master.prof").sort_stats('time').print_stats(5)
Thu Nov  7 21:42:25 2013    master.prof

         26601516 function calls (24716688 primitive calls) in 439.756 seconds

   Ordered by: internal time
   List reduced from 1997 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  409.381  409.381  438.936  438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig)
   170046    3.907    0.000   10.706    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__)
   222809    3.745    0.000    4.124    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__)
        1    3.025    3.025   29.352   29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>)
   170046    2.385    0.000    6.033    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)

Looks like buildbot's loadConfig is taking a long time! Unfortunately we don't get any more detail than that from cProfile. To get line-by-line profiling info, I've started using kernprof. This requires a few changes to our setup. First, we don't want to run cProfile any more, so modify our test script like this:

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

#cProfile.run("loadMaster(sys.argv[1])", "master.prof")
loadMaster(sys.argv[1])

And since we want to get line-by-line profiling of buildbot's loadConfig, we need to annotate buildbot's code with the @profile decorator. In buildbot's master.py, I added @profile to loadConfig so it looks like this now:

@profile
def loadConfig(self, f, check_synchronously_only=False):
    """Internal function to load a specific configuration file. Any
    errors in the file will be signalled by raising an exception.
    <snip>
    """

Now run kernprof.py to get our profile:

kernprof.py -l -v ../profile_master.py master.cfg

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           17     17.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1            7      7.0      0.0          try:
   636         1     68783020 68783020.0     12.0              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        86781      6.5      0.0          for b in builders:
   786     13302        92920      7.0      0.0              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     42782768      6.2      7.5              for n in b.get('slavenames', []):
   790   6922612    449928915     65.0     78.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      2478517    186.3      0.4              if b['name'] in buildernames:

Wow! Line 790 is taking 78% of our runtime! What's going on?

If I look at my config, I see that I have 13,302 builders and 13,988 slaves configured. Each builder has a subset of slaves attached, but we're still doing 6,922,612 checks to see if each slave that's configured for the builder is configured as a top-level slave. slavenames happens to be a list, so each check does a full scan of the list to see if the slave exists or not!

A very simple patch fixes this:

diff --git a/master/buildbot/master.py b/master/buildbot/master.py
--- a/master/buildbot/master.py
+++ b/master/buildbot/master.py
@@ -761,19 +761,19 @@ class BuildMaster(service.MultiService):
         errmsg = "c['schedulers'] must be a list of Scheduler instances"
         assert isinstance(schedulers, (list, tuple)), errmsg
         for s in schedulers:
             assert interfaces.IScheduler(s, None), errmsg
         assert isinstance(status, (list, tuple))
         for s in status:
             assert interfaces.IStatusReceiver(s, None)

-        slavenames = [s.slavename for s in slaves]
+        slavenames = set(s.slavename for s in slaves)

         # convert builders from objects to config dictionaries
         builders_dicts = []
         for b in builders:
             if isinstance(b, BuilderConfig):
                 builders_dicts.append(b.getConfigDict())
             elif type(b) is dict:
                 builders_dicts.append(b)

Now our checks are into a set instead of a list, which is an O(log(n)) operation instead of O(n). Let's re-run our profile with this patch:

File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py
Function: loadConfig at line 621
Total time: 109.885 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           30     30.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1           13     13.0      0.0          try:
   636         1     46268025 46268025.0     42.1              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        56199      4.2      0.1          for b in builders:
   786     13302        60438      4.5      0.1              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     27430456      4.0     25.0              for n in b.get('slavenames', []):
   790   6922612     29028880      4.2     26.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      1286628     96.7      1.2              if b['name'] in buildernames:
   794                                                           raise ValueError("duplicate builder name %s"
   795                                                                            % b['name'])

We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!

I've landed resulting patch into our repo. I encourage you to upgrade!

Comments