The Release Engineering infrastructure at Mozilla relies heavily on buildbot
for much of our infrastructure. For various reasons we're running an older
version based on buildbot 0.8.2, which we maintain in our own mercurial
repository. We have many different masters with all sorts of different
configurations.
To make sure that we don't break things too often, we wrote a tools called
test-masters.sh that creates local versions of each unique master
configuration and then runs a configuration check on it. Currently there
are 20 unique master configurations to test, and it would take 4 minutes
to run test-masters.sh on all of them on my machine. Recently sfink landed some changes
which would test all the masters in parallel, which brought the time down
from a previously interminable 11 minutes.
Four minutes is a long time to wait! What's taking so much time?
My go-to tool for profiling python code is cProfile. I ended up writing a
small script to do the equivalent of 'buildbot checkconfig':
import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader
def loadMaster(filename):
ConfigLoader(configFileName=filename)
cProfile.run("loadMaster(sys.argv[1])", "master.prof")
Running this against my buildbot master's master.cfg file will produce a
master.prof file I can load to look at the profile.
>>> import pstats
>>> pstats.Stats("master.prof").sort_stats('time').print_stats(5)
Thu Nov 7 21:42:25 2013 master.prof
26601516 function calls (24716688 primitive calls) in 439.756 seconds
Ordered by: internal time
List reduced from 1997 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1 409.381 409.381 438.936 438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig)
170046 3.907 0.000 10.706 0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__)
222809 3.745 0.000 4.124 0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__)
1 3.025 3.025 29.352 29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>)
170046 2.385 0.000 6.033 0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)
Looks like buildbot's loadConfig is taking a long time! Unfortunately we
don't get any more detail than that from cProfile. To get line-by-line
profiling info, I've started using kernprof. This requires a few changes
to our setup. First, we don't want to run cProfile any more, so modify our
test script like this:
import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader
def loadMaster(filename):
ConfigLoader(configFileName=filename)
#cProfile.run("loadMaster(sys.argv[1])", "master.prof")
loadMaster(sys.argv[1])
And since we want to get line-by-line profiling of buildbot's loadConfig,
we need to annotate buildbot's code with the @profile decorator. In
buildbot's master.py, I added @profile to loadConfig so it looks like this
now:
@profile
def loadConfig(self, f, check_synchronously_only=False):
"""Internal function to load a specific configuration file. Any
errors in the file will be signalled by raising an exception.
<snip>
"""
Now run kernprof.py to get our profile:
kernprof.py -l -v ../profile_master.py master.cfg
Line # Hits Time Per Hit % Time Line Contents
==============================================================
621 @profile
622 def loadConfig(self, f, check_synchronously_only=False):
623 """Internal function to load a specific configuration file. Any
624 errors in the file will be signalled by raising an exception.
625
626 If check_synchronously_only=True, I will return (with None)
627 synchronously, after checking the config file for sanity, or raise an
628 exception. I may also emit some DeprecationWarnings.
629
630 If check_synchronously_only=False, I will return a Deferred that
631 fires (with None) when the configuration changes have been completed.
632 This may involve a round-trip to each buildslave that was involved."""
633
634 1 17 17.0 0.0 localDict = {'basedir': os.path.expanduser(self.basedir)}
635 1 7 7.0 0.0 try:
636 1 68783020 68783020.0 12.0 exec f in localDict
637 except:
638 log.msg("error while parsing config file")
639 raise
<snip>
785 13303 86781 6.5 0.0 for b in builders:
786 13302 92920 7.0 0.0 if b.has_key('slavename') and b['slavename'] not in slavenames:
787 raise ValueError("builder %s uses undefined slave %s" \
788 % (b['name'], b['slavename']))
789 6935914 42782768 6.2 7.5 for n in b.get('slavenames', []):
790 6922612 449928915 65.0 78.4 if n not in slavenames:
791 raise ValueError("builder %s uses undefined slave %s" \
792 % (b['name'], n))
793 13302 2478517 186.3 0.4 if b['name'] in buildernames:
Wow! Line 790 is taking 78% of our runtime! What's going on?
If I look at my config, I see that I have 13,302 builders and 13,988 slaves
configured. Each builder has a subset of slaves attached, but we're still doing
6,922,612 checks to see if each slave that's configured for the builder is
configured as a top-level slave. slavenames happens to be a list, so each
check does a full scan of the list to see if the slave exists or not!
A very simple patch fixes this:
diff --git a/master/buildbot/master.py b/master/buildbot/master.py
--- a/master/buildbot/master.py
+++ b/master/buildbot/master.py
@@ -761,19 +761,19 @@ class BuildMaster(service.MultiService):
errmsg = "c['schedulers'] must be a list of Scheduler instances"
assert isinstance(schedulers, (list, tuple)), errmsg
for s in schedulers:
assert interfaces.IScheduler(s, None), errmsg
assert isinstance(status, (list, tuple))
for s in status:
assert interfaces.IStatusReceiver(s, None)
- slavenames = [s.slavename for s in slaves]
+ slavenames = set(s.slavename for s in slaves)
# convert builders from objects to config dictionaries
builders_dicts = []
for b in builders:
if isinstance(b, BuilderConfig):
builders_dicts.append(b.getConfigDict())
elif type(b) is dict:
builders_dicts.append(b)
Now our checks are into a set instead of a list, which is an O(log(n))
operation instead of O(n). Let's re-run our profile with this patch:
File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py
Function: loadConfig at line 621
Total time: 109.885 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
621 @profile
622 def loadConfig(self, f, check_synchronously_only=False):
623 """Internal function to load a specific configuration file. Any
624 errors in the file will be signalled by raising an exception.
625
626 If check_synchronously_only=True, I will return (with None)
627 synchronously, after checking the config file for sanity, or raise an
628 exception. I may also emit some DeprecationWarnings.
629
630 If check_synchronously_only=False, I will return a Deferred that
631 fires (with None) when the configuration changes have been completed.
632 This may involve a round-trip to each buildslave that was involved."""
633
634 1 30 30.0 0.0 localDict = {'basedir': os.path.expanduser(self.basedir)}
635 1 13 13.0 0.0 try:
636 1 46268025 46268025.0 42.1 exec f in localDict
637 except:
638 log.msg("error while parsing config file")
639 raise
<snip>
785 13303 56199 4.2 0.1 for b in builders:
786 13302 60438 4.5 0.1 if b.has_key('slavename') and b['slavename'] not in slavenames:
787 raise ValueError("builder %s uses undefined slave %s" \
788 % (b['name'], b['slavename']))
789 6935914 27430456 4.0 25.0 for n in b.get('slavenames', []):
790 6922612 29028880 4.2 26.4 if n not in slavenames:
791 raise ValueError("builder %s uses undefined slave %s" \
792 % (b['name'], n))
793 13302 1286628 96.7 1.2 if b['name'] in buildernames:
794 raise ValueError("duplicate builder name %s"
795 % b['name'])
We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!
I've landed resulting patch into our repo. I encourage you to upgrade!