A tale of python profiling and optimizing
The Release Engineering infrastructure at Mozilla relies heavily on buildbot for much of our infrastructure. For various reasons we're running an older version based on buildbot 0.8.2, which we maintain in our own mercurial repository. We have many different masters with all sorts of different configurations.
To make sure that we don't break things too often, we wrote a tools called test-masters.sh that creates local versions of each unique master configuration and then runs a configuration check on it. Currently there are 20 unique master configurations to test, and it would take 4 minutes to run test-masters.sh on all of them on my machine. Recently sfink landed some changes which would test all the masters in parallel, which brought the time down from a previously interminable 11 minutes.
Four minutes is a long time to wait! What's taking so much time?
My go-to tool for profiling python code is cProfile. I ended up writing a small script to do the equivalent of 'buildbot checkconfig':
import cProfile import sys from buildbot.scripts.checkconfig import ConfigLoader def loadMaster(filename): ConfigLoader(configFileName=filename) cProfile.run("loadMaster(sys.argv[1])", "master.prof")
Running this against my buildbot master's master.cfg file will produce a master.prof file I can load to look at the profile.
>>> import pstats >>> pstats.Stats("master.prof").sort_stats('time').print_stats(5) Thu Nov 7 21:42:25 2013 master.prof 26601516 function calls (24716688 primitive calls) in 439.756 seconds Ordered by: internal time List reduced from 1997 to 5 due to restriction <5> ncalls tottime percall cumtime percall filename:lineno(function) 1 409.381 409.381 438.936 438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_ hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig) 170046 3.907 0.000 10.706 0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_ hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__) 222809 3.745 0.000 4.124 0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_ hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__) 1 3.025 3.025 29.352 29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>) 170046 2.385 0.000 6.033 0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_ hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)
Looks like buildbot's loadConfig is taking a long time! Unfortunately we don't get any more detail than that from cProfile. To get line-by-line profiling info, I've started using kernprof. This requires a few changes to our setup. First, we don't want to run cProfile any more, so modify our test script like this:
import cProfile import sys from buildbot.scripts.checkconfig import ConfigLoader def loadMaster(filename): ConfigLoader(configFileName=filename) #cProfile.run("loadMaster(sys.argv[1])", "master.prof") loadMaster(sys.argv[1])
And since we want to get line-by-line profiling of buildbot's loadConfig, we need to annotate buildbot's code with the @profile decorator. In buildbot's master.py, I added @profile to loadConfig so it looks like this now:
@profile def loadConfig(self, f, check_synchronously_only=False): """Internal function to load a specific configuration file. Any errors in the file will be signalled by raising an exception. <snip> """
Now run kernprof.py to get our profile:
kernprof.py -l -v ../profile_master.py master.cfg Line # Hits Time Per Hit % Time Line Contents ============================================================== 621 @profile 622 def loadConfig(self, f, check_synchronously_only=False): 623 """Internal function to load a specific configuration file. Any 624 errors in the file will be signalled by raising an exception. 625 626 If check_synchronously_only=True, I will return (with None) 627 synchronously, after checking the config file for sanity, or raise an 628 exception. I may also emit some DeprecationWarnings. 629 630 If check_synchronously_only=False, I will return a Deferred that 631 fires (with None) when the configuration changes have been completed. 632 This may involve a round-trip to each buildslave that was involved.""" 633 634 1 17 17.0 0.0 localDict = {'basedir': os.path.expanduser(self.basedir)} 635 1 7 7.0 0.0 try: 636 1 68783020 68783020.0 12.0 exec f in localDict 637 except: 638 log.msg("error while parsing config file") 639 raise <snip> 785 13303 86781 6.5 0.0 for b in builders: 786 13302 92920 7.0 0.0 if b.has_key('slavename') and b['slavename'] not in slavenames: 787 raise ValueError("builder %s uses undefined slave %s" \ 788 % (b['name'], b['slavename'])) 789 6935914 42782768 6.2 7.5 for n in b.get('slavenames', []): 790 6922612 449928915 65.0 78.4 if n not in slavenames: 791 raise ValueError("builder %s uses undefined slave %s" \ 792 % (b['name'], n)) 793 13302 2478517 186.3 0.4 if b['name'] in buildernames:
Wow! Line 790 is taking 78% of our runtime! What's going on?
If I look at my config, I see that I have 13,302 builders and 13,988 slaves configured. Each builder has a subset of slaves attached, but we're still doing 6,922,612 checks to see if each slave that's configured for the builder is configured as a top-level slave. slavenames happens to be a list, so each check does a full scan of the list to see if the slave exists or not!
A very simple patch fixes this:
diff --git a/master/buildbot/master.py b/master/buildbot/master.py --- a/master/buildbot/master.py +++ b/master/buildbot/master.py @@ -761,19 +761,19 @@ class BuildMaster(service.MultiService): errmsg = "c['schedulers'] must be a list of Scheduler instances" assert isinstance(schedulers, (list, tuple)), errmsg for s in schedulers: assert interfaces.IScheduler(s, None), errmsg assert isinstance(status, (list, tuple)) for s in status: assert interfaces.IStatusReceiver(s, None) - slavenames = [s.slavename for s in slaves] + slavenames = set(s.slavename for s in slaves) # convert builders from objects to config dictionaries builders_dicts = [] for b in builders: if isinstance(b, BuilderConfig): builders_dicts.append(b.getConfigDict()) elif type(b) is dict: builders_dicts.append(b)
Now our checks are into a set instead of a list, which is an O(log(n)) operation instead of O(n). Let's re-run our profile with this patch:
File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py Function: loadConfig at line 621 Total time: 109.885 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 621 @profile 622 def loadConfig(self, f, check_synchronously_only=False): 623 """Internal function to load a specific configuration file. Any 624 errors in the file will be signalled by raising an exception. 625 626 If check_synchronously_only=True, I will return (with None) 627 synchronously, after checking the config file for sanity, or raise an 628 exception. I may also emit some DeprecationWarnings. 629 630 If check_synchronously_only=False, I will return a Deferred that 631 fires (with None) when the configuration changes have been completed. 632 This may involve a round-trip to each buildslave that was involved.""" 633 634 1 30 30.0 0.0 localDict = {'basedir': os.path.expanduser(self.basedir)} 635 1 13 13.0 0.0 try: 636 1 46268025 46268025.0 42.1 exec f in localDict 637 except: 638 log.msg("error while parsing config file") 639 raise <snip> 785 13303 56199 4.2 0.1 for b in builders: 786 13302 60438 4.5 0.1 if b.has_key('slavename') and b['slavename'] not in slavenames: 787 raise ValueError("builder %s uses undefined slave %s" \ 788 % (b['name'], b['slavename'])) 789 6935914 27430456 4.0 25.0 for n in b.get('slavenames', []): 790 6922612 29028880 4.2 26.4 if n not in slavenames: 791 raise ValueError("builder %s uses undefined slave %s" \ 792 % (b['name'], n)) 793 13302 1286628 96.7 1.2 if b['name'] in buildernames: 794 raise ValueError("duplicate builder name %s" 795 % b['name'])
We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!
I've landed resulting patch into our repo. I encourage you to upgrade!
Comments