Hi Folks, here is an update on what we’ve been working on and where things stand.
The development team is working on Platform Upgrade: Ruby and Rails (Project PURR), which we mentioned in the last post, and the project is going very well. We are about one-third of the way through and on track for an on-schedule completion.
Folks have been contacting us about projects that were very far out of date. At first, we were getting only an occasional request, so we would quickly reschedule the project analysis and get on with other things. However, after more than a few requests, we took another look.
It turns out that back in November, we had flagged a failure caused by a bug in the code that raised an exception during analyses. Only a few projects were affected at the time, so we created a new failure group to monitor the situation. The number of affected projects in this failure group rapidly blossomed to 120,000. Because the failure was due to a deployment problem that had been quickly resolved, we didn’t go back to revisit it. The affected projects should have had their analyses rescheduled, and all would have been well, except that the failure group had been created with the “do not retry” switch set. So these 120,000 projects sat in permanent failure without ever being rescheduled.
In the interim, we started to experience heavy database load, and project analyses were taking up to 8 hours each to complete. Loading up 18 crawlers with 4 analyses each was blocking all other jobs and would have taken over 800 days to analyze the approximately 250,000 projects with activity. Worse, these 72 concurrent analyses were using all available memory on the database server, so other processes trying to run Fetch, Import, or SLOC jobs, or even just web requests running queries, were getting database out-of-memory errors, which typically killed the processes. Analyses were falling behind, with the oldest repositories approaching 30+ days old when they should have been at most 3 days old.
We looked at our database configuration and found that autovacuum had been disabled at some point in the past. That was the clearest starting point: we re-enabled autovacuum and began vacuuming and analyzing the tables. This improved performance enough that we could increase the number of jobs the crawlers ran until the repositories were all up to date.
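For reference, this recovery step in PostgreSQL looks roughly like the following (the table name below is illustrative, not one of our actual tables):

```sql
-- In postgresql.conf, re-enable the autovacuum daemon
-- (requires a configuration reload to take effect):
--   autovacuum = on

-- Confirm the current setting from a session:
SHOW autovacuum;

-- Manually vacuum dead tuples and refresh planner statistics
-- for a heavily used table:
VACUUM ANALYZE analyses;
```

With statistics refreshed, the planner stops working from stale row counts, which is what lets the crawlers safely run more concurrent jobs.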
Except, that is, for the projects in that failure group, whose repositories were being filtered out of these results.
During this catch-up period, we were focused on the number of jobs in the backlog and the age of the oldest repositories. As the system caught up, we throttled back the crawlers because the database load was again becoming serious enough to impact the web site; folks started to complain that pages were timing out with 504 errors. So we went back to the database and looked at logs, EXPLAIN ANALYZE plans, query structure, dead tuples, vacuum and analyze times, et cetera. There were two areas of concern. First, even with autovacuum re-enabled, the database just wasn’t keeping up with autovacuuming the tables. Second, the query planner’s estimates were off by orders of magnitude for some queries.
To address the first, we took down the crawlers and manually ran “vacuum analyze” on each of the critical tables. Stopping the crawlers was necessary because the vacuum process couldn’t get enough uninterrupted access to the tables to begin processing, which says something about how heavily these tables are used. We then adjusted the ‘work_mem’ value to be in line with what “explain analyze” was reporting for our longest-running queries. For good measure, we also lowered the threshold at which autovacuum is triggered, so that it runs more frequently for shorter periods of time. We’re still monitoring those changes and expect to perform more tuning. The upshot is that with up-to-date vacuum and analyze jobs, the query planner is doing a better job of estimating the queries, and the database is allocating a more reasonable amount of memory for each query node.
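The right values depend entirely on workload, but the kinds of knobs involved look like this (the numbers and table name here are illustrative, not our production settings):

```sql
-- Give each sort/hash node more memory so large queries spill to
-- disk less often. Note this is allocated per query node, so it
-- must be sized with concurrency in mind:
SET work_mem = '64MB';

-- Lower the fraction of dead tuples needed to trigger autovacuum
-- on a large, busy table (the PostgreSQL default is 0.2, i.e. 20%):
ALTER TABLE analyses SET (autovacuum_vacuum_scale_factor = 0.02);

-- Monitor dead-tuple counts and when each table was last vacuumed
-- and analyzed:
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC;
```

Lowering the scale factor is what trades a few long vacuum runs for many short ones, which is much friendlier to tables under constant write load.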
The result of this work is that the queries for the commits pages dropped from over 92,000 ms to around 2,000 ms, and the contributions queries dropped from over 63,000 ms to under 10,000 ms. Still way too long, but fast enough to stop the 504 problems. Plus, the longest-running Analyze jobs now finish in under 2 hours instead of over 8.
Now we return to that failure group. A little over a week ago, we started re-processing those 120,000 projects, 20,000 at a time. Each block of rescheduled projects loads up the crawlers and takes a while to work through. When the analyze jobs are rescheduled, we see the evidence in the growth of the job backlog and the age of the oldest repositories. When the backlog comes back down and the oldest repositories are again updated within 3 days, we schedule the next block of 20,000. We are about halfway through this process and should be done by next week.
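The batching itself is the simple part. A minimal Ruby sketch of the approach (the method name and the rescheduling call in the comment are illustrative, not our actual code):

```ruby
BLOCK_SIZE = 20_000

# Split the failed project ids into blocks of 20,000. Each block is
# rescheduled only after the previous block has drained from the job
# backlog, so the crawlers are never flooded all at once.
def reschedule_in_blocks(project_ids, block_size = BLOCK_SIZE)
  project_ids.each_slice(block_size).map do |block|
    # In the real system, each id in the block would get its analysis
    # rescheduled here, e.g. Project.find(id).reschedule_analysis!
    block.size
  end
end

# 120,000 failed projects yields six blocks of 20,000 each:
reschedule_in_blocks((1..120_000).to_a)
# => [20000, 20000, 20000, 20000, 20000, 20000]
```

Waiting for the backlog to drain between blocks is what keeps this replay from recreating the original database-overload problem.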
We will also have to continue halting the crawlers and manually performing ‘vacuum analyze’ on the key analysis tables to ensure they remain performant. The correct and intended solution is to separate the analytics database from the web application database. The current work in Project PURR is helping us toward this goal by identifying which classes and database tables are required only by the web application, which are shared by the web application and the analytics engine, and which are used only by the analytics engine. Our goal is to complete PURR and the database refactoring in 2015 while also adding some nice new features to the Open Hub.
Thank you to everyone who contacted us through email, the forums, Twitter, smoke signals, and telepathy. And thank you again for your support and continued patience as we take some time to re-jigger the Open Hub plumbing. Most of all, thank you for being part of the Open Source community.