We’ve had a number of questions about why project analyses have gotten so far out of date.
As mentioned in the last blog post, we upgraded PostgreSQL from 9.2 to 9.4, which caused some significant performance degradation. Not because of any problem with PostgreSQL 9.4! Not at all. The upgrade process performs a series of incremental analyze steps so that the new query planner has suitable table statistics to work with. Even with this approach, our fairly massive critical tables are too large for these small, incremental analyze steps to sample adequately and generate accurate table statistics.
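For readers curious what those incremental steps look like: pg_upgrade's generated post-upgrade script uses `vacuumdb`'s staged analyze, which produces crude statistics quickly and then refines them. A rough sketch (run as the database superuser against the new cluster):

```shell
# Staged analyze: quick, coarse statistics first, then progressively
# more accurate passes over every database in the cluster.
vacuumdb --all --analyze-in-stages

# Roughly equivalent to three analyze passes with increasing targets:
#   default_statistics_target = 1   (fastest, crudest)
#   default_statistics_target = 10
#   default_statistics_target = <configured default, typically 100>
```

The idea is that the planner gets *some* statistics almost immediately; the catch, as we found, is that the early coarse passes can leave very large tables with misleading statistics until the final pass completes.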
After about a week of running on the upgraded version and trying our standard tools for maintaining performance targets, it was pretty clear that we needed to bite the bullet and perform “vacuum full” on those critical tables. This has been done, and we saw a sizable decrease in response time (which is what we wanted).
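For the curious, the operation itself is simple to issue but disruptive to run, which is why we held off as long as we did. A sketch (database and table names here are illustrative, not our actual schema):

```shell
# VACUUM FULL rewrites the entire table into a new file and takes an
# ACCESS EXCLUSIVE lock, blocking all reads and writes for the duration.
# Adding ANALYZE regenerates fresh statistics in the same pass.
psql -d production -c 'VACUUM FULL ANALYZE projects;'
psql -d production -c 'VACUUM FULL ANALYZE analyses;'
```

Because of that exclusive lock, this is effectively scheduled downtime for each table, one reason “bite the bullet” is the right phrase for it.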
Shortly after addressing the table statistics issue, we had to push an update to our crawlers. We’ve not done this in quite a while. We had updated the Ruby version running the old Ohloh code set from 1.9.3 to 2.0.0, which is the highest Ruby version we can use on our old Rails 2.3.18 code base. But we only did that on the web heads. There was a good deal more complexity in the crawlers that made us uncomfortable doing that upgrade on all 18 crawlers.
But we had to. We’ve made some database schema improvements and the crawler code had to be brought back into alignment. We upgraded the Ruby version on all the crawlers and then pushed the latest version of code to them all. Everything looked OK and so we started the job scheduler and saw everything come up and start processing. Whoot!
But a day later, we noticed that a massive number of CompleteJobs — these perform Fetch, Import, and SLOC in order — were failing because they were being killed by the host. No other information. This led us into over a week of checking logs, monitoring processes, instrumenting the code, etc. We were right to be concerned about updating Ruby on the crawlers. 🙂
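When a host kills a process and leaves no trail in the application logs, the kernel's OOM killer is a usual first suspect. The post doesn't name our eventual root cause, so take this purely as the kind of generic triage that “killed by the host, no other information” sends you off to do:

```shell
# Check the kernel ring buffer for OOM-killer activity (human-readable
# timestamps with -T, where supported).
dmesg -T | grep -i -E 'out of memory|killed process'

# Watch resident memory of the crawler's Ruby processes, largest first
# (the process name is illustrative).
ps -o pid,rss,etime,cmd -C ruby --sort=-rss | head
```

Cross-referencing OOM timestamps against the job scheduler's records of which CompleteJobs died is the sort of correlation work that eats up a week like the one described above.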
We found the root cause recently and quickly deployed a fix.
So, we’re up and running on our upgraded database with our crawlers now in sync with our updated database schema and the CompleteJobs running as expected. Unfortunately, these challenges have built up a sizable backlog of jobs. We apologize for this delay; it only strengthens our resolve to separate the analytics processing database from the front end web application database and ensure each is optimized to perform its most important tasks independently.