Over the weekend starting on Friday, May 5, 2017, we deployed a significant upgrade to our architecture and we’d like to share some details.
In The Beginning
Above is a picture of our architecture before the weekend deployment. We had four applications using the same database:
- FISbot — our Fetch, Import, SLOC bot
- Ohloh analysis — Project, account, organization, etc. analysis generation
- Ohloh UI — web and API application
- Ohloh Utility (cron jobs)
The database was a single PostgreSQL 9.6 database that was over 1.6 TB in size. With the delivery of the eFISbot features to support Fetch, Import and SLOC operations for the Knowledge Base team as well as our own Open Hub, we clearly saw that even a modest increase in eFISbot request processing impacted the database and resulted in poor performance for the web application. In brief, we couldn’t scale to support the anticipated load on eFISbot.
In our plans for 2017, we committed to making the backend screamingly fast and talked about how we gotten approval for new servers to support this. Starting at 8 PM EDT on Friday, May 5 we took a major step towards delivering on that commitment. We called it the “FIS Ohloh Database Split” (FODS).
We moved the four largest tables that are critical to Fetch, Import, and SLOC operations to a new FIS database and set up PostgreSQL Foreign Data Wrapper (FDW) to send data back and forth between the two. This moved the bulk of the 1.6 TB of the database over to the new (and powerful) servers, leaving only 65 GB on the original database servers.
Not Yet Done
As is often the case in significant architectural upgrades, not everything worked smoothly out of the box. We are seeing two classes of problems. One is apparent when viewing Commit Summary pages for the largest projects. We’re seeing queries taking a massive amount of time. The other is the time it takes to execute project analysis jobs: analyze jobs that used to take a couple of minutes can run for more than half a day. Obviously, this is causing a massive backlog of projects that are not getting analyzed. Normally, we complete an AnalyzeJob in a few minutes and can process between 600 and 1000 jobs per hour.
Part of the analyze job run duration, we are also seeing analyze jobs fail in the last step of the analyze job with a PostgreSQL Serialization error. This means that there are analyze jobs that have not been able to complete successfully. Right now (I just checked), we have over 131K AnalyzeJobs scheduled, with about 600 completed in the past few days and about 200 that have failed with 99% of them failing with the PostgreSQL Serialization error, presumably related to our use of the FDW.
Both of these seem to be traceable to the FDW. I’m not blaming the FDW for anything. We are reasonably certain that we are not using the FDW optimally. There were some obvious changes that were needed by adopting FDW and we did those during our development and testing cycle. Then there were things that we did not predict or behave differently in production than they did in staging, even though we did a lot of work to simulate our production environment in staging. But as is usually the case, there are some new things that are found only in production. The two cases described above fall into that category.
Even with these issues, we feel the FODS deployment was a tremendous success because the vast majority of pages display at least as fast, plus we have tremendous capacity to grow the eFISbot service.
Here’s what we doing about it
For the project analysis jobs, we examined the issue from a number of perspectives and identified a few tables that we could migrate from the OH DB to the FIS DB. Initial tests show that Analyze queries that took 12K ms to run are now running in 1.6K ms, almost 8 times faster.
For the Commit pages, we are working with the SecondBase gem to allow the Ohloh UI to directly access the FIS DB for the data stored there rather than push massive queries through the FDW. Initial tests show that this also results in multiples of better performance, although we’re still gathering the numbers.
While the use of a direct connection to the FIS DB will improve performance on the vast majority of Commit pages, the largest projects still represent a special problem. Right now we have just over 676K projects. Only 3 of them have more than 1,000 enlistments — Fedora Packages, KDE, and Debian. All three of these are Linux distributions. We briefly mentioned distributions in our post about 2016 plans and now is the time to implement them. The plan is to create a new entity called a “Distribution,” which represents a collection of Open Source Software projects. This is different from an Organization because the Distribution represents a packaged and related collection of OSS projects. By doing this, we can process each of the projects within the Distribution individually and then aggregate the analysis results for the Distribution.
The way this would work would be that, in the case of Fedora packages with 11,956 enlistments, we would create a project for each of the enlistments and then group all those new projects into the Distribution. When looking at the Distribution, we can provide the list of projects, links to each of them, plus aggregate the data from those projects with a new “Distribution Analysis”. Most importantly, when displaying the Distribution, we won’t have to try to aggregate the commits from all 12K enlistments into a single view.
We are working quickly on testing and verifying behavior using the new distribution of tables and the second DB connection. We hope to have improvements deployed within a week.