About the FODS Architecture

Over the weekend starting on Friday, May 5, 2017, we deployed a significant upgrade to our architecture and we’d like to share some details.

In The Beginning

[Diagram: single-database architecture]

Above is a picture of our architecture before the weekend deployment.  We had four applications using the same database:

  1. FISbot — our Fetch, Import, SLOC bot
  2. Ohloh analysis — Project, account, organization, etc. analysis generation
  3. Ohloh UI — web and API application
  4. Ohloh Utility (cron jobs)

The database was a single PostgreSQL 9.6 database that was over 1.6 TB in size. With the delivery of the eFISbot features to support Fetch, Import and SLOC operations for the Knowledge Base team as well as our own Open Hub, we clearly saw that even a modest increase in eFISbot request processing impacted the database and resulted in poor performance for the web application. In brief, we couldn’t scale to support the anticipated load on eFISbot.

Current Architecture

In our plans for 2017, we committed to making the backend screamingly fast and talked about how we had gotten approval for new servers to support this. Starting at 8 PM EDT on Friday, May 5, we took a major step towards delivering on that commitment. We called it the “FIS Ohloh Database Split” (FODS).

[Diagram: split-database architecture]

We moved the four largest tables that are critical to Fetch, Import, and SLOC operations to a new FIS database and set up PostgreSQL Foreign Data Wrapper (FDW) to send data back and forth between the two. This moved the bulk of the 1.6 TB of the database over to the new (and powerful) servers, leaving only 65 GB on the original database servers.
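For context, here is a minimal sketch of what postgres_fdw wiring like this looks like on the Ohloh side; the server, host, role, and table names are illustrative, not our actual schema:

```sql
-- Hypothetical sketch of the postgres_fdw setup; names are illustrative.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Register the remote FIS database as a foreign server.
CREATE SERVER fis_db FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'fis-db-host', dbname 'fis', port '5432');

-- Map the local application role to a role on the FIS server.
CREATE USER MAPPING FOR ohloh_app SERVER fis_db
  OPTIONS (user 'fis_reader', password 'secret');

-- Re-expose the migrated tables locally as foreign tables
-- (IMPORT FOREIGN SCHEMA is available as of PostgreSQL 9.5).
IMPORT FOREIGN SCHEMA public LIMIT TO (commits, diffs)
  FROM SERVER fis_db INTO public;
```

With this in place, existing queries keep working unchanged; the trade-off is that every row they touch on those tables now travels over the network between the two servers.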

Not Yet Done

As is often the case with significant architectural upgrades, not everything worked smoothly out of the box. We are seeing two classes of problems. One is apparent when viewing Commit Summary pages for the largest projects, where queries are taking a massive amount of time. The other is the time it takes to execute project analysis jobs: analyze jobs that used to take a couple of minutes can now run for more than half a day. Normally, we complete an AnalyzeJob in a few minutes and can process between 600 and 1,000 jobs per hour, so this is causing a massive backlog of projects that are not getting analyzed.

On top of the long run durations, we are also seeing analyze jobs fail in their last step with a PostgreSQL serialization error, which means some analyze jobs have not been able to complete successfully at all. Right now (I just checked), we have over 131K AnalyzeJobs scheduled, with about 600 completed in the past few days and about 200 failed, 99% of them with the PostgreSQL serialization error, presumably related to our use of the FDW.
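Serialization failures are retryable by design: PostgreSQL raises them (SQLSTATE 40001) when concurrent transactions can't be ordered consistently, and the standard remedy is to re-run the transaction. A common mitigation, sketched here in plain Ruby (this is a hypothetical wrapper, not our actual AnalyzeJob code), is a retry loop with a brief backoff around the final transaction step:

```ruby
# Hypothetical stand-in for the error ActiveRecord/pg raises on
# SQLSTATE 40001 ("could not serialize access ...").
class SerializationFailure < StandardError; end

# Re-run the given block up to max_attempts times when it fails with a
# serialization error, backing off briefly between attempts.
def with_serialization_retry(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue SerializationFailure
    raise if attempts >= max_attempts
    sleep(0.05 * attempts) # brief backoff before re-running the transaction
    retry
  end
end
```

The whole transaction must be re-executed inside the block, not just the failing statement, since the error invalidates the transaction's snapshot.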

Both of these problems seem to be traceable to the FDW, though I’m not blaming the FDW itself; we are reasonably certain that we are not using it optimally. There were some obvious changes required by adopting the FDW, and we made those during our development and testing cycle. Then there were things we did not predict, or that behaved differently in production than they did in staging, even though we did a lot of work to simulate our production environment in staging. As is usually the case, some things are found only in production, and the two issues described above fall into that category.

Even with these issues, we feel the FODS deployment was a tremendous success because the vast majority of pages display at least as fast, plus we have tremendous capacity to grow the eFISbot service.

Here’s what we’re doing about it

For the project analysis jobs, we examined the issue from a number of perspectives and identified a few tables that we could migrate from the OH DB to the FIS DB. Initial tests show that analyze queries that took 12,000 ms are now running in 1,600 ms, almost 8 times faster.

For the Commit pages, we are working with the SecondBase gem to allow the Ohloh UI to directly access the FIS DB for the data stored there rather than push massive queries through the FDW. Initial tests show that this also results in multiples of better performance, although we’re still gathering the numbers.
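As a rough sketch of the SecondBase approach (class, table, and host names here are hypothetical, and the exact configuration keys depend on the gem version), the gem adds a second set of database entries and a parallel base class:

```ruby
# Hypothetical sketch of a SecondBase setup; names are illustrative.
# config/database.yml gains a parallel set of entries for the FIS DB:
#
#   secondbase:
#     production:
#       adapter:  postgresql
#       database: fis
#       host:     fis-db-host
#
# Models whose tables now live in the FIS DB inherit from
# SecondBase::Base instead of ActiveRecord::Base, so their queries run
# over a direct connection rather than being pushed through the FDW:
class FisCommit < SecondBase::Base
  self.table_name = 'commits'
end
```

The appeal of this design is that the remote PostgreSQL server plans and executes these queries locally against its own indexes, instead of the FDW shipping intermediate rows between servers.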

While the use of a direct connection to the FIS DB will improve performance on the vast majority of Commit pages, the largest projects still represent a special problem. Right now we have just over 676K projects.  Only 3 of them have more than 1,000 enlistments — Fedora Packages, KDE, and Debian. All three of these are Linux distributions. We briefly mentioned distributions in our post about 2016 plans and now is the time to implement them. The plan is to create a new entity called a “Distribution,” which represents a collection of Open Source Software projects.  This is different from an Organization because the Distribution represents a packaged and related collection of OSS projects. By doing this, we can process each of the projects within the Distribution individually and then aggregate the analysis results for the Distribution.

The way this would work is that, in the case of Fedora Packages with its 11,956 enlistments, we would create a project for each enlistment and then group all those new projects into the Distribution. When looking at the Distribution, we can provide the list of projects, links to each of them, plus aggregate the data from those projects in a new “Distribution Analysis.” Most importantly, when displaying the Distribution, we won’t have to try to aggregate the commits from all 12K enlistments into a single view.
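The roll-up idea can be sketched in plain Ruby (the struct and field names are hypothetical, not our schema): each member project is analyzed on its own, and the Distribution Analysis simply aggregates the per-project results rather than re-walking the commits of 12K enlistments:

```ruby
# Hypothetical per-project analysis result; field names are illustrative.
ProjectAnalysis = Struct.new(:name, :commit_count, :contributor_ids)

# Build a "Distribution Analysis" by rolling up the already-computed
# per-project analyses instead of aggregating raw commit data.
def distribution_analysis(analyses)
  {
    project_count: analyses.size,
    total_commits: analyses.sum(&:commit_count),
    contributors:  analyses.flat_map(&:contributor_ids).uniq.size
  }
end
```

Because each project's analysis is computed independently, a change in one package only requires re-analyzing that package and recomputing the cheap roll-up, not the whole Distribution.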

Next Steps

We are working quickly on testing and verifying behavior using the new distribution of tables and the second DB connection. We hope to have improvements deployed within a week.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor
  • Joël Krähemann

    Hi I can’t recover my account. So you should check your spool.

    • Hi Joël; You’re not the first to report difficulty with this. We’re working with IT to see if we can’t quickly address the issue.

  • Hi Peter!

    Great work there. I hope you can track down what slowed the project analysis to a crawl, as I’d love to see our only-9-days-old data being re-analyzed. But even if it takes a while, the site is fast and that’s very valuable.

    Keep it up!

    • I see projects are updating quickly all over so it is working! Awesome.

  • This is something worth sharing. I am amazed to see how well you organized this post on FODS Architecture

  • Alexander Opitz

    All nice posts about what is done … but at the moment I don’t see any progress.
    Code of projects does not get fetched (and so not analyzed), and the Forum is also unusable. Especially https://www.openhub.net/forums/10 … the posts are sorted by what, last access? No chance to find my posts, as the internal search does not really work. Also, it lacks response (maybe nobody sees the new entries).
    I know this is all unpaid work, but from my POV I don’t see any progress. 🙁

    • Yeah, the same here – things seem to have slowed to a crawl. Software, hardware issues?

    • Michał Walenciak

      @peterdegenportnoy:disqus I have the same observations, Peter. Could you elaborate some more?

      • We are working through a set of complicated factors to evaluate and address each one in order to determine if that factor is contributing to this slow updating.

        The factors include:
        * Ensuring cross database queries are fast and efficient for searching projects and code locations
        * Ensuring cross database structures are fast and efficient for inserting and updating new records
        * Finding projects and code locations that need to be updated
        * Filtering out code locations that cannot be updated due to unrecoverable failures (like Google Code no longer existing)

        I believe that there is no single smoking gun causing this issue, but rather that we have done so much work over the past 9 months that we have to do a careful tracing of this scheduling and execution logic from end to end. And that’s what we’re doing. I have the strong feeling that we are approaching the end of the research and work, but it is very difficult to make a concrete prediction.

    • Let’s be grateful.
      They put in the work. I hope they get a crazy scalable, streamlined system.
      Then other improvements can get addressed.

      With projects that improve our community:
      Don’t ask: “When!”. Ask: “How I can help”.

      • Alexander Opitz

        I didn’t ask “When”; I only pointed out my feeling that nothing works and there is no progress to see.

        And please don’t tell an open source developer
        >> Don’t ask: “When!”. Ask: “How I can help”.<<
        that's stupid.

  • Yes, as a user I also wondered what “Fedora Packages”, KDE, and AUR packages actually mean.
    Each is a collection of package metadata repos, or perhaps something more (part of a distribution’s core architecture).
    Those projects have 300–1,000 enlistments, so I can’t look through the repos to find out and be sure I’ve caught everything.

    But that’s a more minor question; you are right to focus on making everything work. Categorizing and sorting is a lower priority, but very much needed on the Hub.

    “Distributions” is a great idea.

    “KDE” is, technically, a community, but it is also a collection.
    If you categorize Distributions, one important category stands out: stock “Desktop Environment” collections.

    They are all big collections, and communities, of smaller projects: KDE, GNOME, MATE, LXQt, Xfce, Budgie, stock Android, Cinnamon, Deepin, Enlightenment, Pantheon, Trinity, Unity.

    And that topic is hotter in the Linux community now than ever.

    Unity is shutting down; GNOME is in the centre while KDE is ignored again; many projects/DEs are slowly drifting to Qt; MATE, Pantheon, and Budgie have risen; and the transition to Wayland is happening.

  • I feel that some great progress has been made against the backlog. More and more projects have up-to-date stats (compared to a few months ago)

    Congrats!

    • Things are indeed starting to get back into the range we are targeting. Thanks so much for taking some time to post this!