GitHub, Performance, and Crawlers (Oh My!)

GitHub

We are very pleased to announce that new and existing users can verify their accounts with their GitHub login as an alternative to using their SMS number.  Many thanks to all those Open Hubbites who contacted us to let us know their concerns and preferences surrounding SMS validation.  About a third of the people who initially told us they would not use their SMS number for validation changed their minds when they learned that we do not store their phone number and will never use it to contact them, but many others were simply uncomfortable, unwilling, or unable to use SMS to validate their account. Given that feedback, we are glad that GitHub login emerged as a clear alternative to SMS validation.
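For readers curious what "verify with your GitHub login" typically involves under the hood, here is a minimal sketch of the standard GitHub OAuth web flow in Python. It is illustrative only: the OAuth endpoints are GitHub's documented ones, but the client ID, secret, redirect URI, and session handling are assumptions, not how Open Hub actually wires this up.

```python
# Minimal sketch of the standard GitHub OAuth web flow (illustrative only;
# client_id, client_secret, and redirect_uri are hypothetical placeholders).
import requests

CLIENT_ID = "your-oauth-app-client-id"          # hypothetical
CLIENT_SECRET = "your-oauth-app-client-secret"  # hypothetical
REDIRECT_URI = "https://example.com/callback"   # hypothetical

def authorize_url(state: str) -> str:
    """Step 1: send the user to GitHub to approve the application."""
    return ("https://github.com/login/oauth/authorize"
            f"?client_id={CLIENT_ID}&redirect_uri={REDIRECT_URI}&state={state}")

def exchange_code_for_login(code: str) -> str:
    """Step 2: trade the callback code for a token, then look up the login.
    A verified GitHub login can then stand in for SMS validation."""
    token_resp = requests.post(
        "https://github.com/login/oauth/access_token",
        data={"client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "code": code},
        headers={"Accept": "application/json"},
    )
    token = token_resp.json()["access_token"]
    user_resp = requests.get(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
    )
    return user_resp.json()["login"]
```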

We also adjusted the new account sign-up workflow to ensure that, in addition to an authoritative validation such as SMS or GitHub, we check for a valid email address before allowing any edits from a user's account.  We are very pleased that although some users continue to create accounts that violate our Terms of Use, the number of users who choose to do so is small enough for us to review and address account violations promptly and efficiently.
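As a rough illustration of that gating logic (not the actual Open Hub implementation; the field and function names below are made up), the idea is simply that edit rights require both an authoritative validation and a confirmed email address:

```python
from dataclasses import dataclass

@dataclass
class Account:
    email_confirmed: bool = False   # user followed the email confirmation link
    sms_validated: bool = False     # validated via SMS (number never stored)
    github_validated: bool = False  # validated via GitHub login

def can_edit(account: Account) -> bool:
    """Edits are allowed only after an authoritative validation
    (SMS or GitHub) *and* a confirmed email address."""
    authoritative = account.sms_validated or account.github_validated
    return authoritative and account.email_confirmed
```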

Performance

On the performance front, we added more focused caching to the new UI code.  This brought the People Index page response time from 18-60 seconds down to under 1 second and the Explore Projects page from over 100 seconds down to less than one, and it cut the response time for all widget images roughly in half, to typically 1.5 seconds.
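The kind of caching involved is conceptually simple. Here is a minimal, purely hypothetical sketch (not the actual Ohloh UI code) of caching an expensive index-page result for a few minutes instead of recomputing it on every request:

```python
import time

_cache = {}  # key -> (expires_at, value)

def cached(key, ttl_seconds, compute):
    """Return a cached value if it is still fresh; otherwise recompute it.
    'compute' stands in for an expensive operation such as building the
    People Index or Explore Projects page."""
    now = time.time()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]
    value = compute()
    _cache[key] = (now + ttl_seconds, value)
    return value

# Hypothetical usage: serve the People Index from cache for 10 minutes.
# people_page = cached("people_index:page1", 600, build_people_index_page)
```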

We continue to focus on improving our infrastructure.  Recently we upgraded our production database from PostgreSQL 9.2 to 9.4.  The upgrade itself went smoothly: in less than an hour, we had redirected all traffic to a maintenance notice, cleaned up all executing jobs, performed the upgrade, and restored service to the site.
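For those who like specifics, the core of that upgrade step is PostgreSQL's own pg_upgrade tool. A hypothetical invocation (the paths below are assumptions, not our actual layout) looks roughly like this:

```python
import subprocess

# Hypothetical paths; the real upgrade moved our 9.2 cluster to 9.4
# during the outage window described above.
subprocess.run([
    "pg_upgrade",
    "-b", "/usr/lib/postgresql/9.2/bin",    # old server binaries
    "-B", "/usr/lib/postgresql/9.4/bin",    # new server binaries
    "-d", "/var/lib/postgresql/9.2/main",   # old data directory
    "-D", "/var/lib/postgresql/9.4/main",   # new data directory
    "--link",                               # hard-link files instead of copying
], check=True)
```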

Unfortunately, within the next few days it became clear that query performance had taken a significant nosedive. The pg_upgrade process includes table analyze steps that rebuild planner statistics, but given the size of several of our central tables, those steps were not enough to restore optimal query plans.  So we took the significant step of shutting the site down again, in the middle of the night on a weekend, and ran "vacuum full" on those central tables. This dropped query times from the order of a minute or more back down to fractions of a second, and brought the server response time down from minutes to 1-3 seconds. Lore has it that the database we inherited with Ohloh when Black Duck acquired the site in 2010 had never been shut down for a "vacuum full" on any table, so this was a pretty big deal. There was much quiet rejoicing.  Engineering Dances of Victory (EDoV).

For those gentle readers who do not speak database geek, the "vacuum full" operation rewrites a table into a fresh, compact copy, reclaiming dead space and rebuilding its indexes; combined with freshly gathered statistics, this lets the query planner accurately estimate the best way to get the requested data out of the table.  It is very important, and it can be very challenging to do on a regular basis, especially in a production environment with very large tables, which, ahem, we have.
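In concrete terms, the maintenance amounts to statements like the following. This is a hedged sketch using psycopg2 with hypothetical table names, not our actual maintenance script; note that VACUUM cannot run inside a transaction block, hence the autocommit setting.

```python
import psycopg2

# Hypothetical connection string and table names; the real maintenance ran
# against our production database during a scheduled outage.
conn = psycopg2.connect("dbname=ohloh")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

central_tables = ["analyses", "commits", "projects"]  # hypothetical names

with conn.cursor() as cur:
    for table in central_tables:
        # Rewrite the table into a compact new copy and rebuild its indexes...
        cur.execute(f"VACUUM FULL {table};")
        # ...then refresh planner statistics so queries are planned well.
        cur.execute(f"ANALYZE {table};")
```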

The restoration of server performance to our current target of under 1.2 seconds is a good thing.  It gives us confidence that we have successfully addressed the significant performance problems we have seen since deploying the new Ohloh UI, the schema and database upgrades, and the updates to our crawlers.  With the new, clean code base of the new UI now responding reliably, we will focus on the other long pole in our performance tent: the crawlers and analytics processing.

Crawlers

We’ve written about the crawlers several times over the past year or so.  In New UI, New Account Creation Mechanism, Project Updates we described what happened when we catastrophically lost one of the crawlers, and in Progress report: Catching up on outdated analyses we covered some of the complexity of balancing the database configuration between optimal website and crawler usage. The bottom line is that the crawlers are a risk to the website, so we’ve been planning and working on how to address this problem.

First, we need to get off of the aging crawlers.  Then we need to separate the analytics database from the web application database.

For the first part, we are going to virtualize our crawlers and centralize local repository storage on a SAN, which will reduce our storage needs from 27TB down to 8TB. To do this, we will separate the “Fetch, Import, SLOC” (FIS) part of our work from the “Project, Account, Organization Analysis” part.  We’ve built a FISbot, which we can later swap out for a centralized data collection service being built by another team within Black Duck.
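To make that split a little more concrete, here is a purely illustrative sketch of the idea: the analysis side depends only on a narrow "Fetch, Import, SLOC" interface, so its backing implementation can change without touching the analysis code. The names and methods below are hypothetical, not the actual FISbot API.

```python
from typing import Protocol

class CodeDataSource(Protocol):
    """Narrow interface the analysis side depends on. Today it could be
    backed by FISbot; later, by the centralized data collection service,
    without changing the analysis code."""
    def fetch(self, repo_url: str) -> str: ...            # clone or update a repository
    def import_commits(self, repo_path: str) -> list: ... # parse the commit history
    def count_sloc(self, repo_path: str) -> dict: ...     # lines of code by language

def analyze_project(source: CodeDataSource, repo_url: str) -> dict:
    """Project analysis consumes data only through the interface above."""
    path = source.fetch(repo_url)
    commits = source.import_commits(path)
    sloc = source.count_sloc(path)
    return {"commit_count": len(commits), "sloc": sloc}
```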

For the second part, we will first run the existing Ohloh code on VMs, consuming data from FISbot, while we continue the in-progress analysis of our analytics needs and design a database schema optimized for analysis.  We will build a lightweight data publisher to publish the generated analysis to the web application database.  For the time being, the web application database schema will remain unchanged.  This isn’t too much of a problem because we know the website performs almost acceptably (600-800 ms response time) when there is no crawler activity.  After we’ve addressed the crawler issues, we can implement improvements to the web application database to bring server response time down to the sub-500 ms target we’ve set.
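Conceptually, the "lightweight data publisher" is just a copy step from the analytics database into the unchanged web application schema. Here is a hypothetical sketch; the database names, tables, and columns are all assumptions for illustration.

```python
import psycopg2

# Hypothetical connection strings; in practice these would point at the
# analytics database and the web application database respectively.
analytics = psycopg2.connect("dbname=analytics")
webapp = psycopg2.connect("dbname=webapp")

def publish_latest_analyses():
    """Copy freshly generated analyses into the web application database.
    The web application schema stays unchanged; only the producer moves."""
    with analytics.cursor() as src, webapp.cursor() as dst:
        src.execute("SELECT project_id, analysis_json FROM latest_analyses;")
        for project_id, analysis_json in src:
            dst.execute(
                "UPDATE analyses SET data = %s WHERE project_id = %s;",
                (analysis_json, project_id),
            )
    webapp.commit()
```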

Finally

As a final note, we would like to announce our intention to release our newly developed Ohloh UI code as open source.  We are currently reviewing the code and have work to do with our Black Duck colleagues to make sure all the i’s are dotted and t’s are crossed.  Given the urgent and important work we have to do on the crawlers and analytics engine, this task is waiting patiently behind the new features we have promised to Product Management, but it is right behind them.

As always, we are grateful to the Open Hub community and continue to beg your indulgence as we work fervently to provide you with consistently updated project and account analysis in a highly responsive manner.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor.