Upgrading Crawlers

Hail Hubbites!

At This Point

Here is where things stand: the Open Hub is running on Ruby 1.9.3 on Ubuntu 14 (yay!), the database is running on a new RAID array made of disks supported by our hardware and rated for the use we are giving it (yay!), and we are making progress on our next major release, Orgs Phase 2 (yay!).  We did a release on Thursday, August 7 that broke some images, namely sparklines and animated GIFs.  We deployed another release on Monday, August 11, but fixing the sparklines necessitated backing out a fix to an encoding-based search defect, so we’re working on addressing that defect in a different way.

At the heart of the complexities are a few key factors.  The largest is that we are struggling under heavy technical debt: Rails is now out in version 4.1.0 and we are running Rails 2.3.18, and while our web servers have recently been updated to Ubuntu 14 and Ruby 1.9.3, our crawlers are still on CentOS with REE (Ruby Enterprise Edition) 1.8.7.  Additionally, our large database is encoded as SQL-ASCII, which has been causing huge complications as we move towards the latest Ruby and Rails.  Just search for “incompatible character encoding Rails”.
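For readers who have not run into this class of error, here is a minimal sketch of how it can arise in Ruby 1.9 and later. It is an illustration, not code from the Open Hub: the sample name, the label, and the helper names `naive_join` and `tag_as_utf8` are all hypothetical. The assumption is that bytes read from a database with no declared encoding arrive tagged as binary, and are then combined with a UTF-8 string.

```ruby
# Sketch only: the sample data and helper names are hypothetical.

# Bytes as a driver might hand them over from a database with no declared
# encoding: Ruby tags them as binary (ASCII-8BIT).
RAW_NAME = "Grubbstr\xC3\xB6m".dup.force_encoding(Encoding::BINARY)

# A UTF-8 string that contains a non-ASCII character (U+2192, an arrow).
LABEL = "contributor \u2192 "

# Plain concatenation. Ruby raises Encoding::CompatibilityError when both
# strings contain non-ASCII bytes and carry different encodings.
def naive_join(label, raw)
  label + raw
end

# Re-tag the bytes as UTF-8 (no transcoding) and return the copy only if
# the bytes really are valid UTF-8; otherwise return nil.
def tag_as_utf8(raw)
  tagged = raw.dup.force_encoding(Encoding::UTF_8)
  tagged.valid_encoding? ? tagged : nil
end

begin
  naive_join(LABEL, RAW_NAME)
rescue Encoding::CompatibilityError => e
  puts "naive join failed: #{e.message}"
end

name = tag_as_utf8(RAW_NAME)
puts naive_join(LABEL, name) if name
```

Re-tagging only helps when the stored bytes happen to be valid UTF-8. Data written in other encodings over the years would need real transcoding, which is part of why an untyped database encoding is so troublesome.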

Up Next

Next up in the upgrade plan is to upgrade our crawlers. Orgs Phase 2, which will take the Organization feature out of Beta, has a new type of analysis to pre-calculate stats about orgs on the Open Hub.  This analysis will run on the crawlers, but only if they are running Ruby 1.9.3.  So, the crawlers need to be updated.

To update the crawlers, our plan is to turn off ALL the crawlers, replace the OS on half of the 18 crawlers, and bring everything back up.  We will then run with half the crawlers while we install libraries and the application on the new OSes.  Why do it this way?  Because any crawler can run an analysis that may update a repository on any two other crawlers.  The repositories are stored on their own large-capacity drives, which means we can install a new OS, bring up the server, and mount the storage drive, and it will be immediately accessible to other servers performing analysis.

Then, we do it again, but with the remaining crawlers.
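The two passes can be sketched as an ordered list of steps. This is only an illustration of the plan as described in this post, not our actual tooling: the host names, the step names, and the helpers `upgrade_batches` and `upgrade_plan` are made up for the example.

```ruby
# Sketch only: host names, step names, and helpers are hypothetical.
CRAWLERS = (1..18).map { |n| format("crawler-%02d", n) }.freeze

# Split the fleet into two batches of (nearly) equal size.
def upgrade_batches(hosts)
  return [] if hosts.empty?
  hosts.each_slice((hosts.size / 2.0).ceil).to_a
end

# For each batch: stop everything, replace the OS on the batch, bring all
# servers back up with their repository drives mounted, then let the other
# half run analysis while the batch gets its libraries and application.
def upgrade_plan(hosts)
  upgrade_batches(hosts).flat_map do |batch|
    [
      { step: :stop_analysis,        hosts: hosts },
      { step: :replace_os,           hosts: batch },
      { step: :boot_and_mount_repos, hosts: hosts },
      { step: :run_analysis,         hosts: hosts - batch },
      { step: :install_application,  hosts: batch },
      { step: :verify_job_scheduler, hosts: batch }
    ]
  end
end

upgrade_plan(CRAWLERS).each do |entry|
  puts format("%-22s %2d host(s)", entry[:step], entry[:hosts].size)
end
```

The point of the ordering is that every repository drive is mounted again before analysis resumes, so the half that is running can still reach repositories stored on the half that is being rebuilt.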


Our analysis lag time is typically 3 days.  This means that within a 3-day period, all active projects are checked and their analysis updated.  Right now, the median analysis age is 15 days and 10% of analyses are older than 26 days.  Ow.

We could wait until most of the analysis has caught up and then do the OS updates.  However, then we cannot deploy Orgs Phase 2, which is nearing its Ready To Ship point.  Additionally, there should be some performance improvement by switching to Ruby 1.9.3.

It should take less than an hour to update the OS on half of the crawlers (it can be done in parallel).  During that time no analysis will be done.  Then we’ll install the libraries and application, and bring up the Job Scheduler daemon (and verify it works!).  We’ll run that for a bit, then do it again with the remaining servers.  Again, there will be no analysis performed while we are upgrading the OSes.  This means that the analysis will again slip a bit further behind.  But, with the improved infrastructure, we should see the crawlers get caught up much, much faster.

We should start this process in the next few days.

Thanks for your continued patience as we effect these updates and improvements. And, as always, thank you for being part of the open source community!

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor


  • Lukas

    Btw, the OpenHub-logo appears a bit clipped for me in Firefox. It looks fine in IE though.


    The right side of the B in hub, and also the bottom of the Brown ring of quality* seem a little clipped.

    * http://dilbert.com/strips/comic/1996-06-11/

    • http://degenportnoy.blogspot.com/ Peter Degen-Portnoy

      Thanks, Lukas, for pointing this out. I’ve created a defect in our backlog.

  • Per Hedbor

    I do not know if this is related, but committers whose names contain non-US characters have recently stopped being shown correctly.

    As an example:

    https://www.openhub.net/p/pike/contributors/31510028031768 (should be 郭雪松, I think, it is sort of hard to know who is who. :)) and https://www.openhub.net/p/pike/contributors/31507880099490 (should be Henrik Grubbström)

    • http://degenportnoy.blogspot.com/ Peter Degen-Portnoy

      Thanks for the heads up. This is definitely related to our recent upgrade from Ruby 1.8.7 to Ruby 1.9.3 and the fact that our database is still encoded in SQL-ASCII. We are planning the upgrade of the database to UTF-8, which will be a significant step toward eliminating the large variety of encoding errors we have been having recently.

  • Olivier Mengué

    You still have issues with encoding in your crawlers: https://www.openhub.net/topics/9289

  • Brian Drummond

    If you update a project (which moved to a new repo last year) is there a way to bump its reanalysis? https://www.openhub.net/p/ghdl is highly active but looks dead because it was last analysed well over a month ago, when it was still pointing at the old repo. I’d expect updating certain settings like “code locations” would add a project to a list to be re-analyzed, but apparently not…