Progress report: Catching up on outdated analyses

Hi Folks, here is an update on what we’ve been working on and where things stand.

The development team is working on Platform Upgrade: Ruby and Rails (Project PURR), which we mentioned in the last post, and the project is going very well. We are about one-third of the way through and on track to finish on schedule.

Folks have been contacting us about projects that were very far out of date. At first, we were getting an occasional request; we would quickly reschedule the project analysis and get on with other things. However, after more than a few requests, we took another look.

It turns out that back in November, we had flagged a failure caused by a bug in the code that raised an exception during analyses. Only a few projects were affected, so we created a new failure group to monitor the situation. The number of affected projects in this failure group then blossomed rapidly to 120,000. Because the failure was due to a deployment problem that had been quickly resolved, we didn’t go back to revisit it. The affected projects should have had their analyses rescheduled, and all would have been well. Except that the failure group was created with the “do not retry” switch set. So these 120,000 projects sat in permanent failure without ever being rescheduled.
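
For the curious, here is a minimal SQL sketch of how one might size this kind of problem; the table and column names (jobs, failure_groups, do_not_retry) are illustrative, not our actual schema:

    SELECT fg.name, COUNT(*) AS stuck_jobs
    FROM jobs j
    JOIN failure_groups fg ON fg.id = j.failure_group_id
    WHERE j.status = 'failed'
      AND fg.do_not_retry          -- the switch that kept these jobs from rescheduling
    GROUP BY fg.name
    ORDER BY stuck_jobs DESC;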

In the interim, we started to experience heavy database loads, and project analyses were taking up to 8 hours each to complete. Loading up 18 crawlers with 4 analyses each was blocking all other jobs and would have taken over 800 days to analyze the approximately 250,000 projects with activity. Worse, these 72 concurrent analyses were using all available memory on the database server, so other processes trying to run Fetch, Import, or SLOC jobs, or even just web requests running queries, were getting Database Out of Memory errors, which typically killed those processes. Analyses were falling behind, with the oldest repositories approaching 30+ days out of date when they should have been at most 3 days old.
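
To put rough numbers on that: 72 concurrent analyses at about 8 hours apiece works out to roughly 216 analyses per day, and 250,000 projects at that rate is well over 800 days.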

We looked at our database configuration and found that autovacuum had been disabled at some point in the past. That seemed the clearest starting point. We re-enabled autovacuum and began vacuuming and analyzing the tables. This improved performance enough that we were able to increase the number of jobs the crawlers could run until the repositories were all up to date.
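
For reference, on PostgreSQL the re-enabling step looks something like the following; the table name is hypothetical:

    -- in postgresql.conf, set: autovacuum = on
    SELECT pg_reload_conf();   -- pick up the change without a restart
    SHOW autovacuum;           -- confirm the setting is back on

    -- catch up manually on a table that autovacuum has been missing
    VACUUM ANALYZE analysis_facts;   -- hypothetical table name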

Except for that failure group, whose repositories were being filtered out of these results.

During this catch-up period, we were focused on the number of jobs in the backlog and the age of the oldest repositories. As the system got caught up, we throttled back the crawlers because we were again putting serious load on the database, which was impacting the web site. Folks started to complain that pages were timing out with 504 errors. So we went back to the database and looked at logs, query plans, query structure, dead tuples, vacuum and analyze times, et cetera. There were two areas of concern. The first was that, even with autovacuum re-enabled, the database just wasn’t keeping up with autovacuuming the tables. The second was that the query planner’s estimates were off by large magnitudes for some queries.
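
If you want to check for the first problem on your own PostgreSQL database, the statistics collector makes it easy to spot dead-tuple buildup and stale vacuum times:

    SELECT relname, n_live_tup, n_dead_tup,
           last_autovacuum, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;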

To address the first, we took down the crawlers and manually ran “vacuum analyze” on each of the critical tables. Stopping the crawlers was required because the vacuum process couldn’t get enough uninterrupted time on the tables to make progress, which shows just how heavily these tables are used. Then we adjusted the ‘work_mem’ value to be in line with what the “explain analyze” output was reporting for our longest-running queries. Then, for good measure, we lowered the threshold at which autovacuum is triggered, so that it runs more frequently for shorter periods of time. We’re still monitoring those changes and expect to do more tuning. The upshot is that with up-to-date vacuum and analyze jobs, the query planner is doing a better job of estimating queries, and the database is allocating a more reasonable amount of memory for each query node.
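
In PostgreSQL terms, the knobs described above look roughly like this; the values and table name are illustrative, not our production settings:

    -- in postgresql.conf (or via ALTER SYSTEM on PostgreSQL 9.4+), set:
    --   work_mem = '64MB'
    SELECT pg_reload_conf();

    -- per-table settings: trigger autovacuum sooner on a hot table,
    -- so each run is smaller and shorter
    ALTER TABLE analysis_facts   -- hypothetical table name
      SET (autovacuum_vacuum_scale_factor = 0.02,
           autovacuum_analyze_scale_factor = 0.01);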

The result of this work is that the queries for the commits pages dropped from over 92,000 ms to around 2,000 ms, and the contributions queries dropped from over 63,000 ms to under 10,000 ms. Still way too long, but fast enough to stop the 504 problems. Plus, the longest-running Analyze jobs now finish in under 2 hours instead of over 8.

Now we return to that failure group. Over a week ago, we started reprocessing those 120,000 projects, 20,000 at a time. Each block of rescheduled projects loads up the crawlers and takes a while to work through. When the analyze jobs are rescheduled, we see the evidence in the growth of the job backlog and the age of the oldest repositories. When the job backlog comes down and the oldest repositories are again updated within 3 days, we schedule the next block of 20,000. We are about halfway through this process and should be done by next week.
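
The rescheduling itself is conceptually simple. A sketch of one block, again with illustrative table and column names:

    -- requeue one block of 20,000 stuck jobs from the failure group
    UPDATE jobs
    SET status = 'scheduled', failure_group_id = NULL
    WHERE id IN (
        SELECT id FROM jobs
        WHERE failure_group_id = 42   -- hypothetical id of the bad group
          AND status = 'failed'
        LIMIT 20000
    );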

We will also have to continue halting the crawlers and manually performing ‘vacuum analyze’ on the key analysis tables to ensure they remain performant. The correct and intended solution is to separate the analytics database from the web application database. The current work in Project PURR is helping us toward this goal by identifying the classes and database tables that are required only by the web application, those shared by the web application and the analytics engine, and those used only by the analytics engine. Our goal is to complete PURR and the database refactoring in 2015 while also adding some nice new features to the Open Hub.

Thank you to everyone who contacted us through email, the forums, Twitter, smoke signals, and telepathy. And thank you again for your support and continued patience as we take some time to re-jigger the Open Hub plumbing. Most of all, thank you for being part of the Open Source community.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor
  • Kaz Nishimura

    My first comment seems lost, so…
    I know you are working hard, but project status updates are apparently getting slower these days according to your Update Status page. The Update Status page itself is not updated regularly, either. I hope you can resolve your issues soon and make Open Hub up to date again.

    • Hi Kaz;

      Thanks for your comments. May I ask which Update Status page in particular is the one to which you are referring?

      • Kaz Nishimura

        http://blog.openhub.net/status/
        This is the page. It can be reached from your FAQ http://blog.openhub.net/faq-2/

        • Thank you, Kaz. I’ve updated the link for that image, so it should update correctly, although it may take a little time to load the first time it is viewed, since the image is being dynamically generated on demand.

          We are trying to maintain a good balance between loading up the back-end processing and keeping the front-end UI responsive. Favoring one directly impacts the other. That said, I’ve upped the thresholds on half of our back-end systems to try and catch up faster.

          • David Martínez Moreno

            Any of my teachers in university would have failed you because that graph has no units!
            Still, it would be *great* to show the size of the queue as I suggested the other day. Or graph the median age of the projects since the last update. Anything but a meaningless graph.

            I’m sure I speak for a lot of people when I say that we understand that you may have issues, but a) offering to put things manually on the queue is not going to scale, and b) having insight into how good (or bad) things are now can shed some light on the project status.

            Thanks!

          • Thank you for your suggestions, David. I will add them to our backlog and talk them over with product management. We can certainly find an elegant way to include this information more publicly and effectively.

          • Kaz Nishimura

            It appears the graph image is not updated as often as it should be. I guess your system is too overloaded these days, but we cannot know how well your system is working if the graph itself is out of date.

  • Mikael “MMN-o” Nordfeldth

    Cool. I was starting to wonder why https://www.openhub.net/p/gnusocial/ hasn’t received any analysis in the last three months, so I guess this is related. I’ll have to check back at the end of this week to see if you’re done with the tedious maintenance.

    Hope no other problems arise during this time for you! Cheers.

  • Per Hedbor

    Any new progress report? https://www.openhub.net/p/pike is still about a month behind the times.

    • We still have a large number of jobs to process, and with the recent increased load on the site (see the latest post: http://blog.openhub.net/2015/02/spammers_heaven/), it’s been difficult to make progress.

      However, I’ve rescheduled Pike for an update and raised the priority of the analysis job.

      • Per Hedbor

        Thank you!

  • Refdoc

    Hi,

    My code contribution has not been analysed now for 3 months.
    Also, a project that I added (Bibledit) seems not to have moved towards code analysis for 2 weeks or so now.

    • Thanks for the heads up. I am rescheduling the analyses for the projects to which you contributed.

      Please note that some of the repositories were not accessible and reported the following error: “Server SSL certificate verification failed: certificate issued for a different hostname, issuer is not trusted”

      • refdoc

        We use a self-signed certificate on our own server, but we have always done so. The failure to update has only appeared in the last few months.

        • I understand. This is blocking our ability to update project analyses and your analysis as well. Please feel free to notify us on our forums when this is resolved. Thanks!

  • hoi

    The systemd page hasn’t been updated in three months; is it related to this?

    • Somewhat 🙂

      The last code fetch was 2 months ago, but the Analyze Job had a priority that was too low.

      I’ve rescheduled a complete Fetch, Import, and SLOC, which is currently running, and will update the Analyze Job priority when the SLOC is complete.

      • hoi

        Ok, thanks 🙂

  • Heiya, I am kinda interested in the current status of this. My favourite project https://www.openhub.net/p/smw is still waiting for the next update. Cheers

    • Ah, should have read the posts of the previous days. Thumbs up that things work out allrighty soon. Keep on fighting. Cheers

    • Thanks for letting us know. We did a refresh of the code and bumped the analysis priority up. Should be done in the next day or so.

      • Great. Thanks a ton! Hopefully everything is back on track soon.

  • William Gross

    Hi Peter,

    I’ve noticed that the project I maintain (https://www.openhub.net/p/enterprise-web-library) has not been analyzed for about three months. Could you bump the analysis priority on it?

    Also, I’m looking forward to another Community Day if that is in the works! I enjoyed the event last year.

    • Hi William;

      A new Fetch, Import and SLOC has been done and a high-priority Analyze has been scheduled. Should be a few hours.

      We’ve been talking about a new Community Day as well, thanks for pitching for it!

  • Jim Arnell

    Hi, could you bump the updates on my favourite project https://www.openhub.net/p/generator-jhipster/ please?

  • Oleg

    Hi! The project that I help to maintain (https://www.openhub.net/p/RIOT-OS) has also not been updated for almost four months now. Can you trigger the update, please?

  • kibje

    Our Kodi project hasn’t been updated in 3 months either. https://www.openhub.net/p/kodi
    Is something blocking the update by any chance? I can work to resolve it.

  • Hugo Arregui

    I noticed firefox (https://www.openhub.net/p/firefox) is 3 months outdated; I don’t know if it’s related to this.

    • Hi Hugo; We’ve long since caught up after this blog posting, but firefox is out of date. We’ll look into it. Thanks for taking the time to point it out.

  • Volker

    Hi, my project JasperStarter is outdated since 2015. Can you please run a refresh?
    Best Regards
    Volker

    • Volker

      I misinterpreted – but it has been outdated for 6 months.

      • We’ve got the analysis updated with data collected about 6 days ago. Please ping us if you need another update.