Update: We’re doing it!

Back-End Background

Here’s a quick summary of the issue we’ll be talking about:

In mid-June, we moved our back-end crawlers off bare metal and into a virtualized environment. There were reasonable drivers and pressures pushing us to do this quickly, and we found, about two weeks after the irreversible migration, that there were fundamental, irreparable problems with the SAN storage. Not only was this SAN unusable, it had caused an irrevocable loss of data quality in all of our repositories.

So we found a back-up for our back-end that would work and started the process of refetching all our repositories.

How Big is Big?

Just what does it really mean, this “refetching all our repositories” thing? The Open Hub is organized around projects. Each project may have zero or more enlistments, each of which is a mapping of a project to a code location; a code location may belong to more than one project. We currently have 675K projects on the Open Hub, of which 495K have enlistments. Those enlistments point to 594K distinct code locations. Each of those code locations is what we mean when we talk about “repositories”: we have to re-fetch nearly 600K repositories from literally hundreds, if not thousands, of different servers.
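To make that data model concrete, here is a minimal sketch of the project/enlistment/code-location relationships in Python. The class names, fields, and the URL are illustrative only, not the actual Open Hub schema:

# Illustrative sketch of the project / enlistment / code-location model.
# Names, fields, and the URL are hypothetical; the real Open Hub schema differs.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class CodeLocation:
    url: str              # a single repository ("code location") on some server

@dataclass
class Project:
    name: str
    enlistments: List[CodeLocation] = field(default_factory=list)  # zero or more

# One code location can be enlisted in more than one project:
shared_repo = CodeLocation("https://git.example.org/linux.git")
kernel = Project("linux", enlistments=[shared_repo])
fork_tracker = Project("some-kernel-fork", enlistments=[shared_repo])
print(kernel.enlistments[0] is fork_tracker.enlistments[0])   # True: shared repository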

We started with the most popular projects, which also tend to be some of the largest and most complex. We had to delete all the old job records, clear out a number of related data elements, and schedule new Complete Jobs — Fetch, Import and SLOC (Source Lines Of Code counting) — for each repository. We scheduled jobs for the first 300K projects in order of decreasing popularity. That generated some 550K jobs, most of which were Complete Jobs, though there were some plain Fetches as well (the scheduler has logic to determine which is best).
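As a rough, hypothetical illustration of that “which job is best” decision — the real scheduler logic is more involved, and the attribute names here are invented — the choice comes down to whether a repository still has usable imported data:

# Hypothetical sketch of the job-type decision; not the actual scheduler code.
def choose_job(repo):
    """Pick a Complete Job (Fetch + Import + SLOC) or a plain Fetch."""
    if repo.get("never_imported") or repo.get("raw_data_lost"):
        return "Complete"   # rebuild everything from scratch
    return "Fetch"          # imported data is intact; just pull new commits

# Example: a repository whose imported data was lost in the SAN failure
print(choose_job({"raw_data_lost": True}))   # -> "Complete"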

Completed Work

The Great Rescheduling started at the end of July — July 29 — and quickly moved through 100K or so jobs. Things were looking good. That “back-up for our back-end” is a system another team is migrating off of. It will have ample storage for us once that team has cleared off their files, and it has sufficient storage now for us to get started. But with two teams performing significantly heavy work on the same SAN device, we loaded the system to its maximum capacity. In fact, we loaded it so heavily that we went through a few weeks of the server regularly hanging and interrupting both teams’ work (we got the vendor to help us clear out those issues).

Since July 29, we’ve worked through 95K projects, which represents some 128K repositories. Remember that most repositories need a single Complete Job, but some require three separate jobs — Fetch, Import and SLOC — and each project needs its own Analysis Job on top of that, so the 100K jobs that were reported can cover far fewer than 100K repositories.
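As a toy example of that job arithmetic (the numbers here are invented purely for illustration): a single project with three repositories, one of which needs the three separate jobs, already generates six jobs for three repositories:

# Toy illustration of why job counts overstate repository coverage.
repos_with_one_complete_job = 2      # 1 Complete Job each
repos_with_split_jobs = 1            # Fetch + Import + SLOC = 3 jobs
analysis_jobs_for_the_project = 1    # every project also gets an Analysis Job

total_jobs = (repos_with_one_complete_job * 1
              + repos_with_split_jobs * 3
              + analysis_jobs_for_the_project)
total_repos = repos_with_one_complete_job + repos_with_split_jobs
print(f"{total_jobs} jobs cover {total_repos} repositories")   # 6 jobs, 3 repositories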

This leaves almost 398K projects in need of an updated analysis, plus just over 3K new projects that have not yet had a first analysis. (It’s nice to see new projects being added to the Open Hub!) Understanding that there are currently 208K jobs remaining (from the original 550K scheduled just about 8 weeks ago) helps explain why many projects have not had a new analysis generated in the past two months: new job creation is blocked by the backlog of currently scheduled jobs. Oh, and we’ve manually scheduled updates for many, many projects when folks ask — we’re doing our best to keep up with requests, so please drop us a line if you need something updated!

You see, when the back-end job scheduler is all caught up, as it was before this upheaval, the majority of repositories have been checked within the service window and don’t need to be processed again. That’s when the job scheduler looks for new work to do: it searches for projects with no analysis or an out-of-date analysis and schedules brand-new work for all the enlistments in those projects. But with such a large backlog of existing jobs, the job scheduler never gets to the point of looking for new or stale projects. Nor will it until we get through the backlog of initially scheduled work.
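In simplified, hypothetical pseudocode (the real back-end scheduler is more involved, and these names are assumptions), the behavior described above looks roughly like this — new or stale projects are only considered once the existing queue is empty:

# Simplified sketch of the scheduling loop; structure and names are assumptions.
from collections import deque

SERVICE_WINDOW_DAYS = 30

def scheduler_tick(job_queue, projects):
    if job_queue:                       # a backlog exists: keep working it
        job = job_queue.popleft()
        job()
        return
    # Queue drained: look for projects with no analysis or a stale one.
    for project in projects:
        age = project.get("analysis_age_days")
        if age is None or age > SERVICE_WINDOW_DAYS:
            for enlistment in project["enlistments"]:
                job_queue.append(lambda e=enlistment: print("new job for", e))

# Example: an empty queue immediately schedules work for a stale project.
queue = deque()
scheduler_tick(queue, [{"analysis_age_days": 60, "enlistments": ["repo-A", "repo-B"]}])
print(len(queue), "jobs queued")   # -> 2 jobs queued

With roughly 208K jobs still queued, the stale-project branch is never reached, which is why no new analyses are being scheduled automatically right now.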

Remaining Work

Going back to the shared SAN: now that the system is stable, work is getting done, but we can see that the load over the past two months has dramatically impacted throughput. The graph below shows the count of completed and updated analyses per day in the columns; the trend line is a 7-day moving average. The periods of practically no activity were due to us crashing the server.

On September 8, we completed 3,500 analyses. Since then, we’ve been averaging about 470 per day. This slowdown appears to be due entirely to the heavy use of the shared SAN device, which is the bottleneck in the process.

[Figure: Daily Analyses Updated and 7-Day Trend Line]
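For reference, the 7-day trend line in the chart above is just a simple moving average over the daily counts; a minimal way to compute one (with made-up daily numbers) is:

# Minimal 7-day moving average over daily analysis counts (numbers are made up).
daily_counts = [3500, 470, 520, 410, 480, 450, 500, 460, 490]

def moving_average(values, window=7):
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(moving_average(daily_counts))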

The other team is nearing the end of their work — somewhere in the 2–3 week range is the current best estimate. They have begun clearing out directories that have been confirmed as successfully migrated, which is starting to alleviate the load on the system, so we remain hopeful that throughput will begin to rise again. If we process 3,000 analyses per day — roughly the average through September 8 — we’re looking at another 4+ months to get through all the remaining projects before we can start the regular updates (which go much faster than the initial fetches). If we can maintain a more optimal 6,000+ analyses per day, then we’re looking at 2-ish months (after the other team is completely off the shared SAN).
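For the curious, that 2-to-5-month range is simple arithmetic on the remaining project count and the daily throughput; a back-of-the-envelope version using the figures quoted above:

# Back-of-the-envelope projection using the figures quoted in this post.
remaining_projects = 398_000

for rate_per_day in (3_000, 6_000):
    days = remaining_projects / rate_per_day
    print(f"{rate_per_day:,}/day -> ~{days:.0f} days (~{days / 30:.1f} months)")
# 3,000/day -> ~133 days (~4.4 months); 6,000/day -> ~66 days (~2.2 months)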

Because the bottleneck is the SAN but other work can still proceed in parallel, we increased our back-end capacity by 50% to help push everything through (yay VMs!).

TL;DR

Total Number of Projects to Update and Analyze: 495K

Total Number of Projects Updated since July 29: 95K (These are the most popular, which tend to be some of the biggest too)

Initial Number of Jobs Scheduled on July 29: 550K

Number of Jobs Remaining: 208K

Projected Duration to Complete Initial Refetch of ALL projects: 2–5 months after the other team frees up the shared SAN, which could be in 2–3 weeks.

Why So Slow: Multiple teams making heavy use of a shared SAN resource. The other team is migrating off of it as we are moving on to it. Not ideal, but it was necessary.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor
  • Shouldn’t the initial fetch and analysis of each project be done entirely on the local disk, then the results of the project be pushed to the SAN? This would put less pressure on the SAN, which would mostly have write operations.

    • Hi Vincent;

      Thanks for responding with your thoughts and suggestion. There are a number of factors that are driving the current architecture.

One is that we had an architecture similar to this suggestion in the original implementation, and it had its own set of inherent risks, like needing sufficient local storage on all those servers to support pulls of large repositories, and the complexity of needing to pull the repositories back off the SAN in order to perform updates to the repos.

      But the best answer is that this is really a temporary situation, although it has gone longer than we anticipated.

      When we first modified our back-end to work in a virtualized environment, we did all our performance, load and scalability testing on this exact SAN. At that time, it wasn’t subjected to the heavy, constant load of massive data migration and it performed flawlessly. We anticipate that, once the system is not subject to such contention, it will perform again as it did in the past. Plus, at that point, only our back-end will be using it.

      • Hi Peter. Thanks for your quick answer.

        I agree that this is temporary, but that’s why I was suggesting this only for the initial fetch and analysis, mainly for the rebuild of everything: I suppose that this shouldn’t be too complex and would solve the problem of the 2–5 month delay, which is a lot of time! Alternatively, if there are projects that are more or less dead, perhaps it may be better to give a higher priority to updates of the most active projects (assuming that such updates would be fast enough not to significantly increase the delay to complete the rebuild).

        • Excellent suggestion to focus on the most active projects and prioritize that work. I’m working on that now and raising the priorities of the outdated projects accordingly.

          • Sanjeev Gupta

            As an example of a project that is active, and heavily used, can I highlight the Linux kernel?
            https://www.openhub.net/p/linux/enlistments
            has not been updated for some time now.

          • Hi Sanjeev;

            Thanks for the ping. You’re right. I’ve looked at the most recent job and see it failed. Looks like it might be on our side. We’ll get right on it.

  • Jan

    Hi Peter,
    can you retrigger https://www.openhub.net/p/mpf/? The last run was over a month ago, and we usually also qualify as a very high activity project.

    • Hi Jan;

      Thanks for the heads up; the project has been brought up to date. Please let me know if it is needed again.

  • Dmitri Zimine

    Hi Peter, thanks for sharing what is happening.

    I am looking at StackStorm and it runs 2 months behind — “based on code collected about 2 months ago” (https://www.openhub.net/p/st2). I guess we are just somewhere in the queue, right? For some strange reason we need these analytics, which we used to grab from here — any way to retrigger now? If not, I understand and will wait at our place in the queue.

    And thanks for running openhub.

    • Hi Dmitri;

      Thanks for the heads up. Actually, there was a recent Analysis on some old repository data, so we scheduled a full refresh and everything is up to date.

      • Dmitri Zimine

        Thanks a lot Peter!

  • Hi Peter,

    Thanks a lot for the update, now I understand what’s going on 😉

    Any chance you can try to kick both openhub.net/p/nextcloud/ (which sadly got stuck again on three repos after you commented on the forum post) and openhub.net/p/owncloud/ (which somehow seems 2 years old!)?

    It’d be really great if we could have them with a fresh analysis!

  • Arcadiy Ivanov

    Hi Peter,

    Could you provide an ETA for resumed normal refresh schedules? Some of the projects I’m involved with were not refreshed for more than a month.

    • Hi Arcadiy;

      We are down to about 10K jobs from the original 550K, plus about 300K that were manually and automatically scheduled since we started the full rebuild.

      At this point, we are looking for projects with old analysis and adding more jobs to try and get them back up to date, but the Job Scheduler should start picking those up automatically very soon.

  • Alexis La Goutte

    Hi,

    Is it possible to rebuild the #wireshark stats?
    https://www.openhub.net/p/wireshark/

    • Hi Alexis;

      Sorry for the delay and thanks for the heads up. All caught up again.

  • https://www.openhub.net/p/xoreos and https://www.openhub.net/p/xoreos-tools don’t seem to update either. The latter got analyzed 24 days ago, but the code is from 5 months ago, which is weird too.

  • P van Kleef

    Hi Peter,

    Thanks for the insight.

    Would it be possible to trigger a rescan of the OpenLink Virtuoso (Open Source) project: https://www.openhub.net/p/_d_4186
    as it is now lagging by 6 months?

  • Uwe Steinmann

    Hi Peter,

    sorry for bothering you, but https://www.openhub.net/p/seeddms was analysed 8 days ago yet is based on code from 2 months ago. Looks like human intervention could help. Thanks.

    • Hi Uwe;

      No bother at all, we appreciate knowing about out-of-date projects. A manual update was scheduled and has completed. Please let us know if the project needs another push.

  • Gunther Deschner

    Hi Peter,

    The Samba project (https://www.openhub.net/p/samba) has not been updated for almost 4 months now, would it be possible to check whether there are some manual steps necessary to make it update again?

    Thanks,
    Guenther

    • Hi Gunther;

      Thanks for the heads up. The project is back up to date.

      • Gunther Deschner

        Thanks so much!

  • Dan Kohn

    Peter, could you please unblock and reindex https://www.openhub.net/p/linux?

    • Hi Dan;

      We’ve been addressing a low-level issue that had blocked the Linux project from updating. The SLOC job is currently running and is on step 294972 of 648711.

  • Michael Schumacher

    Hi,

    what’s the current status of this topic?

    “$project is not updating” still seems to top the recent forum posts over anything else, so I wonder if the improvement turned out as expected.

  • Barry Smith

    You are still having problems keeping up to date on active projects. Please update spack, which you haven’t looked at in 6 months, and PETSc, which is two months behind.

    • Uwe Steinmann

      Similar problem with seeddms, which is just 2 months behind. The project’s security statistics are also not very recent. Is there anything I can do to improve the situation?

      • Hi Uwe;

        Thanks for the heads up. Seeddms is up to date.

        Could I ask you to contact us at info@openhub.net about the security statistics so we can get some more details from you?

        Thanks!

    • Hi Barry;

      Spack is up to date and PETSc is running. Thanks for the ping. Yes, we are continuing to have issues with our back end. It’s our top priority.