Here’s a quick summary of the issue we’ll be discussing:
In mid-June, we moved off our bare-metal back-end crawlers into a virtualized environment. There were reasonable drivers and pressures pushing us to do this quickly and we found, about two weeks after the irreversible migration, that there were fundamental problems with the SAN storage that were unrepairable. Not only was this SAN unusable, it had caused an irrevocable loss of data quality in all of our repositories.
So we found a back-up for our back-end that would work and started the process of refetching all our repositories.
How Big is Big?
Just what does it really mean, this “refetching all our repositories” thing? The Open Hub is organized around projects. Each project may have zero or more enlistments, each of which maps the project to a code location. A code location may belong to more than one project. We currently have 675K projects on the Open Hub, of which 495K have enlistments. Those enlistments comprise 594K distinct code locations. Each of those code locations is what we mean when we talk about “repositories”: we have to re-fetch nearly 600K repositories from literally hundreds, if not thousands, of different servers.
We started with the most popular projects, which also tend to be some of the largest and most complex. We had to delete all the old job records, clear out a number of related data elements, and schedule new Complete Jobs (Fetch, Import and SLOC, that is, source lines of code counting) for each repository. We scheduled jobs for the first 300K projects in order of decreasing popularity. That generated some 550K jobs, most of which were Complete Jobs, but there were some Fetches as well (the scheduler has logic to determine which is best).
The Great Rescheduling started at the end of July (July 29) and quickly moved through 100K or so jobs. Things were looking good. That “back-up for our back-end” is a system from which another team was migrating. It will have ample storage for us when this other team has cleared off their files, and it has sufficient storage now for us to have gotten started. But with two teams performing significantly heavy work on the same SAN device, we’ve loaded this system to its maximum capacity. As a matter of fact, we loaded it so heavily that we went through a few weeks of the server regularly hanging and interrupting both teams’ work (we got the vendor to help us clear out those issues).
Since July 29, we’ve worked through 95K projects, which represents some 128K repositories. Remember that most repositories use one Complete Job, but some require three separate jobs (Fetch, Import and SLOC), and each project also gets an Analysis Job. That is how the 100K jobs reported earlier can cover far fewer than 100K repositories.
This leaves almost 398K projects in need of an updated analysis and just over 3K new projects that have not yet had a first analysis. (It’s nice to see new projects being added to the Open Hub!) Understanding that there are currently 208K jobs remaining (from the original 550K jobs scheduled just about 8 weeks ago) helps explain why many projects have not had a new analysis generated in the past two months. New job creation is blocked by the backlog of currently scheduled jobs. Oh, and we’ve manually scheduled updates for many, many projects when folks ask (we’re doing our best to keep up with requests; please drop us a line if you need something updated!).
You see, when the back-end job scheduler is all caught up, as it was before this upheaval, the majority of repositories would have been checked within the service window and did not need to be processed again. That’s when the job scheduler looks for new work to do — it searches for projects with no analysis or an out-of-date analysis and schedules brand new work for all the enlistments in that project. But since there is such a large backlog of existing jobs, the job scheduler never gets to the point of looking for new projects or stale projects. Nor will it until we can get through the backlog of initially scheduled work.
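The blocking behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual Open Hub scheduler: the `Project` class, the `next_jobs` function, the job-name strings, and the 3-day service window are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Project:
    name: str
    enlistments: List[str]            # code locations mapped to this project
    analysis_age_days: Optional[int]  # None means the project has no analysis yet


def next_jobs(backlog, projects, service_window_days=3):
    """Return the jobs the scheduler would work on next.

    service_window_days is an assumed value chosen for illustration.
    """
    if backlog:
        # Any existing backlog is worked first, so no new work gets scheduled.
        return list(backlog)
    new_jobs = []
    for project in projects:
        missing = project.analysis_age_days is None
        stale = not missing and project.analysis_age_days > service_window_days
        if missing or stale:
            # Schedule fresh work for every enlistment in the project.
            new_jobs.extend(f"complete:{loc}" for loc in project.enlistments)
    return new_jobs
```

With a non-empty backlog, `next_jobs` never reaches the loop that looks for new or stale projects, which mirrors why new projects are currently waiting.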
Going back to the shared SAN: now that this system is stable, work is being performed, but we can see that the load over the past two months has dramatically impacted the throughput. The graph below shows the count of completed and updated analyses by day in the columns. The trend line is a 7-day moving average. The periods of practically no activity were due to us crashing the server.
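For readers unfamiliar with the trend line, a 7-day moving average can be computed along these lines. This is a minimal sketch that assumes a trailing window (each day averaged with the six days before it); the graph may use a different variant, and the function name `moving_average` is ours.

```python
def moving_average(daily_counts, window=7):
    """Trailing moving average over daily counts.

    For the first window-1 days, the average covers only the days available.
    """
    averages = []
    running = 0
    for i, count in enumerate(daily_counts):
        running += count
        if i >= window:
            # Drop the day that just fell out of the window.
            running -= daily_counts[i - window]
        averages.append(running / min(i + 1, window))
    return averages
```

A run of near-zero days (such as a server crash) pulls the average down for a full week afterward, which is why the trend line recovers more slowly than the daily columns.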
On September 8, we completed 3500 analyses. Since then we’ve been averaging about 470 per day. The drop appears to be due entirely to the heavy use of the shared SAN device, which forms a bottleneck in the process.
The other team is nearing the end of their work; somewhere in the 2-3 week range is the current best estimate. They have begun clearing out directories that have been confirmed as successfully migrated, which is beginning to alleviate the load on the system, so we remain hopeful that the throughput will begin to rise again. If we process 3000 analyses per day, which is the average through September 8, that means another 4+ months to get through all the remaining projects before we can start the updates (which go much faster than the initial fetches). If we can maintain a better rate of 6000+ analyses per day, then we’re looking at 2-ish months (after the other team is completely off the shared SAN).
The bottleneck is the SAN, but other work can still be done, so we increased the back-end capacity by 50% to help push everything through (yay VMs!).
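The arithmetic behind those estimates can be checked with a few lines. This is a back-of-the-envelope sketch using the roughly 398K remaining projects mentioned earlier; treating a month as 30 days is our assumption, and `months_to_finish` is a name made up for this example.

```python
import math

REMAINING_PROJECTS = 398_000  # "almost 398K projects" still need an updated analysis
DAYS_PER_MONTH = 30           # assumption: a month is treated as 30 days


def months_to_finish(remaining, analyses_per_day, days_per_month=DAYS_PER_MONTH):
    """Estimate the months needed to clear `remaining` analyses at a steady rate."""
    if analyses_per_day <= 0:
        raise ValueError("analyses_per_day must be positive")
    days = math.ceil(remaining / analyses_per_day)
    return days / days_per_month


for rate in (3000, 6000):
    months = months_to_finish(REMAINING_PROJECTS, rate)
    print(f"{rate}/day -> about {months:.1f} months")
```

At 3000 per day this works out to about 4.4 months, and at 6000 per day to about 2.2 months, in line with the "4+" and "2-ish" figures above.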
Total Number of Projects to Update and Analyze: 495K
Total Number of Projects Updated since July 29: 95K (These are the most popular, which tend to be some of the biggest too)
Initial Number of Jobs Scheduled on July 29: 550K
Number of Jobs Remaining: 208K
Projected Duration to Complete Initial Refetch of ALL projects: 2 – 5 months after the other team frees up the shared SAN, which could be in 2-3 weeks.
Why So Slow: Multiple teams making heavy use of a shared SAN resource. The other team is migrating off of it as we are moving on to it. Not ideal, but it was necessary.