Hail Fellow Open Hubbarians! You have been marvelously patient, for which we are eternally grateful, as we have been cleanly implementing the Ohloh UI in a new code base on the latest production versions of Ruby and Rails. Here is how events transpired over the past half year: we first talked about Project PURR in early January, then in February we identified a sizable problem with spammers, and in April we confirmed that we had indeed closed new account creation. We also shared that we were having performance issues due to excessively slow database queries.
During this time we worked diligently to keep the site up and running while trying to shrink the backlog of out-of-date projects. We even shut the site down to clean out old cruft in the database. That was very helpful: it dramatically reduced the number of out-of-date analyses and support records, and unburdened the system enough that we caught up with just about all the projects and were looking to get any outliers cleaned up.
About two weeks ago, we lost disks in two of our 18 crawlers. It should not have been a problem because we keep a supply of disks on hand, and we had two remaining. These disks are not manufactured any more, are hard to find, are expensive, and when we do find them, they are refurbished. You may not be surprised to learn that one of those two disks was bad and the RAID array was not able to rebuild. And then we learned something else about our infrastructure. Our crawler architecture is highly robust in terms of jobs being interrupted, killed, and restarted. The design is sufficiently robust that, should one of the workers on a crawler become unavailable, the others can proceed without interruption. Unfortunately, the architecture cannot withstand the loss of a full server.
Each server, as part of normal processing, will check code and push code to other servers to ensure we have at least two copies of each repository. We tried running the remaining crawlers with the damaged one shut down. Within a few hours, all 17 remaining crawlers were hung while waiting to get status from, or push to, the missing crawler. We tried restarting the processes repeatedly, and would get some processing to complete, but the vast majority of work would just block and stay hung. The upshot is that there has not been any crawling or project updates in over two weeks. This is very disappointing, especially since we had just gotten ahead of the database performance problems that had been problematic for a while.
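The replication rule described above can be sketched roughly as follows. This is an illustrative Ruby sketch only, not our actual crawler code; the names (`CrawlerHost`, `ensure_replicated`, the in-memory `copies` list standing in for a real network push) are all hypothetical.

```ruby
# Illustrative model of the "at least two copies of each repository" rule.
# A real crawler would transfer data over the network; here `copies` is
# just an in-memory list so the sketch is self-contained.
CrawlerHost = Struct.new(:name, :copies)

MIN_COPIES = 2

def ensure_replicated(repository, crawler_hosts)
  # Hosts that already hold a copy of this repository.
  holders = crawler_hosts.select { |host| host.copies.include?(repository) }
  missing = MIN_COPIES - holders.size
  return holders.map(&:name) if missing <= 0

  # Push to other hosts until the minimum copy count is met.
  (crawler_hosts - holders).first(missing).each do |host|
    host.copies << repository # stands in for the real push to a peer server
    holders << host
  end
  holders.map(&:name)
end
```

The catch, as we found out, is that a step like this assumes every peer will answer; with one server gone, each crawler can end up waiting on it forever.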
So we shut down all the crawlers. New disks were on order, and because the array is configured as RAID 10 (striped mirrors), replacing the failed disk should have allowed its mirror to rebuild it. Two weeks later, when the new disks arrived, the RAID controller could not rebuild the array with the new disk. We’re looking at full data loss, but most critically, we still cannot do any crawling.
We are working on multiple plans:
- Remove the missing crawler from the database so that the other crawlers stop looking for it.
- Adjust the code so that crawlers are more robust when one of their peers cannot be reached.
- Redesign the crawler code so that we can move it off the 18 crawlers and into a virtual environment that relies on a SAN for a single, highly reliable copy of each repository (no more duplication across crawlers!).
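The second plan above amounts to never letting a peer call block forever. A minimal Ruby sketch of the idea, assuming a hypothetical setup where each peer status check is a callable (`reachable_peers` and `PEER_TIMEOUT` are illustrative names, not our actual code):

```ruby
require "timeout"

# Seconds to wait for a peer before giving up on it (illustrative value).
PEER_TIMEOUT = 5

# Returns only the peers that answered in time; an unreachable or failing
# peer is skipped instead of hanging the whole crawler.
def reachable_peers(peers)
  peers.select do |peer|
    begin
      Timeout.timeout(PEER_TIMEOUT) { peer.call }
      true
    rescue Timeout::Error, StandardError
      false # skip this peer; the crawl proceeds with whoever responded
    end
  end
end

up   = -> { :status_ok }
down = -> { raise "connection refused" }
reachable_peers([up, down]).size # the dead peer is dropped, not waited on
```

With a guard like this, the remaining 17 crawlers could have kept working around the missing server rather than blocking on it.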
The key takeaway is that we cannot run the crawlers until we get past this unanticipated failure. No crawlers means no new analyses, and that goes for projects, people, and organizations.
Concurrent with this trouble, we are close to releasing the results of Project PURR. This should be a near-seamless transition, with the exception that you will eventually need to log in again. After PURR is in production, we will re-enable the ability to create new accounts. New accounts will be created using Twitter Digits, and to ensure consistency across the site, existing accounts will be required to update their login credentials using Twitter Digits. We understand that this may be unpopular with some account holders, but we hope all will understand that external verification of an account’s validity will help to reduce the amount of spam on the Open Hub and make the site more useful for legitimate users.
We are running final tests and expect to direct some traffic to the new application in the next few days.
Once again, thank you all so very much for being part of the Open Hub community and for your extraordinary patience while we address critical infrastructure aspects in order to deliver features that will help the Open Hub continue to provide value to the Open Source community.