In the blog post, Stepping Forward and Back, we mentioned that “we found additional complications with our new back end infrastructure.” We’d like to give you some more details about these complications.
We are referring to an NFS-mounted storage system with enough capacity for all 592,000 distinct repositories we track on the Open Hub. Without naming names, we have three problems with the currently installed system:
- It does not support certain characters that appear in the file names of some repositories, so fetching or updating those repositories generates an I/O error.
- It is case-insensitive by default, so files and directories that differ only by capitalization overwrite one another. This affects a difficult-to-quantify number of repositories: comparing nearly 600K source directories against local copies to identify those missing files or directories would be prohibitively expensive. Our current assessment is that nearly every repository is at risk of being impacted.
- Performance through the NFS mount point can be so poor that updates time out and the server at the source terminates the connection.
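The case-collision failure mode in the second item is easy to demonstrate. The sketch below (illustrative only; it is not the Open Hub's actual tooling) groups paths that differ only by capitalization, which is exactly what a case-insensitive filesystem would silently collapse into one file:

```python
from collections import defaultdict

def find_case_collisions(paths):
    """Group paths that differ only by capitalization.

    On a case-insensitive filesystem, every group with more than one
    entry would be collapsed into a single file, silently losing data.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Two files that coexist on a case-sensitive filesystem but collide
# on a case-insensitive one.
paths = ["src/Makefile", "src/makefile", "README"]
print(find_case_collisions(paths))
# → {'src/makefile': ['src/Makefile', 'src/makefile']}
```

Running a check like this against every repository is exactly the comparison that would be too expensive at the scale of nearly 600K source directories.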
There is an alternative solution available from our vendor (the system that was actually requested) without the above issues, but it has a hard-coded limit on the number of entries allowed in a single directory. We reviewed existing repositories and found multiple directories with more entries than that limit, which definitively rules out this alternative.
You may be asking yourself why we didn’t detect these problems before committing to this system. We wish we had. We did not because we were not the first team to use this system for this exact purpose, and these problems went undetected then. We had verified functionality and performance on a different system and were under the impression that the target production system was simply better in every regard, unaware that the installed system was not what had been specified. Finally, scheduling pressures encouraged us to move from our previous infrastructure of 18 bare-metal servers to our current VM infrastructure at an accelerated pace.
Here’s what we are doing to fix it: the system on which we did functional and performance testing is still available, and we will free more space on it to ensure there is enough room for all our repositories. We are starting the work to relocate storage of new fetches to this system, and then we will begin clean, new fetches of every repository on the Open Hub.
We will keep the existing data until we have had a chance to test every single repository. Right now, we know that over 60K repositories are affected by some kind of detectable failure; most are repositories that have moved and whose enlistments on the Open Hub have not yet been updated. We are taking this work as an opportunity to review every repository that cannot be cleanly fetched. At the end of this process, we will have clean local copies of the repositories on which the Open Hub depends, as well as a clear list of repositories that need review to see whether we can recover the projects that have enlisted them.
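The triage described above boils down to running each fetch and sorting the outcome into buckets: succeeded, timed out (the NFS performance failure mode), or failed outright (repository moved or deleted). A toy sketch of that classification, with hypothetical exception types standing in for whatever errors the real fetch tooling raises:

```python
class FetchTimeout(Exception):
    """Raised when a remote is too slow to respond (hypothetical)."""

class FetchError(Exception):
    """Raised when a remote reports an error, e.g. repository moved
    or deleted (hypothetical)."""

def triage(url, fetch):
    """Run a fetch callable against a repository URL and classify
    the outcome for later review."""
    try:
        fetch(url)
    except FetchTimeout:
        return "timeout"
    except FetchError:
        return "moved_or_gone"
    return "ok"
```

Repositories landing in the "moved_or_gone" bucket are the ones that end up on the review list mentioned above.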
Again, we apologize for any inconvenience this may have caused, and we thank you for your continued support and patience. We are so grateful that you are a member of the Open Hub and the open source software community.