Open Hub in 2017

Hail Hubbites!

We’d love to share some of the things that have been going on and will be going on here in Open Hub Land. We accomplished some very significant work in 2016 and would like to take a moment to lay it out and then talk about what we’d like to accomplish in 2017.

2016 Review

Please recall from our 2016 Review what we did in 2015: rebuilt the UI, addressed spam account creation, improved back-end performance (5X in some cases), and started inventing new security data features. The plan for 2016 was to create a new Project Vulnerability Report and Project Security Pages, run the Spammer Cleanup Program, virtualize the back end (the FISbot project), switch to Ohcount4J, and connect to other sites related to OSS. Here’s how we did:

  • Invented the Project Vulnerability Report algorithms and presentations
  • Prototyped Project Security Pages with the (now closed) security.openhub.net pages
  • Deployed FISbots and Ohloh Analysis onto virtual servers (this involved migrating some 10TB of OSS project data from multiple servers to a single SAN)
  • Started running batches of accounts through the Spammer Cleanup Program.  To date, we’ve cleared out some 350,000 spam accounts (YAY!!)
  • Designed and implemented a prototype Project Security Page to report known vulnerabilities in OSS projects, and collected user feedback from that experiment
  • Explored using Ohcount4J instead of Ohcount.  Decided to stay with Ohcount.
  • Added a feature to add an entire GitHub account to a single Open Hub project
  • Made numerous back-end improvements and defect resolutions to consistently deliver web pages in under 200 ms (6X faster than 2015 on average)
  • Defended against a number of malicious attacks against our API service and web site (comes with the territory of running a non-trivial web application, amirite?)

There’s more though!

The FISbot was implemented as a stopgap measure to address issues we had with the bare-metal back-end crawlers. We were waiting for another project to provide a central set of Fetch, Import, and SLOC services to the Black Duck enterprise; the plan was to shut down the FISbots and use this other service. However, after deploying our FISbots, it was decided that we should expand the FISbot to handle the additional enterprise scenarios. So, completely unplanned at the beginning of the year, we implemented the eFISbot Project, which we also delivered last year.

Last point: as we talked about in the Detail on the Infrastructure post, the migration of that 10TB collection of OSS project data onto the production server ran into serious issues that forced us to re-fetch every one of the nearly 600K code locations we monitor. This was a serious multi-month disruption, from which we have now mostly recovered. We have re-fetched all the repositories, but there are lingering issues in getting all those repositories and corresponding projects refreshed within the 24–72 hour window we’ve set for ourselves.

So, in summary, we’ll add to our 2016 Review:

  • Implemented and delivered eFISbot
  • Survived the treacherous NFS SNAFU and the Great Code Location ReFetch

I feel it is also important that we mention again the passing of our friend and colleague Pugalraj Inbasekaran in February. I still feel his absence as an ache near my heart and miss him.

2017 Plan

We have a few main areas of focus for 2017:

  1. Make the back end screamingly fast
  2. Make it wicked easy to add projects from GitHub to the Open Hub and get data from the Open Hub into your GitHub pages (see the sketch after this list)
  3. Continue the UI update with wider pages and more responsive layouts
  4. Add new languages to Ohcount
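
On that second point: you can already pull Open Hub data into a page today through the existing XML API. Here’s a minimal sketch, assuming the /projects/{name}.xml endpoint and an API key; the response fields parsed below are illustrative, so check the API documentation for the exact shape:

    import urllib.request
    import xml.etree.ElementTree as ET

    API_KEY = "YOUR_API_KEY"   # placeholder; request a real key from the Open Hub
    PROJECT = "linux"          # the project's Open Hub name

    # Fetch the project summary as XML from the Open Hub API.
    url = f"https://www.openhub.net/projects/{PROJECT}.xml?api_key={API_KEY}"
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)

    # Pull a couple of summary fields (illustrative names) and print a
    # one-liner you could drop into a README or GitHub page.
    project = tree.find("result/project")
    print(f"{project.findtext('name')}: rated "
          f"{project.findtext('average_rating')} on the Open Hub")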

For that back end, we’ve been given permission to obtain a new set of servers.  Currently, the Open Hub runs off a single database (we’ve talked about that over and over again).  We’ve put in a purchase request for 2 database servers that have over 4X the CPU cores and 9X the RAM. One server will be the master and the other the replica. These servers will support only Fetch, Import, SLOC, and Analysis operations (write intensive), so we’re calling this the FISA DB.  The current database will remain with the sole purpose of presenting generated analyses (read intensive) through the Ohloh-UI application, so that will be the UI DB.  We are SO VERY EXCITED!!! SQUEEEEEE!!! Ah. Sorry; sorry. Please excuse the author (but it’s SOO exciting!)
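
To make the division of labor concrete, here is a hypothetical sketch of how the split might look from application code; the connection strings, table, and helper names are ours for illustration only, assuming a PostgreSQL back end rather than showing actual Open Hub code:

    import psycopg2  # assumes a PostgreSQL back end

    FISA_DSN = "host=fisa-db dbname=openhub"  # new master/replica pair: write-intensive work
    UI_DSN   = "host=ui-db dbname=openhub"    # current server: read-intensive UI queries

    def record_analysis(project_id, loc):
        """FISA path: Fetch/Import/SLOC/Analysis jobs write fresh results."""
        with psycopg2.connect(FISA_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO analyses (project_id, loc) VALUES (%s, %s)",
                (project_id, loc),
            )

    def latest_analysis(project_id):
        """UI path: Ohloh-UI only presents already-generated analyses."""
        with psycopg2.connect(UI_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT loc FROM analyses WHERE project_id = %s "
                "ORDER BY created_at DESC LIMIT 1",
                (project_id,),
            )
            return cur.fetchone()

Keeping the write-heavy FIS and Analysis traffic off the database that serves pages is the whole point: neither workload should be able to starve the other.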

As always, thank you so very much for being part of the Open Source Software community and your continued support of the Open Hub.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor.
  • Some projects aren’t updated for months, despite commit activity. Examples are SimThyr,
    PUMA Repository and even large projects like Lazarus and the Free Pascal Compiler. Is this behaviour intended?

    • Hi Johannes;

      You are correct. There have been a lot of changes to our back end, from our architecture to significant infrastructure, and the back end is behind in getting projects updated. It’s our top priority and we’re working on it.

      In the meantime, don’t be shy about tweeting to @bdopenhub with projects you’d like us to update.

  • Tadeusz Chelkowski

    This is the single most important site about OSS development. Thank you for doing a great job!

    • Thank you for your kind words.

      • I agree it is a great way for people to check the health of the project they are looking at. I’m working on a blog post where I point that out, actually 😉

  • Dan Kohn

    Hi, Peter. I just confirmed with Greg Kroah-Hartman (the Linux stable kernel maintainer) that your numbers for the Linux kernel are way off. Specifically, your 12 month numbers for # of commits are 5x too high and # of developers are 3x too high. The wrong numbers can be seen under the 12 month and 30 day statistics, neither of which makes any sense: https://www.openhub.net/p/_compare?project_0=Linux+Kernel&project_1=Kubernetes

    Correct numbers from Greg’s processing of git for 2016 commits are: “Processed 70067 changesets from 3846 developers”. Roughly the same numbers are available analyzing GitHub’s data for free using BigQuery with this query: https://gist.github.com/dankohn/09e8ce913685b4f794a04dd6ffa2e783

    You have some big bug in how you’re dealing with Linux.

    • Greg Kroah-Hartman

      Or maybe I just messed up my calculation somewhere? I like the huge numbers posted here if that’s what we are really doing, maybe I’m overlooking some random commits somewhere else? 🙂

      • Dan Kohn

        Linus made the point in his talk in Tahoe last week that a huge amount of code gets written for the kernel that gets rejected, as multiple developers try different approaches simultaneously, and only one approach eventually gets chosen. So, maybe you’re getting credit for all of that extra code. https://www.linux.com/news/event/open-source-leadership-summit/2017/2/video-linus-torvalds-how-build-successful-open-source-project

        • Greg Kroah-Hartman

          Oh, if only people could start tracking all of the failed patch submissions, that would be wonderful. On average, I only accept about 1/3 of the patches sent to me. But that still wouldn’t account for this large of a discrepancy…

      • Hi Greg;

        We verified that the two Enlistments contained the exact same commit history, so we removed one and generated an updated analysis.

        May we ask for your perusal of the numbers to see if they are more in line with your expectations? The most recent analysis is based on code retrieved 18 days ago.

        Thanks so very much!

        • Greg Kroah-Hartman

          Yes, that looks much more realistic, thanks for fixing this.

    • Hi Dan; Thanks so very much for raising this. We’re monitoring two code locations (https://www.openhub.net/p/linux/enlistments); perhaps that is incorrect. We’ll take a close look at the matter.

      • Greg Kroah-Hartman

        The github tree should be a mirror of the kernel.org one, so if your scripts are correct, and you are looking at git commit hashes, they should be fine. But why track two different trees anyway? Just use the kernel.org one, that’s the canonical source tree for the kernel.

        • Thanks so much for the info; this is very helpful.

          If I recall correctly, it was switched to GitHub because we were having problems reliably fetching from kernel.org. We may not have been aware that they are mirrors. Commits are scoped by Code Set, which is a snapshot of a Code Location. Different Code Locations will always have different Code Sets, and hence there will never be a commit hash collision.
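
          To illustrate with a toy model (not our actual schema), here is why two identical Code Locations double the totals until one Enlistment is removed:

              # Toy model: commits are scoped by (Code Set, hash), so the same
              # commit seen through two Code Locations is counted twice.
              kernel_org    = {"abc123", "def456"}  # hashes fetched from kernel.org
              github_mirror = {"abc123", "def456"}  # identical history via GitHub

              code_sets = {"cs-1": kernel_org, "cs-2": github_mirror}
              print(sum(len(h) for h in code_sets.values()))  # 4: double-counted

              # After removing the duplicate Enlistment, counts are correct again.
              del code_sets["cs-2"]
              print(sum(len(h) for h in code_sets.values()))  # 2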
