Researching Project Security Data

This started with a message from the outstanding Marc Laporte about the Project Security Data for the Tiki Wiki CMS Groupware project. Marc took what looks like a healthy amount of time to carefully document the discrepancies and areas of confusion around the security report. In kind, we’ve taken a deep dive into the data.

The Problem (in Brief)

Marc highlighted a few problems.  The first was that we were missing versions.  We were able to address that problem.

The more complicated problem is that there are discrepancies in the versions reported as affected by a vulnerability, as well as an inexplicable ordering of the versions.

The Explanation (Not Brief)

The author sat down with one of the senior members of the Black Duck Software (BDS) Knowledge Base (KB) team to look at the data being presented and to start unwinding the trail of data production back to its beginnings.

We looked at the data in our KB, the channels through which those data are obtained, how we have gotten and are getting those data, and what we are doing with them. The issue mostly boils down to “dates are hard.” Note that we’re not talking about engineers getting dates (that’s a different topic altogether) but about how a non-trivial system discovers, identifies, and documents dates that are connected to important events such as version releases of software.

Our story starts some 15 years ago in the early days of Black Duck, when ad hoc Open Source Software (OSS) standards were few and the forges were fewer. BDS engineers were interested in getting information about OSS fundamentally for license compliance. Releases were important, but licenses were more so. Dates were captured when available, and typically from the date stamps of files after syncing files locally, but there wasn’t a focused interest on the dates of releases. It seemed like good metadata to have and we like metadata.

One obvious challenge to this model arises when a team uploads a body of work onto a forge. The original file date stamps can be lost and replaced with the time at which the files were created on the new system. At this point, the KB sees a number of release tags all with the same time stamp.
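As a rough illustration of how such collapsed time stamps might be spotted, here is a minimal Python sketch. It is not how the KB actually works; the function name `suspicious_dates`, the `threshold` parameter, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import date


def suspicious_dates(releases, threshold=3):
    """Return the dates shared by at least `threshold` releases.

    `releases` maps a version string to its recorded release date.
    Many releases on one date hint that the original file date stamps
    were lost when the project was uploaded to a new forge.
    """
    counts = Counter(releases.values())
    return {day for day, count in counts.items() if count >= threshold}


# Hypothetical data: three tags share the upload date, one does not.
releases = {
    "1.0": date(2005, 3, 1),
    "1.1": date(2005, 3, 1),
    "1.2": date(2005, 3, 1),
    "2.0": date(2006, 7, 4),
}
print(suspicious_dates(releases))
```

A date flagged this way is only a hint that the release dates deserve a second look; a project can legitimately publish several releases on the same day.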

We layer onto this the reality that projects were often duplicated on different forges or through different release channels; for example, the source forge and the project’s download page. Over the years, the KB Research team has worked tirelessly to hunt down and correct duplications and to merge projects together. Please recall that the KB tracks significantly more projects than the 675,000 projects we track on the Open Hub (OH). All that said, we believe we have an opportunity to re-examine the merge logic and improve the dates selected for version releases, and we have opened a ticket to do that work.

However, one of the most complicating factors is that, in spite of these methods, we don’t always know about all the releases and sometimes learn about a release only when it’s mentioned in a CVE report. When this happens, we create a record for the release and, in lieu of any better information, record the date we learned about the release as the release date.
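One way to keep such placeholder dates from being mistaken for real ones is to store where each date came from. The Python sketch below shows the idea under stated assumptions: the `ReleaseRecord` class, the `date_source` field, and the source labels are hypothetical and do not describe the actual KB schema.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical label for a date that is really "the day we heard of it".
FIRST_SEEN_IN_CVE = "first_seen_in_cve"


@dataclass
class ReleaseRecord:
    version: str
    release_date: date
    date_source: str  # e.g. "forge", "file_timestamp", FIRST_SEEN_IN_CVE


def record_from_cve(version, learned_on):
    """Create a record for a release first seen in a CVE report.

    The discovery date stands in for the unknown release date.
    """
    return ReleaseRecord(version, learned_on, FIRST_SEEN_IN_CVE)


def has_reliable_date(record):
    """Report whether the date is safe to use for ordering releases."""
    return record.date_source != FIRST_SEEN_IN_CVE


placeholder = record_from_cve("2.10", date(2015, 6, 1))
print(has_reliable_date(placeholder))
```

With the source recorded, any logic that orders versions by date can skip or down-weight the placeholder entries.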

Add to this particular challenge that vulnerability feeds will often state that a vulnerability affects “this and all previous” versions. What exactly does that mean? Is version 6.15 before or after version 8.4? When we are confident in the dates we have for version releases, we can use them to determine what came before, but as we just examined, we cannot always be confident about such dates. What about applying the vulnerability to previous versions across branches? For example, a vulnerability affects 3.6 “and all previous versions.” We would all likely agree that this impacts versions 3.5, 3.4, and 3.3, that is, all previous 3.x versions. But what about version 2.10? Was that really affected as well? What about all the 2.x or 1.x releases? It just isn’t clear. And what if the vulnerability was in a component the project used? That isn’t clear either from the available data feeds.
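The ambiguity can be made concrete with a small Python sketch. It assumes purely numeric dotted version strings, which many real projects do not use, and the names `parse_version`, `previous_versions`, and `same_major_only` are hypothetical. The `same_major_only` flag stands in for the open question of whether “all previous versions” crosses branches.

```python
def parse_version(text):
    """Split a dotted version such as "3.6" into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def previous_versions(known, affected, same_major_only=True):
    """List known versions covered by "`affected` and all previous".

    With same_major_only=True, only the affected version's own major
    branch is included; with False, every older branch is swept in.
    """
    limit = parse_version(affected)
    selected = []
    for version in known:
        parsed = parse_version(version)
        if parsed > limit:
            continue  # newer than the affected version
        if same_major_only and parsed[0] != limit[0]:
            continue  # a different major branch
        selected.append(version)
    return sorted(selected, key=parse_version)


known = ["1.9", "2.10", "3.3", "3.4", "3.5", "3.6", "3.10", "6.15", "8.4"]

# Numeric comparison places 6.15 before 8.4 and 3.10 after 3.6.
print(parse_version("6.15") < parse_version("8.4"))  # True
print(previous_versions(known, "3.6"))
# ['3.3', '3.4', '3.5', '3.6']
print(previous_versions(known, "3.6", same_major_only=False))
# ['1.9', '2.10', '3.3', '3.4', '3.5', '3.6']
```

Neither answer is known to be right. The two results differ by exactly the 1.x and 2.x releases that the feed leaves unclear, and numeric ordering says nothing about when each branch was actually released.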

Oh, and we should mention that vulnerability feeds, such as the NVD Data Feeds, change over time. For example, NVD version 1.2 provided this “affected versions” identifier, but it was dropped in version 2.0, although we expect that it will return in version 3.0.

The takeaway is that we think we can do something in the short term that might help clear up dates to make them more reasonable, but the real fix will come from improved efforts on identifying the actual versions that are affected by vulnerabilities so we can do away with blanket policies.

This is why Black Duck is making a concentrated effort to provide effective information on OSS vulnerabilities. We’ve assembled a dedicated research team that is focusing on this problem to ferret out in greater and more reliable detail the true relationships between vulnerabilities and the OSS projects and versions affected by them.

About Peter Degen-Portnoy

Mars-One Round 3 Candidate. Engineer on the Open Hub development team at Black Duck Software. Family man, athlete, inventor.