SBOMs, SCA, and PURLs. Oh my!
It must have been a year or so ago when I was looking for an open source vulnerability scanner to use in a project I was working on. As I scoured the internet, I stumbled upon a project called “VulnerableCode” – a server that could run locally and would return vulnerability information if you called its API and gave it a PURL.
What’s a PURL? It’s an abbreviation for Package URL, and it identifies a component that’s used in the software we build. Think of it like a hyperlink that contains metadata such as ecosystem, name, and version, among other things…
Why is it so important? It’s quite simple. If you have a component’s PURL, you can query a vulnerability database and get a list of CVEs that affect that component.
So we can think of a PURL as a key of sorts – and it shows up everywhere in a Software Bill of Materials.
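To make that concrete (my example, not from the episode), here is a minimal Python sketch that splits a purl into the metadata fields mentioned above. The log4j-core purl is illustrative, and this is a naive splitter, not a spec-compliant parser like the one in the packageurl-python library:

```python
# A purl packs package metadata into one string:
#   pkg:type/namespace/name@version?qualifiers#subpath
def split_purl(purl: str) -> dict:
    """Naive purl splitter: a sketch, not a full spec-compliant parser."""
    assert purl.startswith("pkg:"), "a purl always starts with the pkg: scheme"
    rest = purl[len("pkg:"):]
    # Drop subpath (#...) and qualifiers (?...) if present
    rest = rest.split("#", 1)[0].split("?", 1)[0]
    if "@" in rest:
        path, _, version = rest.rpartition("@")
    else:
        path, version = rest, None
    ptype, _, name_part = path.partition("/")
    namespace, _, name = name_part.rpartition("/")
    return {"type": ptype, "namespace": namespace or None,
            "name": name, "version": version}

print(split_purl("pkg:maven/org.apache.logging.log4j/log4j-core@2.17.1"))
# {'type': 'maven', 'namespace': 'org.apache.logging.log4j',
#  'name': 'log4j-core', 'version': '2.17.1'}
```

The "type" field is the ecosystem (maven, npm, pypi, …), which is what lets a vulnerability database know how to interpret the name and version.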
Anyway, let’s get back to the story.
The project I was working on? It was a little proof of concept CLI that would eventually become “bomber” – one of the first open source SBOM vulnerability scanners. I started prototyping using VulnerableCode but then moved on to vulnerability APIs that were available online, but I always wanted to return to VulnerableCode someday.
That day came in December last year when a new issue was created in the bomber project on GitHub. It was titled “Fetch Data from VulnerableCode” and was submitted by one of its creators, Philippe Ombredanne. When we finally connected via email a few months later, I found out a few very interesting things about Philippe.
First, he invented the Purl.
Second, he has a long history with SPDX, CycloneDX, and Software Bills of Materials.
Welcome back to daBOM.
Hey everyone, welcome back to daBOM. I’m here with Philippe Ombredanne.
I’m Philippe Ombredanne. I’m French-American, based in Europe. I’m the maintainer of a bunch of tools in the space, which are software composition analysis scanners. I maintain a vulnerability database called VulnerableCode. The overall project’s called AboutCode (aboutcode.org). And I’m also the CTO of a startup called nexB (nexb.com) that supports these tools and provides services and support around them. And all of these tools, of course, support PURL.
You reached out to me via email and said, “Hey, I see you’ve got this podcast about software bills of materials.” Let’s start off with PURLs, because anybody who’s using software bills of materials knows that the PURL is the identifier for those open source components.
What’s the story behind PURL? Where did the idea come from and what is it?
It’s a way to give a name to a software package, to make it simple. The idea is to have something which is obvious, that you can just look at and say, “Oh, this package is Log4j, yes, of course.” This naming convention is something you can then use to query a database – are there any known vulnerabilities? – or to exchange data between software tools and so on.
That proved to be pretty useful, and I’m really excited, because it’s used nowadays pretty much everywhere in the industry.
Folks that don’t use PURL today will use it, or are working on using it and implementing it. I’m a bit infatuated with it, of course, but I feel these days it’s really becoming the most important glue across tools and standards and specs in the space of SBOMs and cybersecurity.
I first heard about PURLs at Sonatype. I asked one of the engineers, what is this PURL thing? And he said, it stands for Package URL. It’s a pretty strong identifier for where things are and where things reside.
You mentioned how you’ve been taken aback by just how it’s proliferated through the industry. Creating something like that and seeing where it’s gone, how do you feel about that?
In a way, I don’t think I invented much. It’s more reusing ways that were already present and just standardizing them a bit. Back in roughly 2017, we had this tool called ScanCode, which is a scanner, pretty popular and considered the leading one for license detection, for instance. We were starting to parse software package manifests. There was a simple issue, which is we were parsing npms, Maven and PyPI and RubyGems and all that. We needed an identifier of sorts to be able to reference these.
I had stumbled on a project from Google called Grafeas (grafeas.io). They were working closely with an Israeli company called JFrog, which had a product called X-Ray. They had this way to use URLs where, for instance, instead of HTTP in the URL, they were using npm colon slash and the name of the npm at its version.
And I said, wow, that’s super cool. We should use that! We should use that across the board. The folks from Grafeas were participating a bit, but they were not super interested, nor were the folks from JFrog. Eventually I started a bunch of issues to have pretty intense discussions on what it could be.
We were blessed by having the participation of some of the core designers of HTTP and worldwide specialists on URLs, who chimed in and guided us, which led to what we have today. The original need was to say, I need something for ScanCode. Rather than do it on the side, I just involved other folks pretty quickly, and I think that was really essential to the success.
Around the beginning of 2018, I made a presentation at FOSDEM in Brussels, Belgium. Steve Springett from CycloneDX came. There were a lot of contributions: folks from Sonatype, folks from the OSS Review Toolkit, another project that started using it, and folks at Red Hat. I think being able to surrender control and spread the good with others was really important and essential to the success of this.
We aggregate vulnerability data from many data sources. But instead of just keeping them side by side, we are merging them, combining them, and correlating them. We have jobs to improve on the data on a regular basis.
We’ve built a small tool on top of VulnerableCode called VulTotal. It’s like VirusTotal for vulnerabilities. You query it with a PURL and it’s going to look in VulnerableCode, in osv.dev. It’s going to look in Sonatype, in Snyk, whatever is available, and compare: for this package version, what are all these databases saying about vulnerabilities?
It’s actually pretty funny. It’s both funny and scary, because no two vulnerability databases agree on what the range of vulnerable versions is for a given vulnerability. We’re not even talking about severity, we’re just talking about vulnerable versions.
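The episode doesn’t show VulTotal’s internals, but the kind of lookup it describes can be sketched against one of the databases mentioned, osv.dev, whose public API accepts a purl-keyed query at `https://api.osv.dev/v1/query`. The helper names here are mine:

```python
import json
import urllib.request

OSV_QUERY_URL = "https://api.osv.dev/v1/query"  # public OSV API endpoint

def osv_query_body(purl: str) -> bytes:
    """Build the JSON body for an OSV query keyed by purl."""
    return json.dumps({"package": {"purl": purl}}).encode("utf-8")

def query_osv(purl: str) -> dict:
    """POST the purl to osv.dev and return the parsed response.
    This makes a network call, so run it only when online."""
    req = urllib.request.Request(
        OSV_QUERY_URL,
        data=osv_query_body(purl),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # A known-vulnerable Log4j version; the response lists its advisories
    print(query_osv("pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"))
```

Running the same purl against several such services and diffing the answers is essentially the comparison VulTotal automates.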
Sometimes the output comes in and there are three CVEs that are exactly the same, with different severities and different information. It’s confusing, but you know what, for the people who look at it, there’s different information there.
Eventually, what we’re trying to do with VulnerableCode is have a way to aggregate all this data and evaluate it so it’s better. It’s all open source and also open data. There’s no commercial subscription to buy, because all the data is free.
We’re working right now on a way to make sure we can efficiently replicate the data and share curation activities. But some of the things we find in doing these verifications are really weird. We found advisories provided by GitHub that reference packages that don’t exist. Not a big deal, because in the end nobody uses these packages, so there’s no harm done. But it’s a bit confidence-shattering to see that there are packages that have versions that don’t exist.
The whole idea there is trust, but verify.
How did you get involved with the open source community and all these open source tools that you’ve been creating?
It’s been a very directed effort. I used to work, in 2001, in a large consulting organization, and I was let go after September 11. I started a company with a bunch of friends. What we wanted to do was build an open source ERP.
We came from the enterprise software world as consultants. We saw the writing on the wall that open source would be something that would just sweep the world. That was 20-plus years ago, right? We said, let’s build an open source enterprise resource planning application. That was completely nuts. We never did that, but eventually we were building IDEs and tools for large software teams.
One day, there was a guy I had met in the San Francisco Bay Area. He said, hey, you guys know a thing or two about open source, I heard. I’m buying all these software companies. I’d like you to help me figure out what’s in the code of these software companies. And I was like, okay, I had no idea what I was doing. But in the land of the blind, the one-eyed man is king.
We did the first gig. We found tons of weird stuff: code of weird origin, weird licensing. Within six to eight months it became our primary business.
Being tools developers, very quickly we scratched our itch and started building tools. Eventually we were working to build a full open source, open data suite that covers software composition, static analysis, and all the underlying data. And we’re trying to go to primary data as much as possible.
We are sourcing vulnerabilities straight from nginx. They have their own feed of vulnerabilities. I’m pretty sure nobody ever tried to do that before, because it took us months of back-and-forth discussion with the nginx maintainers to understand the data structure they had in mind.
These guys are brilliant, but nothing is obvious, no documentation. So getting the primary data, I think, is really important there.
At our scale, we’re bootstrap-funded. Having a full open source, open data solution is a way to exist in the space. I think it’s helpful; it provides healthy competition for the larger commercial operations.
Tell me a bit about when you have all the components and you have all the PURLs: how do you exchange those? How do you put them into a document, let’s say?
I can tell you how we did it at first. Our first BOM was a Word document with a table in it. Very quickly, we upped our game and moved to much more sophisticated tools like… Excel.
Nice. That sounds like a familiar story: hey, where is this thing? Here’s an Excel spreadsheet. Fill it out.
Excel spreadsheets are still super useful nowadays, and they’re still a tool of the trade for a large number of people exchanging data in the field. So much so that when people have an SBOM, they’re always looking for tools to convert it back to a spreadsheet at some point.
You’re talking about a BOM. You have a BOM as a Word document with a table. You’ve moved to Excel. Where do you go from there?
So the next step was, of course, to write JSON. We were involved very early on with the co-founding of SPDX at the Linux Foundation. Actually, my co-founder at nexB, Michael Herzog, coined the name SPDX, which means we were there reasonably early.
We didn’t have as much bandwidth as I wish we had early on to help support the efforts. We’ve been quite involved on the license definitions, quite involved on the standards side, but not as much as we wish we could have been.
SPDX evolved primarily with roots in software licenses and software license compliance, because there were a lot of academics involved early on: people coming from the Java world, people really in love with XML and RDF. There was a lot of emphasis on these formats early on.
It’s always been difficult for me to understand the value of the complexity you get in a document like an RDF document. I’m more of a command line person, so when you have all these tags everywhere, it’s a bit difficult.
Fast forward to today: definitely still Excel spreadsheets. But more and more, I just leave aside every format that’s not JSON and/or YAML, which are essentially the same.
I find the approach of CycloneDX extremely refreshing, because Steve Springett and the team have been focusing on the essence of what matters. I feel there’s quite a bit of baggage still carried by SPDX. I love SPDX, I’m one of the co-founders, so I won’t bitch about it, but there’s still a lot to say about the simplicity of a standard that’s driven and grounded in practical software development.
Steve built CycloneDX to solve a practical problem he had with Dependency-Track: to exchange and ingest information about software composition between many tools. In that sense, the software came first and the standard followed. Historically it’s not been like this, yet, for SPDX, where the standard came first and the software followed. My personal inclination is towards software first and standards second.
Let’s get into the differences between the specifications. We have SPDX, very verbose. And then we have CycloneDX, which is more streamlined. Tell me the differences you’ve seen developing some of these open source tools and using both of these formats.
Recently there was a large Java code base, about a million files. The SPDX JSON output was 360 megabytes. The CycloneDX output was 1.4 megabytes. The difference was providing the file details, and of course, you don’t have to put that in SPDX. We’re making that an option, because it’s practically impossible to use a 400-megabyte file. It reaches the point of diminishing returns very quickly.
The big difference here is one approach which has been focused more on providing a lot of details, and another approach which is to provide the effective information you need. If you remove the file-level details from an SPDX document, both are about the same.
What’s also different is how CycloneDX’s emphasis is towards package URLs, whereas for SPDX it’s more just a side thing, a reference.
I cannot stress enough how important it is to have clear origin information for code in an SBOM. If you provide an SBOM which is missing a package URL or something similar that can pinpoint exactly where the code comes from, at minimum, say, a download URL, then I would say that SBOM is mostly harmless and useless.
It’s very dangerous, because I think there’s a tendency, under pressure from, say, some federal government organizations demanding SBOMs from their vendors, for a lot of shitty SBOMs to go out there. They will follow the minimum requirements from CISA to the letter, but will be completely useless as a tool to convey the origin of the code that’s used. That origin is what can then be used to ensure you don’t have a major cataclysmic event like the 2021 Log4j thing again. Or at least, if it happens, that we can deal with it in a more efficient way than we did back then.
When we look at the size and the file information in there, when is that important? When is it important to have files and file hashes in an SBOM?
I really don’t think it’s that important at all. The only case where it may be useful is if you want detailed license and copyright information at the file level. But even then, it could be aggregated.
Take a Java package that comes from Apache: you have a thousand files, and each of these has the same exact notice, license, and copyright. What’s the value of repeating that a thousand times? It’s pretty low. If you want a super-high-assurance environment, maybe you’d want file-level details, but I’m not sure super high assurance has to be supported as a use case by SBOMs.
There’s a bit of magical thinking in believing that an SBOM will be able to carry both: the essence of what matters for cybersecurity and software composition, and every single detail about your software from the beginning of time. I think it’s a dangerous approach to believe we can put everything into the same format, the same package.
Talking about licensing, how do these two projects compare? Are they similar but different, and how do they trade off against each other?
We’ve discussed the format of SPDX, and it has some heaviness by default. It’s not mandatory, but the default is geared towards lots of file details.
One thing which is wonderful in SPDX is the license list and the SPDX license expression syntax, which is a way to convey a clear and simple identifier for a license. It cannot be stressed enough how important licensing is in open source, right? In a way, open source is defined by open source licenses. Actually, it’s not “in a way” – it is defined by licenses.
The reason for the success of open source and open source licenses is that there are few of them, and you can put a name on them. If I say M-I-T, three letters, you know exactly what that means.
Try to do that with a commercial license contract. Try to describe a commercial license contract with three letters. Good luck. This ability to have a new language that describes licenses is super useful. It’s been well designed and codified by SPDX, and it’s been adopted by CycloneDX, as it should be. Eventually, every package environment, every software producer should adopt this small standard for describing licenses, which would make figuring out what the license of this stuff is a problem of the past. You could then think about policy about licenses, but not about what the license is. We’re still today in the world of “What is the license?”, which is very wrong.
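To illustrate why this small language is so useful (my sketch, not from the episode): the identifiers below are real SPDX license IDs, but this tiny allow-list is obviously a fraction of the full list, and the parser ignores parentheses grouping and exception names:

```python
# A few real SPDX license identifiers (the actual list is much longer)
KNOWN_IDS = {"MIT", "Apache-2.0", "GPL-2.0-only",
             "GPL-3.0-or-later", "BSD-3-Clause"}

def ids_in_expression(expr: str) -> set:
    """Extract license IDs from a simple SPDX expression.
    Naive sketch: treats AND/OR/WITH as keywords, ignores grouping."""
    tokens = expr.replace("(", " ").replace(")", " ").split()
    return {t for t in tokens if t not in {"AND", "OR", "WITH"}}

def all_ids_known(expr: str) -> bool:
    """True if every ID in the expression is on our allow-list."""
    return ids_in_expression(expr) <= KNOWN_IDS

print(all_ids_known("MIT OR Apache-2.0"))            # True
print(all_ids_known("GPL-2.0-only AND MadeUp-1.0"))  # False
```

This is the policy-versus-identification split the speaker describes: once the IDs are standardized, a tool can reason about license policy instead of guessing what the license even is.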
Why is knowing the license of components and files important?
If you don’t have a license, you’re not allowed to use it. As simple as that. And if you stumble on a random repo on GitHub and it doesn’t have a license, the only thing you’re allowed to do is look at it, and that’s it. Essentially, no license means no rights to do anything with it.
You have to be careful, because there’s really nasty stuff sometimes. There are problems for commercial entities and problems for open source projects as well.
Being able to have clarity in licensing is important. When I choose software, and I hope when you choose software, when everybody chooses software packages to use and combine in their tool, application, or system, you should care about: what’s the license? Is it compatible with my projected usage? Is the project active? Is it well-coded, tested, quality code? Does it have known vulnerabilities, of course. All of these come together in the decision: do I want to use this software or not?
When we start looking at open source components, we want to have as few of them as we can in our software, as few versions as we can in our software. There’s a whole bunch of different things that we have when we start selecting components.
One of the things I like to do is make a decision and ask: can I write this just as fast, essentially, without using a third-party library, using the standard base foundation classes or the standard libraries of the language I’m using?
From a practical perspective of SBOMs, as a developer of open source software, and especially of tools that produce and consume SBOMs, what are the benefits for you as a developer of using the CycloneDX format?
The other day I was trying to transfer a bunch of package information from one tool to another. What I did is, literally, I exported a CycloneDX SBOM on the left, imported it on the right, and I was done. I was surprised it worked, but it worked really nicely.
I’ve been able to import an SBOM from Syft. I wanted to do some comparison of how well we fare in ScanCode.io compared to Syft for container SBOM creation. I was amazed I was able to import it right off the bat.
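For readers who haven’t seen one, the kind of document being round-tripped here is small and readable. This is my minimal illustration using CycloneDX 1.4 JSON field names, with an illustrative component, not an SBOM from the episode:

```python
import json

# A minimal CycloneDX-style SBOM document (CycloneDX 1.4 JSON field names)
bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.4",
    "version": 1,
    "components": [
        {
            "type": "library",
            "group": "org.apache.logging.log4j",
            "name": "log4j-core",
            "version": "2.17.1",
            "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.17.1",
        }
    ],
}

doc = json.dumps(bom, indent=2)
# A consuming tool mainly needs the purls to key its lookups:
purls = [c["purl"] for c in json.loads(doc)["components"]]
print(purls)
```

The purl on each component is what makes the exchange between tools work: whatever tool imports this document can use that one string to look the component up anywhere.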
We have to be really careful to ensure we have standards that are decreasing in complexity. Eventually, there’s fewer moving parts and fewer opportunities to do things which would be hard to process down the road.
So, transferring data from one system to another: I was blown away. Literally, I could do that. We had written the code to import and export on both sides, but never really tested it this way, for a practical purpose.
But some of the extensible things, like external references, drive me crazy, because that could be anything. From a software manufacturer or developer perspective, open source or commercial, you can put anything in any field in an SBOM and hope for the best. There’s really no standard for what things go where.
Am I wrong in saying that?
You are a hundred percent right. I’ve participated in a few events organized by SPDX, called DocFests, where you had open source and commercial producers and consumers of SBOMs banding together for a couple of days, trying to make sure they could read and write each other’s SBOMs. Frankly, there were no two tools that would produce the same output given the same input.
That’s just going to be a nightmare for the federal government to manage if they’re getting things in different formats.
I don’t think it’s getting better, because it’s not enforced. The only thing that brings a bit of sanity there is strict validation, which goes above and beyond the standards to have fewer optional parts. I’ve seen, literally, SBOMs created by responsible people with great tools, and they were completely incompatible, created for the exact same input software. That’s a real problem.
It’s going to have to come soon, before we get too far. And that might be why the government said, hey, here are the minimum requirements for software bills of materials.
Even if you look at these minimum requirements, they’re both under specified and over specified at the same time.
For instance, the insistence on having a supplier: not super useful. And you don’t really have to provide a clear reference, like a PURL or download URL, which identifies the software uniquely.
In trying to meet the minimal elements required by CISA, I found both difficulties meeting these requirements, and other things so under-specified that they lead to an SBOM which wouldn’t be enough.
The point I want to stress here is that if you don’t provide a good identifier, a good way to get to the software in question, the SBOM is useless. It’s going to be super difficult to get many commercial software vendors to agree to that. But if they don’t, the users of software who are the recipients of SBOMs could be given toilet paper and it would be just as good.
You mentioned something interesting about PURLs in the minimum spec: that we don’t necessarily need them. That sort of contradicts what we said earlier today about having the PURL as an identifier for a specific component.
In some cases, you can get by a second way, with just a download URL, if it’s unique and stable.
In fact, take zlib, which is on zlib.net. There’s a git repo maintained by Mark Adler, but for practical purposes, there’s a canonical place to get the software, and there’s no package manager for it. It’s available elsewhere, but there’s no package manager.
So what you do in this case is create a generic PURL, and you want to provide a download URL with it. You’re going to call that generic zlib at 1.2.3 or 1.2.18, and you put in a qualifier saying download_url, pointing at the archive on zlib.net. And there you go.
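A generic purl like the one described can be built in a few lines. This is my sketch; the version number and tarball URL below are illustrative, so check zlib.net for the current release:

```python
from urllib.parse import quote, urlencode

def generic_purl(name: str, version: str, download_url: str) -> str:
    """Build a pkg:generic purl with a download_url qualifier.
    Qualifier values are percent-encoded, as the purl spec requires."""
    qualifiers = urlencode({"download_url": download_url})
    return f"pkg:generic/{quote(name)}@{quote(version)}?{qualifiers}"

# Illustrative version and URL for a package with no package manager
print(generic_purl("zlib", "1.2.13",
                   "https://zlib.net/zlib-1.2.13.tar.gz"))
# pkg:generic/zlib@1.2.13?download_url=https%3A%2F%2Fzlib.net%2Fzlib-1.2.13.tar.gz
```

The `generic` type plus the `download_url` qualifier is exactly the escape hatch for software that has a canonical home but no ecosystem package manager.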
There’s value in this. As part of the Python library that I maintain for Package URL, there’s a small contribution called url2purl and purl2url. Eventually I want to create a service and more libraries and expose it, so it could be translated into other languages, but the idea is to have a bunch of heuristics. You get a URL as an input, and you get a PURL or many PURLs out.
For instance, if I give you a URL to the Maven Central Java repository, or to npm, or to PyPI, or to GitHub, you can infer a lot of the PURL that should correspond to it, and the other way around.
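The heuristic idea can be sketched in a few lines. This is my toy version covering just two hosts; the real url2purl module in packageurl-python handles many more cases and edge conditions:

```python
from urllib.parse import urlparse

def url_to_purl(url: str):
    """Toy URL -> purl heuristics for two well-known hosts.
    Returns None when no rule matches."""
    parts = urlparse(url)
    segs = [s for s in parts.path.split("/") if s]
    if parts.netloc == "github.com" and len(segs) >= 2:
        # github.com/<owner>/<repo> -> pkg:github/<owner>/<repo>
        return f"pkg:github/{segs[0].lower()}/{segs[1].lower()}"
    if parts.netloc == "pypi.org" and len(segs) >= 3 and segs[0] == "project":
        # pypi.org/project/<name>/<version>/ -> pkg:pypi/<name>@<version>
        return f"pkg:pypi/{segs[1].lower()}@{segs[2]}"
    return None  # unknown host: no purl inferred

print(url_to_purl("https://github.com/package-url/packageurl-python"))
print(url_to_purl("https://pypi.org/project/packageurl-python/0.11.2/"))
```

Each registry has a predictable URL layout, which is why a pile of such per-host rules can translate URLs to purls and back with good coverage.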
We were talking earlier about the minimal requirements for SBOMs from the US government. One thing I find difficult is the notion of relationships and dependencies between components. In practice, say you do a binary analysis of a product and you’re able to infer an SBOM based on that. You will likely not know anything about the dependencies and relationships of these packages.
You have a flat list of what’s used in the product. I find it difficult for these relationships between packages and components to be systematically a mandatory requirement. I don’t think it’s useful in practice. If I use Log4j in my product, it doesn’t matter whether it’s a direct dependency or a dependency twenty levels removed.
It may help if I do remediation, but otherwise, license requirements, bugs, security issues, they’re the same whether they’re my doing or someone else’s doing in my dependency graph. It’s all the same to me.
The one thing I would recommend: if you’re interested in where this is going, definitely participate in the CISA working groups. SBOM.gov is the URL there.
Philippe, where do you see SBOMs going in the next five years?
Oh, that’s a tough question. I see the effective exchange of software composition data as leveling the playing field for everyone in the space. Eventually, software composition analysis tools will have to morph to provide something else, because it’s going to be, if it’s not already, a commodity, as it should be. That’s probably going to be the biggest trend.
On the other hand, in this space there are a lot of smaller vendors. Buyers don’t want to have to deal with lots of vendors. There’s going to be consolidation also, of that I’m sure.
But there’s a huge amount of work still to do on the data side. I should not have to think even for a second about the origin or license of a piece of code I use. I should have actionable metadata, ideally in the format of an SBOM, but something I can trust. It’s not so much trust in the crypto sense; I don’t care about signatures there. I want it to be trusted because it’s been created and reviewed by a community of people I trust.
Having shared commons for data is going to be an important frontier. I’m trying, very modestly, to contribute to this with two projects. One is called PURLdb, which is eventually a database of reference information about package URLs and everything behind them; the other is VulnerableCode, for vulnerabilities.
I wish there were more efforts in this space. Being able to establish these data commons is really important. A lot of the marketing pitch of commercial software vendors in the space is that they have super-duper special, private, premium data. I don’t know if you’ve seen the movie Glengarry Glen Ross, Glen – I’m sure you have, we’re about the same generation. They have their premium leads. There’s an addiction to private data about security which has been very unhealthy, because you don’t want to put a tax on oxygen. What’s the difference between having super-secret vulnerability information that you provide as part of a proprietary database or service, and the guy that sells an exploit?
Conceptually, it’s the same. You’re trying to take advantage of dissymmetry and asymmetry in the information, and you’re trying to keep this information to yourself for commercial gain. In the domain of security, this should disappear. There are lots of opportunities for software companies to compete and provide interesting solutions.
Competing on data and keeping this data close to their chest is total nonsense.
It’s almost a disservice to society. Just hearing you talk is making me think, man, I’ve got to crack open a laptop tonight and start contributing more to open source software. Start coding.
Lawrence Lessig, who was a professor at Stanford University and is the founder of Creative Commons, has talked a lot about what you’d call the tragedy of the commons when it comes to open source.
It’s the idea that you have a shared field in the village where everybody sends their cows to graze. It’s shared, so everybody’s taking just a tiny portion. Except once everyone has sent all their cows, there’s no grass left for anyone to benefit from.
That’s the risk with this software vulnerability information especially. It’s even weirder when you think that this is data about open source code, which is open in the first place.
So right now I have a project, for instance, that’s really interesting, which is to try to find two tiny bits of information. One is: given a package with a known vulnerability, what is the commit that fixed the vulnerability in this package? Interesting. Is that the commit, or the patch?
Then the second step is going to be to analyze the call graph using static analysis, and the data flow, to figure out: oh, we have these two methods that need to be called, and they can be called only through this path. Then you can store this information: you have the patch, you have the call graph and the entry point.
And the third piece is going to be doing static analysis on the code, your code, my code, to figure out: are we potentially vulnerable? This is something now done only by commercial companies. I know Snyk does that. There are a bunch of startups in this area, but I think it’s interesting to do it as open data.
If there are a few people that can help us support and sustain this effort, the more the better. It’s really important, because we see security teams swamped by the volume of poorly created, poorly qualified vulnerability information, and being able to extend that to “is this vulnerability reachable?” helps.
It can help a lot if you are absolutely positive that there’s no way you reach this code: yes, you use Log4j; yes, it’s vulnerable; but there’s no code path that goes from your code to that code. That’s one less fire you need to extinguish, and you can put your efforts in a different direction.
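The reachability idea can be illustrated with a toy call graph. Everything here is made up by me (the function names are hypothetical, and real tools build the graph via static analysis rather than by hand), but the core question is just graph reachability:

```python
from collections import deque

# Toy call graph: caller -> callees (hypothetical function names)
CALL_GRAPH = {
    "app.main": ["app.handle_request", "app.render"],
    "app.handle_request": ["log4j.info"],
    "app.render": [],
    "log4j.info": ["log4j.format"],         # benign path into the library
    "log4j.lookup": ["log4j.jndi_lookup"],  # the vulnerable sink
}

def reachable(graph: dict, start: str, target: str) -> bool:
    """Breadth-first search: can `target` be reached from `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for callee in graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False

print(reachable(CALL_GRAPH, "app.main", "log4j.jndi_lookup"))  # False
print(reachable(CALL_GRAPH, "app.main", "log4j.format"))       # True
```

In this toy example the app uses the vulnerable library, but no path from its entry point reaches the vulnerable sink, which is exactly the "one less fire" case described above.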
This episode of daBOM was created by me, DJ Schleen, with help from sound engineer Pokie Huang and Executive Producer Mark Miller. This show is recorded in Golden, Colorado, and is part of Sourced Network Productions. We use Captivate.fm as our distribution platform and Descript for spoken text editing.
You can subscribe to daBOM on your favorite podcast platform. We’ll be releasing a new episode every Tuesday at 9:00 AM. I’ll see you next week as we continue to diffuse daBOM.
Philippe Ombredanne is a passionate FOSS hacker on a mission to make it easier and safer to reuse FOSS code.
He is the maintainer of ScanCode, the industry-standard license detection tool, and of other open source tools for software composition analysis and license & security compliance at https://aboutcode.org
Philippe contributes to several other projects, including the Linux kernel SPDX-ification, the SPDX and ClearlyDefined projects, strace, and several Python tools, and previously to JBoss, Eclipse, and Mozilla. He has also been a long-time Google Summer of Code mentor and org admin. Work-wise, he is the CTO of nexB, a company that helps software teams track what’s in their code with DejaCode, an open source governance and compliance dashboard.