Do You Want Some GUAC with that SLSA?
I read an interesting post on Twitter the other day about Software Bill of Materials. The author said “SBOMs promise a picture of what lies beneath the surface of software, but without large scale automated binary analysis, at best, they reflect intent not reality. As a result, relying on them is like being an explorer without a compass.”
The author does make some good points here. Large scale binary analysis is definitely lacking in some regards – but the technology is there to do it, and we’ve had a guest on the show that has talked about how they’re doing it today for mobile apps.
But binary analysis is only one use case. There’s so much more to Software Bill of Materials.
As for the compass, even as late as the 1700s European explorers were still using astrolabes to navigate by the stars. The magnetic compass, long established in Asia, was often carried only as a backup to the astrolabe.
What that shows is you don’t need to have a compass to be an explorer.
Just like you don’t have new technologies without innovators like Tim Miller. He’s one of the folks behind GUAC, an acronym for “Graph for Understanding Artifact Composition.” It’s an open source tool that aggregates software security metadata into a high-fidelity graph database.
What does that mean? It means that it ingests SBOMs and provides a way for users to query that information.
Tim reached out to me after seeing GUAC as part of my “SBOM Reference Architecture” in a LinkedIn post that hit his feed. After getting on a quick call to discuss what I had planned for GUAC, I knew I had to get him on the show.
What do we do with SBOMs after we get them? Buckle up, because we’re going to talk about one thing you can do…
Welcome back, to daBOM.
Welcome back to daBOM everybody. I’m here with Tim Miller from Kusari, one of the co-founders of the GUAC project. I had done a presentation earlier this year at RSA, and you pinged me and said, “Hey, what’s with this GUAC icon here?” And then we had a conversation, and I’m like, we gotta get you on the show to talk about all things GUAC, some of the practical things around managing and dealing with SBOMs.
Tell me a little bit about yourself and what your role is today.
Sure. I spent most of my career in financial institutions of various sorts. The majority of my career was spent at Bridgewater Associates, a hedge fund in Westport, Connecticut. I was there for about 12 years.
The first half of my tenure there was doing trade algorithm development, and then when AWS became a thing, we really wanted to get into the cloud and enable the devs to move fast. However, with an algorithmic trading firm, the source code is effectively the intellectual property of how the hedge fund thinks about trading. There’s absolutely no acceptable risk around that source code.
So our task there was to build a dev environment that could track what people are doing, enable them to move fast, not get in their way, and understand where that code that they wrote went, where it got packaged, what touched it in the middle. If you squint and step back, that’s what folks call supply chain security today. That was our first venture.
Then I moved over to a bank where I led the innovation team and again found a similar problem, where you’re trying to move away from bureaucracy-based security into actual security, and you’re very much forced into that same mechanism. Then I went over to Citi to work with folks like John Meadows and deal with the supply chain security problem there.
And again, we found a very similar problem. We just found that no matter what tools we had in the ecosystem, despite having access to all the tools in the ecosystem, we’re still left with a difficult problem of understanding where everything was in the first place. As an engineer, that can be really hard, particularly when you’re in a panic mode.
That’s really why we left and started Kusari, and started working on the GUAC project.
Tim, tell me about industry adoption and support for the project. There’s some big names in there, right?
Yeah, there are. The primary backers of the project are listed on the website, but in no particular order: Google, Kusari, Purdue University, Citi, Yahoo, Red Hat. There are quite a few folks on the big tech list that are really jumping into this with both feet.
What is GUAC and what does it do?
GUAC is a knowledge base effectively of how all of the artifacts in your entire supply chain are put together. It’s in the form of a graph. GUAC stands for Graph for Understanding Artifact Composition. That is a backronym because we were working with the SLSA specification, and GUAC is a funny name to go along with SLSA. The goal is to allow folks to make more practical decisions around their supply chain data.
So is it a storage mechanism? Is it a query mechanism? What does it give from a value perspective to people who use it?
What it’s doing at its heart is ingesting metadata about your application to understand how all those things are put together and then allow you to trace those dependencies that come either from direct dependencies or indirect dependencies or transitive dependencies, no matter how deep they might be so that you have a very clear understanding about how all of your dependencies in your entire chain work together.
It’s doing a couple of things. One is it’s reading all that information and lining it all up. The typical use case here is when the Log4j vulnerability came out a year and a half or two years ago. One of the difficult problems in solving the Log4j issue was understanding where exactly it is, so that you can get rid of it effectively and be sure that it’s not going to pop back up again.
What GUAC can do is it’ll read all your information, put it all together, and allow you to pivot your questions around, not just saying, “Hey, when I scan my artifact, what’s in it?” That really only solves one particular view of one particular build that you’re looking at now as opposed to an organization trying to get rid of that thing entirely.
It’s not only in that one thing. You don’t only want to rely on your scanning schedule to do that. It’s assembling all that data into a graph, providing you the API layer to query your supply chain and answer those questions.
Do I have to be a scientist to use this and install it? How do you start lifting this off the ground?
Getting it started should be relatively straightforward. There’s a base level of knowledge needed with things like containers and general ecosystems, but overall it should be relatively straightforward.
The basic process is you instantiate the infrastructure behind GUAC. There’s various binaries for the different pieces to just run on any architecture. That should only take a handful of minutes to get spun up. And then from there you’ll have some software metadata that you would like to ingest into GUAC. And there’s a couple command line tools that are part of those setup scripts that come with it.
Your next step is to ingest something like a Software Bill of Materials into GUAC. And then from there you can start exploring and answering some questions that you might have. That would come in the form of querying.
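The steps Tim describes end at GUAC’s GraphQL API, where questions become queries. As a rough sketch of what that last step might look like, here is a small Python helper that builds a GraphQL request body asking which packages depend on a given component. The endpoint address and the `IsDependency` query shape are assumptions based on a typical local GUAC deployment; check the project’s docs for the exact schema in your version.

```python
import json

# Assumed default endpoint for a local GUAC deployment; your
# deployment may expose the GraphQL API at a different address.
GUAC_GRAPHQL_URL = "http://localhost:8080/query"


def build_dependency_query(pkg_name: str) -> dict:
    """Build a GraphQL request body asking which packages depend on
    `pkg_name`. The query and field names here are illustrative;
    consult the GUAC GraphQL schema for the exact shape."""
    query = """
    query Deps($name: String!) {
      IsDependency(isDependencySpec: {dependencyPackage: {name: $name}}) {
        package { type namespaces { names { name } } }
      }
    }
    """
    return {"query": query, "variables": {"name": pkg_name}}


# Build the request body; you would POST this as JSON to the endpoint
# with any HTTP client and read back the matching graph nodes.
payload = build_dependency_query("log4j-core")
body = json.dumps(payload)
print(payload["variables"])
```

The point of the sketch: once the SBOMs are ingested, “where is Log4j?” becomes a query against the graph rather than a rescan of every artifact.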
We’re working at all times to make that barrier of entry as low as possible. We don’t want this to be complicated. We don’t want this to be something really difficult and we don’t want you to have to be an SBOM expert or even know what that might be in order to use this.
Our whole goal is to enable folks to make a lot of these complicated queries very easy. We’re constantly working to bring down that barrier of entry. Getting started should be pretty quick.
Take for example, the scenario where you have automated CI/CD pipelines, you’re generating Software Bill of Materials. Is that a great point to start sending things over to GUAC for storage analysis and introspection or inspection?
That’s a great point to start. If you’re generating an SBOM, which many folks are now, you can just add a step in your workflow to send it over to GUAC, or if you’re storing them in a repository, GUAC can read that too.
We’re basically trying to enable different mechanisms for either it to sit there in the background and poll your different repositories that might exist or to trigger it directly in something like a workflow like you’re describing. Both of those methods work really well.
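As a concrete sketch of the workflow step Tim mentions, a CI job could generate the SBOM and hand it straight to GUAC. In this hypothetical GitHub Actions fragment, the SBOM action, the `guacone` command and flags, and the collector address are all assumptions for illustration; the real invocation depends on your GUAC deployment and version.

```yaml
# Hypothetical CI job: generate an SBOM for the build, then push it to GUAC.
sbom-to-guac:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4

    # Generate an SBOM for the repository (tool choice is illustrative).
    - name: Generate SBOM
      uses: anchore/sbom-action@v0
      with:
        format: spdx-json
        output-file: sbom.spdx.json

    # Ingest the SBOM into a running GUAC instance. The guacone CLI
    # and the --gql-addr flag are assumed here; check your GUAC
    # deployment's docs for the exact command and endpoint.
    - name: Ingest into GUAC
      run: |
        ./bin/guacone collect files sbom.spdx.json \
          --gql-addr "$GUAC_GQL_ADDR"
      env:
        GUAC_GQL_ADDR: http://guac.internal:8080/query
```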
Thinking about storage, you said that GUAC can get things out of a repository or some central location, or you can send things to it. Is that like a GitHub repository or an S3 bucket? Neither is really a good place to store large documents like an SBOM, right?
This topic in particular is where we and the GUAC team might have a slightly different perspective than a lot of the other folks on where the important pieces around SBOMs lie.
Here’s where you’ll start to see a lot of conversations around how do I distribute SBOMs properly, or where am I keeping them. I don’t really think that’s the important question or the important activity that you want to do. You as a user of something like GUAC are trying to answer questions about your supply chain, and SBOMs are a means to an end to answer those questions, to do something different than you’re doing today.
So you have a question that you can’t answer. The SBOM is one of the starting points to get that information, but once you have it, you don’t necessarily need that SBOM anymore. Likely your SBOMs are going to evolve and change; they’re going to be somewhat dynamic in nature.
You want to continually evolve your picture about what you’re dealing with, continually manage your understanding, and take advantage of that data and then do something with it. That’s the whole point, is to do something with all this data. And that’s really where we focus.
So let’s dissect that a bit, storage not being the prime use case. That actually makes a lot of sense because these documents are huge. If we’re getting them every build, we’re going to be paying millions of dollars for bucket storage on any cloud platform.
So when you take these SBOMs in and you ingest them, you say you dissect them. Do you toss the original SBOM out? Is it something we don’t need anymore, where we only keep the latest information about that component?
At the moment, that is what we do, although that’s not necessarily the intent; that’s just where we are at the moment. We are working to preserve the link to that original SBOM so that your graph has a data starting point, but not necessarily using GUAC as a distribution mechanism, or at least that’s not the primary use case. That would be more of a side effect, something else it can do.
When you look at what’s in that SBOM, you’ve got all sorts of data and all sorts of interesting pieces in there. The SBOM is just the starting point that lets you then go explore all sorts of other things. What GUAC will do is it will put it into its own data model and then it will go start looking at other data sources that we can pull in to amend that information in the SBOM to enrich what your original picture is.
Things like OSV, the vulnerability databases; things like deps.dev, for information about how things are put together. OpenSSF Scorecard gives you not only vulnerability information but quality signals: is this thing good to use? Is this a thing that I should use? Are there any quality risks around it?
And continually pulling and searching those public data sources for all sorts of other information with the goal of giving you as much information about what’s in your supply chain to identify risks and reduce them.
Interesting. So it’s starting to look like GUAC is definitely this analysis tool for getting information out. You’re using a graph API so that information could be pulled into other systems?
Ideally, what you’d be using GUAC for is to have it run behind the scenes so that it’s enriching your existing workflow in an easier way. We’re not of the opinion that throwing another tool in your face is going to be the way to make you more successful in reducing your supply chain risk.
However, there’s a lot of information in your existing tooling that could really benefit from how GUAC views it. Long term, we’d love GUAC to be integrated with all sorts of other systems in the backend, to make the friction as low as possible so you can just drop it into your environment and have it instantly just make things better and be as adaptable as we possibly can to how folks work.
What kind of questions have you seen people ask, or what kind of questions do you think people should ask when they’re looking at getting information about their Software Bill of Materials and what they have?
There are a couple of questions folks have been asking, and some of our demos point to this, but the easiest example is: is the vulnerable version of Log4j in my supply chain somewhere? That’s not necessarily something you can answer just by looking at your direct dependency list, or even just the SBOM at the first layer. You need more information.
That’s a good example of something a lot of people practically still care about finding: hey, do I have that thing somewhere?
Some of the questions that folks are starting to ask that we can’t quite answer yet, but we’re on the path to, are things like: do I have anything in my supply chain with a CVE score of X within the last Y timeframe? That starts to answer the question, hey, what does my evolving vulnerability picture look like?
Separately, I think the other thing that folks are looking for is license information. Hey, is GPL somewhere in my stack that I’m not aware of? That’s a really practical question that a lot of folks would love to answer. Things like that are what we’re working to be able to identify: not only vulnerability information, which is usually the common topic that comes up with SBOMs, but all sorts of other things.
What kind of questions have you been hearing, things that you don’t necessarily support yet, but sound like good use cases for GUAC and the development going forward?
There’s a bunch of really great higher-level questions that folks are starting to ask, and it really gets at a couple of things that we’re circling around but can’t quite do yet.
There’s a big auditing use case, or a diagnosis use case, around what information did I have at a particular point in time. When something was exploited, or when something popped up in the news as vulnerable, did we know that at the time? Did we have all the information to do something about it and didn’t? Or did we not know it was vulnerable at that time, and this was a totally reasonable outcome?
There’s this going back in time and understanding “who knew what, when” use case that I think is really interesting, that a couple of folks are starting to ask about. That’s not only auditing but also root cause diagnosis: understanding, hey, what might you want to change in your process, or what would you do differently to improve? That’s a big one that I’m excited about, because it adds a whole set of information that we’re not dealing with yet in terms of temporal data and things like that.
But I love that one. That’s a really exciting one.
What other information can we get out of things like GUAC? We get an SBOM from our vendor, we pull it in, we do some analysis on it, and we say, hey, 50% of the components they use are over five years old and three major versions behind.
Are there some decisions that we can get out of that that say there might be a higher level of risk to acquiring software from this vendor because of that?
There’s all sorts of metadata in that vein that GUAC can start to answer for you that either come in the use case of I’m inspecting someone to assess the risk of using either this particular artifact that they provided me, or using artifacts from this vendor entirely. Maybe you’re building up a pattern there.
There’s a combination of this metadata that GUAC can ingest versus some metadata you can add specifically into GUAC through things like the CLI to mark things as bad or mark things as vulnerable, and then provide some better decisions.
One of the things that we are looking at as well is identities and the different types of certificates that might be present on different vendors, to allow you to mark certificates as bad. Okay, I don’t trust vendor X, but I do trust Yahoo; their stuff is really good, so I’ll always by default allow something signed by Yahoo’s key to come in. That’s something we’re looking to add to that metadata picture as well.
So how has the executive order and some of the new recommendations, mandates, policies, whatever we want to call them, in the public sector, government, how has it affected where GUAC is going and any use cases that you’re thinking about when you’re designing the product?
Initially it didn’t affect too much because the general reaction we saw in the beginning was, Hey, I need an SBOM. How do I generate it? How do I make one of these things? Which spec am I supposed to use? Which tool am I supposed to use? They all look different.
The initial reaction was much heavier on the generation of SBOMs. Now I think folks are starting to come into that, okay, now what do I do with it? And that’s exactly what we were trying to position ourselves for, is now that you have all these things, what is the point of any of this, at all? It’s not just a check the box to generate something, which is super important, but now hopefully you can use that.
We’re now starting to see, what sorts of questions do I even want? We’re still seeing a lot of curiosity and confusion about exactly how folks can do this, because I think it’s still relatively new for a lot of people in how they think about managing it. It allows you to be far more proactive than you typically could be.
It’s a really curious, exploratory environment that we’re starting to see outside of things like FinTech and the insurance areas, which have a little bit more specificity in how they would like to approach this problem.
What kinds of industries do you find are attracted to GUAC and have that maturity where creating an SBOM is a commodity? (Creating a well-formatted SBOM is not; we’ll get into that in a second.) What industries have you seen really start reaching out about GUAC and using the technology?
We’ve seen a lot of FinTech, so a lot of hedge funds and similar sorts, that are very curious about it. They have a similar problem to what I touched on before from my time back at Bridgewater: there’s a real need to actually secure things in those environments. The risks are very high, the funding tends to be very high, and they move very fast. There’s a general pivot toward using more information much more quickly in an environment like that, as opposed to something more heavily regulated or controlled. That’s been a big one that we’ve seen.
Similarly, the insurance agencies have been asking quite a lot about this, for that “who knew what, when” use case: what can we prove? What can we not prove? Was the decision here, or the outcome, reasonable? Should this be covered by a policy? There are all sorts of questions they’re starting to ask as well.
Those are two big ones, and big tech has been quite interested in this as well. Folks like Yahoo and Google. Those places are really quite far ahead and generally forward-leaning anyway, and they’re looking to leverage this information to build policies on top of this data so you can programmatically start to shut stuff down, as opposed to using a more typical release process to gate a lot of this going forward.
You guys are right at the intersection of theory versus practicality, to an extent. We can theorize and talk about the philosophy of SBOMs and compare it to the manufacturing industry and the automotive industry, et cetera. Let’s address the differences between the specs and formats. What’s the biggest challenge ingesting these things?
Our current viewpoint is that SBOMs, all the specs, have a similar issue in that they have both too much and too little information at the same time. In terms of a valid SBOM, in any spec, it can have almost nothing in it. From an ingestion perspective, that incredible level of flexibility on the specification makes reasoning about them quite difficult.
What exactly does a valid SBOM contain if it has no information about the software? We’d love to see that addressed across all the specs. And there’s all sorts of information in there that, from a software perspective, doesn’t really help that much.
What exactly an SBOM is describing is the biggest problem that we’re seeing as a tool that’s ingesting them and trying to reason about their contents.
One of the biggest nightmares that I see is NOASSERTION in 99% of the fields in an SBOM. If there’s no assertion, why are you putting it in there? You’re just taking up space. What are your thoughts on no assertion or no attestation?
It’s a similar problem to attaching vulnerability data in there. The difference between an SBOM and then the different assertions that you make about the concepts of the SBOM at any point in time, have different time windows associated with them. It can be difficult to necessarily put them all together into one document because you’re inherently linking things with different time windows.
If I didn’t change my product whatsoever, but a vulnerability was identified, is that technically a different SBOM, or is that more metadata about my SBOM? Is that something I should link, or something I should embed?
In general, we would love to see a bit of a separation between the SBOM itself and the data about the SBOM, because I do think they can be different. I realize I’m touching on a bit of a hot topic for a lot of SBOM folks, but that’s something we’ve struggled with quite a bit in trying to reason about them really well.
It’s a constantly evolving threat landscape. We should keep these separate, to an extent. Now, what does that do for a tool creator, when you have SBOMs and then you have all this metadata? How do you, for example, bring in VEX or VDR, which could potentially change every 10 minutes?
We would love to be able to separate those problems; that’s the model we would like to operate in. Separating these things out makes our job easier for a few reasons.
One is there’s a lot of concern, like we mentioned before, in the storage of SBOMs, that they can tend to be pretty big depending on the size of the artifact that it’s for.
Because an SBOM can potentially contain anything and everything, you start to have to make everything optional to prevent file-size explosion. If the spec can be a little less flexible and a little more specific, then we can start to optimize how we deal with the serialization of things like SBOMs without worrying about all the different ways things can potentially hook in.
Separately, the mechanism by which you might do constant scanning or assertions about the contents of that SBOM would occur at a potentially different frequency than you ingest the SBOM anyway. Separating them could really let us optimize how we think about and architect some of the ingestion mechanisms in GUAC, and some of the certification mechanisms, as opposed to having to treat it all as one big step.
I like the way you’re talking about digesting the information and making it approachable, because nobody’s going to be able to look at a JSON SBOM in any non-technical way and say, oh, I understand. Even technical folks are going to look at that and have their eyes roll back in their heads. That’s where GUAC needs to separate things and actually make it digestible as well.
Yeah, for sure. What I would advise many of the folks in the space to do is step back and think about the questions that you’re actually trying to answer with any of these. The use cases will really help you understand what should be in there and what sorts of things you even need a bill of materials for. There could be a shift of empathy toward the folks who need to consume these things: understanding how they would like to consume them, what questions are on their mind, and what problems they’re facing with their existing tooling, so that you can tailor the information to really help them out.
There’s a little bit of a lack of focus on who’s actually doing anything with these. In theory, there’s a lot of really great stuff you could produce, but what can somebody do right now? What pain are we solving?
Do you think the concept of SBOMs is going to fade out a little bit? Thinking about just the past couple of years, we’ve had a lot of hype around SBOMs.
And don’t get me wrong, I think that they’re a fantastic mechanism for sharing and for inventory and security all at once. But are we going to get distracted as an industry? Do you think it’s going to be something that keeps getting momentum?
There’s always going to be a bit of a shiny ball syndrome in tech no matter what’s going on. Folks are going to pivot and look, but then at the end of the day, you still have these problems to solve. And no matter what, you’re going to see folks circling around how do I answer these questions better? How can I improve my risk posture? Right now, I think SBOMs are a great way to do that.
No matter what, you’re going to see folks circling back to this. But again, the real thing to focus on is what questions are people trying to answer that they can’t today? Where’s the pain? Why can’t people secure their supply chains reasonably well now? That will continually push you down the path of things like SBOMs.
Knowing the information in there is a really important step to solving these things. It can help feed things like the AI overlords to help us make better decisions, but you’re still going to need that data; you can’t make those decisions without it. You can reverse engineer things and dissect them later, but there are lots of use cases where you need that information before something like a CVE database is even updated, when you know something’s bad. This is going to continually be something that folks need to do. It’s a necessary function for folks to operate securely.
Where do you see GUAC supporting all this five years down the road?
We’d love to see GUAC be the source of truth for what your supply chain looks like. From there, there are all sorts of things that can be built on top, so we’d love to see GUAC be the foundational layer that a lot of additional tooling uses to make better decisions. With GUAC positioned as the source of truth for how everything is connected and where you’d need to go to find something, there are all sorts of possibilities I’d love to see; that’s where the creative juices flow.
So for the listeners who are saying, okay, I know about GUAC, I know about SBOMs, I can ask some questions: how should they approach the problem? Before I even download GUAC or tell my developers about this, what do I need to think about? What questions do I need to frame? Take me down the road to the point where I have GUAC installed and I’m achieving value out of it.
I would start by saying don’t necessarily think about GUAC. I would say, what pain do you have in your environment today? Where really are you struggling? Think about the information management problems that are affecting you.
Are you getting a list of CVEs to fix that have nothing to do with you? Are you trying to migrate away from something and don’t understand how to do that? Are you struggling to fix a vulnerability somewhere? Then start to think about what the source of that problem is.
If you find that it’s related to understanding how things are put together, or where certain things may be located, GUAC’s a great way to help you solve those problems. Really, just focus back on the problems that you have, no matter what you’re doing, and then see how GUAC can help lift you up in that journey.
Separately, if you’re just curious about SBOMs and want to see how things like that might look, you can run GUAC, you can run the GUAC visualizer, which will help you understand and see the tree or how things are put together and really open your eyes in terms of how complicated some of that really is.
I’ve seen a lot of outputs from some of those visualizers, and it’s pretty complicated. We’ve got to figure out how to drill down into those and get some more data, some more knowledge, right? Because it’s almost like looking at a molecular structure. GUAC turns into the microscope at that point, to an extent.
It does. It’s either the microscope or the telescope, depending on how big your graph is. You can find some interesting things. You can see: do all your dependencies come down to one single point? What is that thing? Who’s maintaining it? Do I know that person? Do I trust that person? Do they work with a competitor? There are all sorts of interesting questions you know to ask once you see it.
Seeing is believing, and I encourage everybody to kick the tires on this thing. Again, it’s open source. What’s your GitHub repository name?
The easiest way to find it is guac.sh, our open source project’s website, which links to GitHub and has all the docs, demos, examples, and how to get started.
How can people help out? What are you looking for from some of our technical listeners who are like, Hey, I want to get involved in this. I want to contribute. How do we start there? What kind of things are you looking for?
Oh, all sorts of stuff. Primarily, what we’re dealing with at the moment is that there are all sorts of different persistent backend options that we’d love some help with. Doing this at scale, performantly, is quite a challenge, particularly when the data is in graph form.
We’d also just love general user feedback: hey, this thing doesn’t do what I need it to do, or, I have this question that I can’t answer with GUAC, how do I do that? Those are some of the most helpful conversations we ever have.
Any additions into GUAC, we’d really love. We highly encourage folks to submit pull requests and add the features that they’d like to see.
But I think it’s really going to start with making this much more performant at a really large scale, which is a significant engineering problem if anyone’s interested.
The more ingestion, the merrier. There are all sorts of different things we could be ingesting, putting into that graph, and amending the data with that we’re not today, and we’d love some creative thoughts and help on the code for that as well.
Tim Miller is a technical leader with over 20 years’ experience in the financial industry. He has led engineering efforts on critical trading systems and has been focusing on the supply chain security problem for over 10 years at some of the world’s most secretive companies. Outside of work he loves spending time with his family, cooking, and balancing hard work with being a goofball. It’s a hard balance.