
Evaluating and Scoring OSS Packages

Written by George Apostolopoulos, Engineer at Endor Labs
Published on June 4, 2024

The explosion of open source software (OSS) has created an abundance of options. As everyone who has used Amazon knows, abundance can sometimes be a bad thing. One key problem in the context of OSS is that of curation: it is important to select “good” OSS dependencies and avoid “bad” ones. And again, similar to Amazon, the popularity of OSS has also increased opportunities for various actors to create intentionally bad, i.e. “malicious”, OSS packages in an attempt to compromise an organization.

In this article we will look a bit deeper into what it means to tell “good” from “bad” dependencies — because it’s going to vary depending on each unique situation — and how we can go about solving this problem. Even though it looks pretty straightforward at a high level, it gets complicated if one looks at the details. 

The Curse of Abundance  

First, we should clarify that the analogy with buying products on Amazon only goes so far. Unlike Amazon products, where we typically do not know how the physical product looks and feels, there is a lot of information about OSS packages, including their source code, who else is using them, and so on. When dealing with an Amazon product, the best bet is to look at other customers’ reviews to try to tell “good” from “bad” products. This “user rating” model has not taken off much in the OSS packages space, possibly because there is no single Amazon-like marketplace for OSS packages. The OSS world is fragmented across different ecosystems that are fairly independent in their practices, toolsets, and ways of communicating.

Another curation model would be that of consumer reports, where a trusted third party evaluates various products and makes a recommendation. This has happened to some degree for OSS packages with Google Assured Open Source Software, but it is not yet broad enough to provide an industry-wide solution, although it may eventually get there. A similar industry-wide effort is the OpenSSF Scorecard. This model is similar to the “nutritional label” model used in the food industry, where a somewhat detailed list of ingredients along with relevant nutritional data (like how much of a given nutrient exists in the food) is provided so that consumers can make educated choices. The nutritional label model is even more relevant for OSS since it is an example of how regulations can enforce certain solutions. Recent software supply chain attacks have drawn a lot of regulatory scrutiny on OSS that has already produced multiple recommendations and requirements.

Since there is no widely accepted solution, several vendors (including Endor Labs) offer their own system of evaluation of OSS packages. Unfortunately, as in the case of food, the list of ingredients can get very complicated, making it pretty hard to make decisions based on this information. In order to understand these offerings, we need to understand much more about how we evaluate OSS dependencies. 

Security and Development Definitions of “Good OSS”

The first complication is that what is “good” or “bad” is in the eye of the beholder, and in our case the two main groups of beholders are the security teams and the developers. Developers are the main consumers of OSS, but the security teams are typically tasked with handling the risk that comes through the use of OSS. Unfortunately, the priorities of each group are somewhat different.

A security team’s main focus is to reduce the risk that comes from OSS, while developers typically care about the functionality and features of the software. A package that developers find very “good” may be pretty “bad” in terms of risk as defined by the security team. For this article, we mostly focus on the security team perspective.

Operational vs Security Risks

Security’s main goal is to reduce risk, so we would expect that a “good” OSS package is one with “low” risk. Risk has been extensively studied in many contexts and has multiple definitions. When it comes to OSS, a typical approach is to separate risk into operational and security risk. We define these categories as follows:

  • Operational risks can endanger software reliability or increase the efforts and investments required to develop, maintain, or operate a software solution.
  • Security risks can result in the compromise of system or data confidentiality, integrity, or availability.

An example of operational risk around an OSS package is when a package is not maintained. When choosing an unmaintained package, we accept the risk of not being able to get support from the community for fixing certain future issues. As you can expect, there is no single definition of operational risk; multiple types of issues can be considered operational risk, from packages with inappropriate licenses to protestware. 

Typically, security risk means that an OSS package has a known vulnerability; using this package exposes us to the risk of someone using the vulnerability to attack the organization. Again, lots of other issues can be considered part of security risk, from bad security practices when developing the code (e.g. lack of signed commits) to using other dependencies that themselves carry high security risk.

To learn about the top risks, check out OWASP Top 10 Risks for Open Source Software.

Elements of an OSS Evaluation Solution 

To construct a solution that can evaluate whether an OSS project carries risk, there are several elements to consider:

  • Data sources
  • Facts about the packages
  • Types of risk
  • Scores vs policies

Data Sources 

The first step in solving the problem is how to map the ingredient list of OSS packages to operational and security risks. But before we can do that, we need to make sure the ingredient list is complete and accurate. We can derive information from multiple sources, including:

  • The source code of the package: Usually hosted in a software development platform like GitHub or downloadable from a package manager. With access to the source code, we can retrieve valuable information like licenses, secrets, and the results of SAST and other types of analysis.
  • The CI/CD pipelines used to build the package: Pipeline configurations reveal a lot of information about the tools and policies used when building the package releases and the development hygiene of the people developing the package.
  • The configuration of the development environment: For example, the policies that control how developers contribute code in public GitHub repositories in terms of authentication (e.g. a 2FA requirement), the ability to merge PRs without review, and so on.
  • The development activity that goes into the package: In the form of commit, issue, PR, and release activity.
  • Package stats from the package manager as well as the source repository: This includes numbers of downloads, number of GitHub stars, clones, etc. 
  • The dependencies for each package: Here Endor Labs leverages its powerful analysis capabilities to get accurate lists of direct and transitive dependencies.  
  • Social aspects of package development: Who are the developers contributing to a project? Which other projects do they contribute to? And so on.
  • Public and proprietary vulnerability databases: To determine if any known vulnerabilities affect a specific package. Endor Labs has invested a significant amount of R&D into creating a proprietary, highly curated, and enriched vulnerability database. 
  • Discussions about the package: Conversations in various public forums 
  • Outputs of other security products: Results from scans run on the package (e.g. Dependabot PRs or GitHub security event logs) 

This is an extraordinary amount of information, which is great because it provides lots of visibility. The next challenge is how to reduce all this detailed information into some easy-to-consume representation of risk.

Summarizing Through Facts

With a comprehensive ingredient list in place, the next step is to summarize the raw information. Endor Labs’ approach is to extract various types of “facts” from all the data. These facts are simple, deterministic, and easy to prove. A good example of an easily verifiable and indisputable fact is “there were 10 commits to the GitHub repository of this package in the last month.” There are lots and lots of such facts that we can collect from the underlying data. The Endor Labs team makes sure that each fact has enough information associated with it: a description of what it means, details of how it was computed, and the related evidence.
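
As a rough illustration of what such a fact could look like in code, the sketch below models a fact together with its description, computation method, and evidence. The `Fact` class, its field names, and the `count_recent_commits` helper are hypothetical and only meant to make the idea concrete; they are not Endor Labs' actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Fact:
    """A simple, verifiable statement about a package (hypothetical schema)."""
    name: str             # short identifier for the fact
    value: int            # the measured quantity
    description: str      # what the fact means
    method: str           # how the value was computed
    evidence: tuple = ()  # raw items that back the value


def count_recent_commits(commit_times, now, days=30):
    """Build a 'commits in the last N days' fact from commit timestamps."""
    cutoff = now - timedelta(days=days)
    recent = tuple(t for t in commit_times if cutoff <= t <= now)
    return Fact(
        name=f"commits_last_{days}_days",
        value=len(recent),
        description=f"Number of commits in the last {days} days",
        method="count of commit timestamps that fall inside the window",
        evidence=tuple(t.isoformat() for t in recent),
    )


# Made-up commit history: commits 1, 5, 12, 45 and 200 days before `now`.
now = datetime(2024, 6, 4, tzinfo=timezone.utc)
commits = [now - timedelta(days=d) for d in (1, 5, 12, 45, 200)]
fact = count_recent_commits(commits, now)
print(fact.name, fact.value)
```

Keeping the evidence next to the value is what makes the fact easy to verify: anyone can check the listed commits against the repository.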

These facts are not always trivial to compute; we will revisit this in a second. The facts themselves are basic information about what is going on with the package which anyone can figure out by poking around in the GitHub of a package, the package manager, or searching on the internet. But this is a lot of work to do manually. Endor Labs just computes all these facts automatically, and does so continuously to keep data up to date as projects change. 

Having a detailed granular list of facts is of course good, but once you have it you need to be able to make some sense of this information, otherwise it just becomes information overload. 

Risk and Types of Risk 

And this leads to the bigger challenge: how can we map these facts to the risk of an OSS package? To solve this, we need to add some more detail to the various types of risk. At Endor Labs, we use the subcategorization below to reason about risk.

Operational Risks 

  • Activity: How actively developed a package is. The underlying assumption is that packages that are actively developed are less likely to become abandoned or unsupported over time.
  • Popularity: How widely a package is used. The assumption here is that more popular packages are again less likely to be abandoned over time. This is in a sense an approximation of the “user rating” model.
  • Licensing: Legal risk of using software with licenses that do not align with the organization’s policies.
  • Dependencies: Some of the risk comes through the dependencies, e.g. having unused dependencies or having too many dependencies. In many cases, importing very large and complex packages can also bring additional risk in the form of potential vulnerabilities or lots of transitive dependencies.
  • Code quality: Potential issues with the code both in terms of performance and bugs. This includes information about whether maintainers are using tools for tracking code coverage, the amount of tests found in the source code, and so on.

Security Risks

  • Vulnerabilities: Risk from known vulnerabilities in the package. Vulnerabilities can have different scores and severity, and may or may not have a known fix.
  • Security related practices: This covers best practices in the use of the source code management system, for example requiring reviews from multiple developers before merging, using 2FA, limiting the creation of public repositories, having proper contacts for reporting security issues, and so on.
  • Use of security related tooling: Whether package developers use security scanning tools like SAST, DAST, or fuzzers.
  • Malicious code: Cases where the package may have been created as a supply chain attack vector that will attempt to compromise developers when they import it.
  • Secrets: Accidentally leaving secrets in the source code is a pretty common security risk that can have significant fallout since it allows attackers to easily compromise various internal services. 

This is by no means a complete list. It is always possible to extend or add more granularity, but in our experience this is a solid base to build upon. 

Now that we have introduced this structure, things become more clear. 

  • Each of the facts we compute will map to one (or more) of the above risk sub-categories. 
  • A fact can have a positive or negative implication for the corresponding risk. For example, a lot of recent commits is a positive fact when we consider the “activity” for a package, while having unfixed vulnerabilities for a very long time is a negative for the “security” of a package. 
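
The mapping described in these two points can be sketched as a small lookup table. The category names follow the lists above, but the fact names, the `FACT_CATALOG` table, and the +1/-1 encoding of a positive or negative implication are assumptions made for illustration only.

```python
from collections import defaultdict
from enum import Enum


class Category(Enum):
    """Risk sub-categories taken from the lists above."""
    ACTIVITY = "activity"
    POPULARITY = "popularity"
    LICENSING = "licensing"
    DEPENDENCIES = "dependencies"
    CODE_QUALITY = "code quality"
    VULNERABILITIES = "vulnerabilities"
    SECURITY_PRACTICES = "security related practices"
    SECURITY_TOOLING = "security related tooling"
    MALICIOUS_CODE = "malicious code"
    SECRETS = "secrets"


# Hypothetical catalog: fact name -> [(category, implication)].
# +1 means the fact is a positive signal, -1 a negative one.
FACT_CATALOG = {
    "many_recent_commits": [(Category.ACTIVITY, +1)],
    "long_unfixed_vulnerabilities": [(Category.VULNERABILITIES, -1)],
    "secrets_in_source": [(Category.SECRETS, -1)],
    # One fact may map to more than one sub-category.
    "requires_pr_review": [
        (Category.SECURITY_PRACTICES, +1),
        (Category.CODE_QUALITY, +1),
    ],
}


def group_by_category(observed_facts):
    """Group observed fact names by sub-category and implication."""
    grouped = defaultdict(lambda: {"positive": [], "negative": []})
    for name in observed_facts:
        for category, implication in FACT_CATALOG.get(name, []):
            key = "positive" if implication > 0 else "negative"
            grouped[category][key].append(name)
    return dict(grouped)


summary = group_by_category(
    ["many_recent_commits", "long_unfixed_vulnerabilities", "requires_pr_review"]
)
for category, facts in summary.items():
    print(category.value, facts)
```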

Building a comprehensive set of facts can benefit from more data. For example, Endor Labs implements secret and license detection which we then can use to drive the fact computation. In order to track vulnerabilities, we developed a proprietary enriched vulnerability database that contains curated and enriched vulnerability information. Furthermore, Endor Labs scans the source code of OSS packages so we can detect bugs and other code issues as well as find patterns that may indicate that a package is malicious. 

After this analysis, we summarize all the raw data into a smaller, well defined set of facts and map these facts to one or more of the risk sub-categories. This is a lot of progress, but the problem now becomes how to act on this information. 

The Problem with Package Scores

The instinct of every security product out there is to compress all this information into a score. Having a single numerical score is convenient since it provides a simple summary and allows for comparisons. 

But a key limitation of a single numerical score is that it compresses too much: a single score can not really capture effectively all the nuances of the underlying information. You can view it as mapping a complex multi-dimensional space into a single line; there is a lot of information loss involved as information is weighted and averaged in multiple ways. This invariably results in two questions from customers:

  • How do you compute the score?  
  • Can I change the formula you use since your formula prioritizes things differently than what I have in mind?

These are legitimate questions: anyone who wants to use these scores must understand how they are built, and the assumptions built into the score computation will rarely match the priorities of the organization.
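
A small numeric sketch shows why both questions come up. It assumes the single score is a weighted average of sub-scores on a 0 to 10 scale where higher is better; the formula, the weights, and the package numbers are made up for illustration and are not any vendor's actual scoring method.

```python
def overall_score(sub_scores, weights):
    """Weighted average of sub-scores (an assumed way to build one score)."""
    total_weight = sum(weights.values())
    return sum(sub_scores[k] * weights[k] for k in weights) / total_weight


# Two made-up packages with very different risk profiles.
package_a = {"activity": 9, "popularity": 9, "quality": 8, "security": 2}
package_b = {"activity": 7, "popularity": 7, "quality": 7, "security": 7}

# With equal weights, both packages collapse to the same number.
equal_weights = {"activity": 1, "popularity": 1, "quality": 1, "security": 1}
score_a = overall_score(package_a, equal_weights)
score_b = overall_score(package_b, equal_weights)
print(score_a, score_b)

# An organization that weighs security three times as heavily sees a
# different ranking from the same underlying data.
security_heavy = {"activity": 1, "popularity": 1, "quality": 1, "security": 3}
print(overall_score(package_a, security_heavy),
      overall_score(package_b, security_heavy))
```

With equal weights the two packages are indistinguishable even though one has a very weak security sub-score, and the ranking depends entirely on weights that the consumer of the score usually does not control.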

A Better Way to Evaluate OSS: The Endor Score

At Endor Labs we compute scores but rather than compressing all that information into a single score, we compute sub-scores across four categories: Activity, Popularity, Quality, and Security. We made the decision to expose scores because it can be helpful in initial stages of research by a developer or open source program office (OSPO) that’s trying to select a package. 

While there will always be a loss of context with any score, focused sub-scores make the information compression a little bit more manageable. 

  • Security: Packages with poor security scores can be expected to have a larger number of security-related issues (not limited to CVEs) than packages with better scores.
  • Activity: Packages that are active are presumably better maintained and are therefore more likely to respond quickly to future security issues.
  • Popularity: Widely used packages tend to have the best support and are more likely to have latent security issues discovered and reported by the community.
  • Quality: Using best practices for code development will help you align with standards including PCI DSS and SSDF.

Figure: Endor Score example

Using Policies to Block Based on Facts

Security teams do not need to rank OSS packages based on their risk scores. More often, the security team will want to block high risk packages from being used in the organization and flag already-in-use, high-risk packages so that they can be removed/replaced. These objectives can be accomplished by setting up a threshold policy for when a package is “high risk” so packages with risk scores beyond the threshold are automatically flagged. On the other hand, given the amount of information compression that has gone into computing the score, this may not be the best strategy; blocking a package may, at times, require more nuance. 

Endor Labs supports blocking packages based on policy thresholds using the Endor Score factors and we also support more-nuanced policies based on combinations of facts that are important indicators of risk for an individual customer. For each OSS package, we expose the facts and we allow users to build policies that determine if a package should be blocked based on its facts. This allows users to directly express their preferences and priorities without going through the loss of accuracy of a score. 
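
The difference between the two kinds of policy can be sketched as follows. This is a minimal illustration, not the Endor Labs policy engine or its syntax: the `Package` class, the fact names, the policy functions, and the assumption that sub-scores run from 0 to 10 with higher being better are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Package:
    name: str
    scores: Dict[str, float]  # sub-scores, assumed 0-10, higher is better
    facts: Dict[str, object]  # fact name -> value


# A policy returns True when the package should be blocked.
Policy = Callable[[Package], bool]


def score_threshold_policy(category: str, minimum: float) -> Policy:
    """Block when a sub-score falls below a threshold."""
    return lambda pkg: pkg.scores.get(category, 0.0) < minimum


def abandoned_and_vulnerable(pkg: Package) -> bool:
    """Block on a combination of facts: no commits in the last year AND
    at least one critical vulnerability without a fix."""
    return (
        not pkg.facts.get("has_commits_last_year", False)
        and pkg.facts.get("unfixed_critical_vulnerabilities", 0) > 0
    )


def blocked_by(pkg: Package, policies: Dict[str, Policy]):
    """Return the names of all policies that block the package."""
    return [name for name, policy in policies.items() if policy(pkg)]


policies = {
    "security score below 5": score_threshold_policy("security", 5.0),
    "abandoned with unfixed critical vulnerability": abandoned_and_vulnerable,
}

# Made-up package: an acceptable security sub-score hides a bad combination.
pkg = Package(
    name="example-lib",
    scores={"security": 6.5},
    facts={"has_commits_last_year": False, "unfixed_critical_vulnerabilities": 2},
)
print(blocked_by(pkg, policies))
```

In this made-up case the score threshold lets the package through, while the fact-based policy blocks it for a reason the security team can read directly.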

We believe this is a more granular and valuable approach, and we see this tactic becoming more popular in other OSS package scoring systems; OpenSSF Scorecard did something similar very recently with their beyond scores extensions.

To summarize the Endor Labs way for OSS package scoring:

  • Compute a well thought out set of facts about OSS packages
  • Map facts to well-defined risk subcategories
  • Be very transparent about the facts and the security sub-categories they map to 
  • Allow users to specify preferences and priorities through policies over these facts

Digging Deeper and Some Hard Questions 

The above was a very high-level description but there are quite a lot of details that one has to keep in mind when evaluating OSS packages. 

Fact Computation 

Facts about OSS projects seem simple to compute but they are definitely not trivial. There are a lot of traps that need to be avoided. 

At Endor Labs, we are very careful to avoid these pitfalls: 

  • Proper event selection: For example, when counting commits to a repository as a proxy of development activity, it is a good idea to count separately or ignore commits from bots (e.g. Dependabot).
  • Avoiding biases: For example, when counting SAST findings in a repository, the absolute number of these findings does not mean much; larger codebases will likely have more findings compared to smaller projects. It makes more sense to compute the number of SAST findings per LOC or some other normalized metric.
  • Proper threshold values: When counting “recent” commits, “recent” can be “last month” or “last year”. In some cases these thresholds do not really have an ideal value. In the above example, it is better to expose two facts, “has commits in the last month” and “has commits in the last year”, so that users can select what is important to them. In other cases, having statistical thresholds makes more sense. When we want to say “a package has many downloads” it is better to define “many” as “only 10% of all packages have as many downloads” instead of “it has more than 10,000 downloads.”
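
The three pitfalls above can be sketched in a few lines. The function names, the `"[bot]"` author-suffix heuristic, and the sample numbers are assumptions for illustration; a real implementation would need a more careful way of identifying bot accounts.

```python
def human_commit_count(commits, bot_suffix="[bot]"):
    """Event selection: count commits whose author does not look like a bot.
    Matching on a '[bot]' suffix is a simplifying assumption."""
    return sum(1 for c in commits if not c["author"].endswith(bot_suffix))


def findings_per_kloc(findings, lines_of_code):
    """Avoiding bias: normalize SAST findings per 1,000 lines of code."""
    return 1000 * findings / lines_of_code


def share_with_at_least(package_downloads, all_downloads):
    """Statistical threshold: fraction of all packages that have at least
    as many downloads as this one."""
    at_least = sum(1 for d in all_downloads if d >= package_downloads)
    return at_least / len(all_downloads)


commits = [
    {"author": "alice"},
    {"author": "dependabot[bot]"},
    {"author": "bob"},
]
print(human_commit_count(commits))

# A large codebase with more findings can still be cleaner per line.
print(findings_per_kloc(50, 100_000), findings_per_kloc(10, 2_000))

# "Many downloads" defined as: only 10% of packages have as many downloads.
population = [10, 50, 200, 1000, 5000, 90, 30, 7, 400, 60]  # made-up counts
has_many_downloads = share_with_at_least(5000, population) <= 0.10
print(has_many_downloads)
```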


Another subtle but important aspect is the scope of the fact computation. The more straightforward way is to compute the facts for a particular OSS package in isolation, that is using only information about the package itself. This is the approach taken by Endor Labs. 

There are more options though. OSS packages typically import other packages which in their turn import more packages resulting in a potentially quite complex dependency graph. It is not unreasonable to think that even though the risk of a direct OSS dependency package D is low, it may import dependency T that has a higher risk and this somehow should be reflected in the risk of D. Essentially, the risk flows towards the root of the dependency graph. 
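
To make the idea concrete, here is a minimal sketch of one possible propagation rule: a package inherits the highest risk found anywhere in its transitive dependencies. Taking the maximum is an assumption chosen for simplicity (other rules, such as distance-weighted sums, are equally plausible), the risk values are made up, and this only illustrates the alternative; as noted above, Endor Labs computes facts for each package in isolation.

```python
def propagated_risk(root, own_risk, dependencies):
    """Return the highest risk among `root` and its transitive dependencies.

    own_risk:      package name -> risk computed for the package in isolation
    dependencies:  package name -> list of direct dependencies
    """
    seen = set()
    stack = [root]
    worst = own_risk[root]
    while stack:
        pkg = stack.pop()
        if pkg in seen:  # guard against cycles and repeated visits
            continue
        seen.add(pkg)
        worst = max(worst, own_risk[pkg])
        stack.extend(dependencies.get(pkg, []))
    return worst


# D is a low-risk direct dependency that imports a higher-risk package T.
own_risk = {"app": 0.1, "D": 0.2, "T": 0.8, "U": 0.3}
dependencies = {"app": ["D", "U"], "D": ["T"]}

print(propagated_risk("D", own_risk, dependencies))
print(propagated_risk("U", own_risk, dependencies))
```

In this example the risk of T flows up to D and, through D, to the application at the root of the graph, while U keeps its own risk because it has no dependencies.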

An even more general graph is the social graph of code development, where nodes represent packages and developers and edges connect developers with the projects they contribute to. Developers that contribute to low-risk projects may also be considered low risk and, vice versa, projects that have contributions from low-risk developers can also be considered low risk. These are both instances of risk propagation, a well-known technique on graphs.

While this approach is intellectually interesting, here at Endor Labs we are not big fans. One reason is that these approaches work well with scores but not facts. It is much more straightforward to propagate a single numerical score than a specific fact. As we discussed above, we prefer to base policy decisions on facts and not so much on scores. In addition, if it is already complicated to explain how a score for a single package is derived, imagine how hard it is to explain and understand a score that contains the partially propagated scores of other graph nodes. These complex propagated scores essentially become opaque magic numbers, maybe good for comparisons but otherwise impenetrable.

Predictive Value

The last, but perhaps key, consideration is the real meaning of these facts (and scores). We compute some facts that loosely map to a particular risk subcategory and then we let the users decide what to do with these facts using policies. The implied assumption is that these facts are reasonable proxies for risk or, as they are often called in the industry, leading indicators of risk. But are they?

It is not unreasonable to expect that these facts will have a strong correlation with a specific risk outcome. For example, “these facts mean that the repository is abandoned,” or “these facts mean that the repository has a higher chance of having vulnerabilities discovered in the future.” We select facts to reflect our intuition that they do correlate with these outcomes, but did we prove such correlation? One way to view this is as a machine learning problem. The facts can be seen as features computed over the raw data and the correlation with a specific outcome as either a classification (“repository is abandoned”) or regression problem (“the probability of discovering vulnerabilities in the repository is x”). Formulated like this, there are very rigorous ways to measure the predictive power of the resulting classifiers or regressors. That is, if one has enough data. 
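
As a sketch of what measuring predictive power means in the classification case, the snippet below treats a single fact as a predictor of the outcome “repository is abandoned” and computes precision and recall against known outcomes. The labeled rows are toy values invented purely to show the mechanics; they are not real measurements and say nothing about how predictive this fact actually is.

```python
def precision_recall(predictions, outcomes):
    """Precision and recall of boolean predictions against boolean outcomes."""
    pairs = list(zip(predictions, outcomes))
    tp = sum(1 for p, o in pairs if p and o)      # predicted and happened
    fp = sum(1 for p, o in pairs if p and not o)  # predicted, did not happen
    fn = sum(1 for p, o in pairs if not p and o)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# TOY DATA, invented for illustration only.
# Each row: (has_commits_last_year, repository_was_later_abandoned)
history = [
    (False, True),
    (False, True),
    (False, False),
    (True, False),
    (True, False),
    (True, True),
]

# Classifier under test: "no commits in the last year" predicts abandonment.
predictions = [not has_commits for has_commits, _ in history]
outcomes = [abandoned for _, abandoned in history]

precision, recall = precision_recall(predictions, outcomes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The mechanics are simple; the hard part is obtaining enough trustworthy labeled outcomes to make numbers like these meaningful.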

Unfortunately, it is quite hard to build robust datasets for the risk cases we care about. For example, building a good dataset of repositories in order to study whether using fuzzing results in fewer vulnerabilities over time is hard. It is even harder to collect datasets for rare events, like supply chain attacks. So, building a model to check whether enabling signed commits correlates with a lower chance of a successful supply chain attack is almost impossible. We are always evaluating whether there are cases where we can establish the predictive power of the facts we compute, but we do not have any strong results right now. There have been only a few related studies so far, again with inconclusive results.

Does this lack of robust proof of the predictive power of the OSS package scores and facts make them useless? We believe the answer is “no”. There is a lot of utility in collecting and organizing the OSS package information, and allowing users to act on it through simple yet powerful policies. The feature extraction itself (the computation of facts over the raw data) replaces a lot of manual effort and provides valuable visibility. Writing policies allows the security team to leverage their experience and intuition and make the final decision on risk. 

Build an Open Source Governance Program with Endor Labs

As part of the strategy to address — and prevent — vulnerabilities, organizations implement initiatives (like OSPOs) to govern the selection process. Whether you’re running an OSPO or want to informally start governing OSS selection, you’ll need a tool that automates the process and aligns with developer needs.

Endor Labs Open Source includes the Endor Score (in addition to SCA, artifact signing, SBOM and VEX generation, and more), which supports developer-friendly OSS governance through two methods:

  • Pre-merge checks: Set policies that prevent risky dependencies from being merged into your projects.
  • AI-assisted OSS selection: Developers can ask DroidGPT for alternatives to risky projects, with responses scored based on project health.

To compare results against your current tool, we offer a free, full-featured 30-day trial that includes test projects and the ability to scan your own projects.
