Software composition analysis (SCA) approaches typically rely on a package manager to identify the direct dependencies of an application. But SCA falls apart when the environment does not include a package manager. This is particularly true for older languages, like C and C++.
When developers work with C and C++, the prevalent way of reusing third-party code is to copy it into the repository, either verbatim, selectively (i.e., portions of a file), or with modifications (e.g., copying a file and modifying its include headers). Unless developers maintain a precise log of the exact versions and source locations, generating a precise software bill of materials (SBOM) is practically impossible.
This is particularly problematic in C/C++ projects: code written in these languages is a frequent target of security attacks, so precisely identifying vulnerable components is crucial for corrective maintenance. Beyond security, copy-based reuse also carries licensing risks. For example, if the copied code is under a copyleft open source license (e.g., GNU GPLv2) and the project is distributed as a binary without meeting that license's obligations, the project may be in breach of the license.
Detecting dependencies
For C/C++, SCA has been seen as a search problem, where the task is to find the origin of a given file using an index of source code and then organize origins of multiple files into potential library versions. Existing approaches draw from the fields of code clone detection and semantic search: indices are built by segmenting and analyzing the source code or the abstract syntax tree (AST) and applying various heuristics that correspond to common practices.
In recent years, semantic code search based on text embeddings has been applied to code clone detection, among other tasks. We built on this approach at Endor Labs.
To precisely identify origins of cloned source code, Endor Labs:
- Builds an index of a set of versioned files originating in selected OSS components. Each file is parsed and segmented into four segment types: functions, types, licenses, and unspecified. For each segment, Endor Labs maintains both a hash signature and, uniquely, an embedding of the input source code.
- When analyzing a file, Endor Labs first looks up the file hash signature in the index. If there is no match, it segments the file and compares the segment hash signatures.
- For segments that cannot be matched, it calculates their embeddings and performs similarity matching. The end result is a list of file origins, ranked by their similarity to the file under examination.
- The per-file results are then aggregated to identify potential origin library versions.
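The lookup cascade described in the list above can be sketched roughly as follows. This is a minimal illustration, not Endor Labs' implementation: the `OriginIndex` class, the blank-line segmenter, the whitespace normalization, the character-trigram "embedding" (a stand-in for a learned embedding model), and the 0.8 similarity threshold are all assumptions made for the example. The final step, aggregating per-file results into library versions, is left out.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def signature(text: str) -> str:
    """Hash signature over whitespace-normalized text (assumed normalization)."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()


def split_segments(source: str) -> list[str]:
    """Toy segmenter: blank-line-separated chunks instead of parsed functions/types."""
    return [s.strip() for s in re.split(r"\n\s*\n", source) if s.strip()]


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts, standing in for a learned model."""
    t = " ".join(text.split())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class OriginIndex:
    """Hypothetical index linking file and segment signatures to origins."""

    def __init__(self) -> None:
        self.file_sigs: dict[str, str] = {}                # file hash -> origin
        self.segment_sigs: dict[str, set[str]] = {}        # segment hash -> origins
        self.segment_vecs: list[tuple[Counter, str]] = []  # (embedding, origin)

    def add(self, origin: str, source: str) -> None:
        self.file_sigs[signature(source)] = origin
        for seg in split_segments(source):
            self.segment_sigs.setdefault(signature(seg), set()).add(origin)
            self.segment_vecs.append((embed(seg), origin))

    def lookup(self, source: str, threshold: float = 0.8) -> list[tuple[str, float]]:
        # Step 1: try the whole-file hash signature.
        origin = self.file_sigs.get(signature(source))
        if origin is not None:
            return [(origin, 1.0)]
        segs = split_segments(source)
        scores: dict[str, float] = defaultdict(float)
        for seg in segs:
            # Step 2: try the segment hash signature.
            exact = self.segment_sigs.get(signature(seg))
            if exact:
                for o in exact:
                    scores[o] += 1.0
                continue
            # Step 3: fall back to embedding similarity, best match per origin.
            vec = embed(seg)
            best: dict[str, float] = {}
            for other, o in self.segment_vecs:
                sim = min(cosine(vec, other), 1.0)
                if sim >= threshold and sim > best.get(o, 0.0):
                    best[o] = sim
            for o, sim in best.items():
                scores[o] += sim
        # Rank origins by their average per-segment score for this file.
        total = len(segs) or 1
        return sorted(((o, s / total) for o, s in scores.items()),
                      key=lambda pair: (-pair[1], pair[0]))
```

Ordering the checks from cheapest to most expensive means that verbatim copies never reach the embedding step; only modified segments pay for similarity matching.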
Building an index
The quality of the index is of paramount importance in any search engine, and especially so for accurate SCA. A naive approach, such as indexing all C/C++ repositories on GitHub, will not work because many repos have already cloned other repos into their code base without maintaining origin information. This means that common files may exist as exact or almost exact copies in various locations. Finding the original source is impossible with naive heuristics.
Additionally, C/C++ development predates platforms like GitHub and GitLab by several decades. OSS forges such as SourceForge and various GNU projects have accumulated a lot of source code that is released as compressed archives (tar files) and is not available on GitHub. Finally, a lot of code, especially in the security/cryptography space, is in the public domain and usually hosted on its creators' home pages or in Stack Overflow posts. To build a comprehensive index, those sources also need to be included. We solved this problem by creating an ingestion infrastructure that can index FTP/HTTP release sites, SourceForge, Google Code archives, and even custom web pages given only a project name.
In developing C/C++ SCA for Endor Labs, we took an innovative approach: semi-automatically analyzing 350K repositories on GitHub to identify real origins (i.e., projects without copied source code).
We built a custom annotation dashboard, where a team of human annotators (including all the authors of this post) went over thousands of repositories, manually tagging the origins of files that the dashboard identified as potentially external. The dashboard automatically propagated each human annotation to clones of the same file in other repositories. This enabled us to select a list of repositories to index first by topologically sorting the graph of dependencies among the repositories. It also enabled us to discover non-GitHub repositories that needed to be indexed.
Our index structure effectively links segments with file versions. It is so powerful that our alpha testers have been surprised to hear that they are using libraries they did not know about. If we miss a library, it is usually enough to just index (in minutes!) the corresponding repository/archive, and then Endor Labs will precisely identify the corresponding library version.
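As a rough illustration of the topological sorting step, the sketch below orders repositories so that each one is indexed only after the repositories it copies code from. The `indexing_order` function and the repository names are hypothetical, and the "copies from" map is an assumed representation of the annotated dependency graph; the sort itself uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def indexing_order(copies_from: dict[str, set[str]]) -> list[str]:
    """Return repositories ordered so that origins come before their copiers.

    `copies_from` maps a repository to the repositories whose code it contains
    (hypothetical structure derived from the file annotations).
    """
    # TopologicalSorter treats the mapped values as predecessors, so every
    # origin is emitted before any repository that copied from it.
    return list(TopologicalSorter(copies_from).static_order())


# Hypothetical annotated graph: "app" vendors code from two libraries,
# and "mid-lib" itself vendors code from "origin-lib".
graph = {
    "app": {"mid-lib", "origin-lib"},
    "mid-lib": {"origin-lib"},
    "origin-lib": set(),
}
order = indexing_order(graph)
```

Repositories with no incoming copied code appear first in `order`, which matches the idea of indexing real origins before anything derived from them. A cyclic graph (two repositories that copied from each other) would raise `graphlib.CycleError` and would need manual resolution.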
Benchmarking
To evaluate the performance of Endor C/C++, we manually annotated 21 high-profile OSS projects (a held-out set that we currently do not index), including the Node.js runtime, the OpenCV computer vision library, the Dolphin emulator, the MySQL and Postgres databases, and others. We evaluated performance along two dimensions, precision and recall, against a well-known competitor in the C/C++ SCA space. The results are reported at the level of library matches.
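For readers unfamiliar with the two metrics, here is a minimal sketch of how precision and recall can be computed over library matches for a single project. The `precision_recall` function and the library names are made up for illustration; they are not taken from the benchmark.

```python
def precision_recall(detected: set[str], expected: set[str]) -> tuple[float, float]:
    """Precision and recall of detected libraries against manual annotations."""
    true_positives = len(detected & expected)
    # Precision: what fraction of the reported libraries are really present.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: what fraction of the annotated libraries were reported.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical project: four libraries annotated by hand, three reported by a tool.
expected = {"liba@1.0", "libb@2.1", "libd@0.9", "libe@3.3"}
detected = {"liba@1.0", "libb@2.1", "libc@4.2"}
precision, recall = precision_recall(detected, expected)
```

In this made-up case two of the three reported libraries are correct (precision 2/3) and two of the four annotated libraries are found (recall 1/2).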



The results show that Endor Labs achieves a good balance between precision and recall, while our established competitor seems to have significant detection issues. The result validates our assumption that the quality of the index is what makes or breaks a clone detection tool.
Endor Labs is not perfect yet: we are still tuning the matching algorithms to improve accuracy and reduce false positives, and we will keep discovering repositories to add to the index. Even so, the results are already comparable to the state of the art in the academic literature, an impressive feat for a production system.
Conclusion
SCA is the cornerstone of modern application security programs. Organizations are often required to maintain a software bill of materials (SBOM), which includes all third-party components that the organization’s software uses. Until now this has been a challenge for organizations building applications with C and C++.
Endor Labs’ approach to SCA for C/C++ delivers highly accurate origin and vulnerability detection, even in messy, legacy codebases. If you’d like to learn more, contact our team for a demo.