Using a package manager to install dependencies involves defining a manifest file that declares all the dependencies that the application requires. For Python projects, this can be done with setup.py, pyproject.toml or requirements.txt files. For example, here is an excerpt from the declared install requirements of the OpenAI baselines project:
Here we see seven dependencies explicitly declared, with version restrictions applying only to the `gym` package. For the other packages, the package manager will resolve the most appropriate version given the Python version, operating system version, operating system architecture and compatibility with other packages.
Obstacles to creating an accurate dependency tree
While package managers make installing packages easy, they do not make package management easy. By package management we mean: (1) keeping package versions up to date, (2) removing packages that are no longer used, and (3) identifying issues in the package versions currently in use and flagging them to the developer.
As a result, much of the onus falls on developers, engineers and DevOps teams to ensure that manifest files are regularly updated. However, this can often go wrong in one of the following ways:
Adding a dependency to the virtual environment but not updating the manifest file
Python makes it easy to install a dependency into a project's virtual environment, or directly into global scope, by simply invoking the `pip install <package_version>` command, without having to update any manifest file. In addition, dependencies might be provided by the host operating system, and those automatically become part of Python’s library search path. This is unlike Java, where Maven or Gradle is used to build and run the application and therefore needs every dependency to be declared in order to compile the code and add it to the runtime classpath. As a result, if a package has been added to the virtual environment and used without updating requirements.txt, the Python runtime will not complain and will simply continue running the application. What makes this situation worse is that, instead of updating the manifest file, engineers often update the platform with the new packages directly, and the only documentation of this might exist in a README file or be buried deep in a Slack conversation. The build is then not easily reproducible in new environments where the packages are not all pre-installed.
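As a rough illustration of this drift, the sketch below compares the names declared in a requirements.txt with the distributions installed in the current environment. It uses only the standard library. The helper names (`declared_names`, `installed_names`, `undeclared`) are invented for this example, and the parsing is an assumption that only covers simple lines such as `name` or `name>=1.0`, not URLs or editable installs.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a distribution name (lowercase, runs of -_. become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_names(requirements_text: str) -> set[str]:
    """Extract package names from simple requirements.txt lines."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def installed_names() -> set[str]:
    """Names of the distributions visible to the current interpreter."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            names.add(normalize(name))
    return names


def undeclared(requirements_text: str, installed: set[str]) -> set[str]:
    """Distributions that are installed but not named in the manifest."""
    return installed - declared_names(requirements_text)
```

Calling `undeclared(text, installed_names())` returns both transitive dependencies and packages that someone installed by hand, so the result is a list of candidates to review rather than a verdict.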
Allowing the platform to provide packages and their versions
With Python being the dominant platform for all forms of AI development, packages such as TensorFlow and PyTorch are often required by an application. However, choosing the specific version of tensorflow or torch is often left up to the platform where the application is running. This is because the version and the setup of the package are highly dependent on the platform and (if using Anaconda) the conda environment. If we look at the OpenAI baselines project's manifest that we showed earlier, we see no mention of tensorflow. However, the README file for the project indicates that anyone using the project must install tensorflow. In the case of provided dependencies such as this, it is by design that the package manifest and the actual dependency set are out of sync.
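One way to make a provided dependency visible is to ask the environment which distribution, if any, satisfies it. The sketch below does this with `importlib.metadata`. The candidate list is an assumption for illustration, since the distribution that provides tensorflow can differ between platforms, and the function names are hypothetical.

```python
from importlib import metadata

# Assumed candidates: distribution names that could satisfy an
# "install tensorflow" instruction on different platforms.
TENSORFLOW_CANDIDATES = ["tensorflow", "tensorflow-macos", "tensorflow-cpu"]


def provided_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_provided(candidates):
    """Map each installed candidate distribution to its version."""
    found = {}
    for name in candidates:
        version = provided_version(name)
        if version is not None:
            found[name] = version
    return found
```

Running `find_provided(TENSORFLOW_CANDIDATES)` on the target platform shows what the platform actually supplies, which is exactly the information the manifest leaves out.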
Removing packages that are no longer used
Unlike Go, where the package manager and the compiler are tightly intertwined and a package that is no longer used in the code also has to be removed from the manifest file, Python has no such requirement. This can lead to manifest file bloat and to declared dependencies on packages that are no longer used.
Direct usage of transitive dependencies is allowed
Newer build and dependency management tools such as Bazel, which are not tied to one language or ecosystem, restrict an application to the dependencies it has directly declared, thereby disallowing direct usage of transitive dependencies. Python does not impose such restrictions. Applications have direct access to any package that has been installed in the virtual environment or in global scope, irrespective of whether it was declared as a direct dependency.
So what do the intricacies of Python’s dependency management system mean for the SCA tools world?
Traditional SCA tools rely on the Python manifest file to resolve dependencies. However, as highlighted above, such an approach can be problematic. This can lead to the SBOM being incorrect and, consequently, to compliance issues and reduced trust among downstream consumers.
Not having an accurate picture of what is being used is a major obstacle to flagging which vulnerabilities affect an application. This limits the utility of the SCA tool being used while simultaneously providing the engineering/AppSec team with a false sense of security. While false positives may create noise, false negatives are more dangerous, since they allow potential incidents to occur.
Dependencies that are declared in the manifest but not used generate unnecessary noise. This increases overhead for the engineering team, which has to deal with the false positives, confirm that a package is indeed not being used, and then justify that finding to security peers who often rely on the reporting from their security tooling, such as SCA.
Consequences of poor dependency management
So, how do we fix this? The traditional answer that most SCA vendors would give an organization is to ask them to fix their dependency management so that a clean scan can be performed. However, fixing it is often extremely complex, because most of the institutional knowledge about which package is used, why it is used and where it is used sits in Slack channels or other internal communication tools and is not explicitly documented. Having a DevOps team track all this down without knowing the intricacies of the application is a recipe for disaster. We here at Endor Labs believe that there is a better way to do dependency resolution.
If we revisit the OpenAI baselines project and do what every other SCA tool does, we would see the following dependency tree:
However, we know that this project requires tensorflow as a provided dependency, but this is not something that the SCA tool picks up. We use our static analysis framework to process the source code of the application to understand which packages are being imported and used. For example, here is an excerpt of imports from one of the files in the baselines project:
We see clearly that tensorflow is used, and that this is not one of those cases where the project states that it needs a package but does not use it. By tracking the direct imports of the application and connecting them to the file that declares the class/method in the virtual environment or in global scope, we are able to recover the first-level (direct) dependencies of the application. We then recursively traverse all the files of these dependencies to find their dependencies (the application's transitive dependencies) until we have traversed the entire dependency tree. With our approach we see the following dependency tree:
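The general idea of connecting an import to a file and then walking outward can be sketched with the standard library alone. This is not the Endor Labs implementation, only a minimal illustration that assumes Python 3.10 or later and assumes the top-level import names have already been extracted from the source. The names `locate` and `reachable`, and the `imports_of` callback, are hypothetical.

```python
import importlib.util
from collections import deque
from importlib import metadata


def locate(module_name, dist_index=None):
    """Return (origin path, distribution names) for a top-level module.

    Returns None when the module cannot be found in the environment.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return None
    if dist_index is None:
        # Maps top-level import names to the distributions providing them.
        dist_index = metadata.packages_distributions()
    return spec.origin, sorted(set(dist_index.get(module_name, [])))


def reachable(roots, imports_of):
    """Breadth-first walk over imports, starting from the root modules.

    imports_of(module) must return the modules that `module` imports,
    for example by parsing the file that locate() resolved.
    """
    seen = set(roots)
    queue = deque(roots)
    while queue:
        module = queue.popleft()
        for dep in imports_of(module):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

Separating the lookup (`locate`) from the traversal (`reachable`) keeps the walk independent of how imports are discovered, so the same loop works whether `imports_of` parses files on disk or reads a precomputed graph.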
What we see here is that tensorflow is indeed a dependency and that the version being used is macOS-specific, version 2.13.0. Furthermore, with this dependency tree we also see other packages that are directly depended on (despite being brought in transitively by tensorflow), such as werkzeug, six and others. This shows that our approach is able to overcome the first, third and fourth challenges to Python dependency resolution that were outlined before. The second challenge is something we address with our reachability analysis.
Overall, what we have seen here is that dependency resolution for Python is non-trivial and that doing it correctly requires more than a reliance on manifest files. Compared with the approach of traditional SCA vendors, our approach, which relies on static analysis of the code, results in a more complete resolved dependency tree. Software asset inventory has been a critical security control for decades, but without proper dependency resolution and visibility, organizations can’t protect what they don’t know is being used. If there is one main message that all readers should take away from this blog post, it is: Declaration (or lack thereof) of dependencies != actual usage of dependencies.