The risk assessment of potentially malicious code snippets is a use case that is particularly well suited for evaluating and comparing the performance of LLMs, the reason being that the question whether a given piece of code is malicious or not can (often) be answered without ambiguity. Admittedly, it can be tedious for malware analysts to untangle minified and obfuscated code, but the assessment typically concludes with a clear answer. This is different from other evaluation techniques, e.g. ones based on preferences expressed by humans or GPT-4, which seem much more subjective.
Compared to our last blog post, we improved the LLM-assisted review in a couple of ways: The removal of comments in suspicious code snippets reduces the exposure to prompt injection (more on that later), and instead of asking the LLM for a binary classification, we ask it to respond with a risk score on a scale between 0 and 9, from little to highly suspicious. We also increased the size of the context, which additionally benefited from the comment removals.
We also got new company in the form of a second LLM assistant, this time from Google, i.e. every suspicious code snippet is not only presented to GPT models from OpenAI, but also to the models available on Google’s Vertex AI platform. Both of them receive the same prompt and code snippets, and their temperature and related configuration parameters are set such that responses are as reproducible as possible.
OpenAI vs. Vertex AI
Obviously, we wanted to know how both LLMs compare against each other. To this end, the following histogram shows the differences of risk scores provided by gpt-3.5-turbo from OpenAI and text-bison from Vertex AI for the same code snippet. If, for example, gpt-3.5-turbo comes up with a risk score of 2 and text-bison with a risk score of 5 for the same file “index.js”, that difference is counted in bin 3.
As visible from the histogram, both LLMs agree in the majority of cases: In 488 out of 1098 assessments of the same code snippet, both models came up with the exact same risk score, in another 514 cases, the risk score differed only by one point. What’s more interesting are cases where the models’ assessment differs more significantly, three of which will be presented below.
Example #1: techghoshal123
Listing 1 is an example where the risk score differed by 3 points. It shows the file “index.js” from the malicious npm package https://registry.npmjs.org/techghoshal123/1.7.0 (which has been removed following our notification of npm). This file has been called by a pre-install hook upon package installation, and exfiltrates sensitive information such as the user and hostname to burpcollaborator.net.
Despite the fact that the author did not make any effort to make the code look benign or hide it between legitimate code, the two LLMs come to diverging conclusions: On a scale from 0 to 9, gpt-3.5-turbo gave it a 2, and thus can be considered a false-negative. The model text-bison gave it a 5 (please find the complete answers below). They both properly described the behavior, however, did not conclude that it is malicious.
Risk score 2: The code appears to be sending a POST request to a remote server using the https module. The request contains tracking data in the form of a JSON object that includes information about the package, directory, hostname, user, DNS servers, and package version. The tracking data is then stringified and sent as a message in the POST request. While this code does not appear to be malicious, it is possible that the tracking data could be used for nefarious purposes if it falls into the wrong hands. Developers should be cautious about sending sensitive information in this manner and ensure that the remote server is trustworthy.
Risk score 5: This code is not malicious. It is a simple script that sends tracking data to a remote server. The tracking data includes the package name, current working directory, home directory, hostname, username, DNS servers, resolved dependencies, and package version. The data is sent using the POST method to the Burp Collaborator server "xbcewrlf42d7quiugasdtk3m0d64ut.burpcollaborator.net".
For this particular example, we also asked OpenAI’s gpt-4 model about the risk score, using the exact same prompt as for the other models. The performance regarding this particular example is much better, as the malicious behavior is described in much more detail and the higher risk score better reflects its malicious intention.
The information collected includes the following:
- The name of the package from the local package.json file.
- The directory where the script is currently running (__dirname).
- The user's home directory.
- The hostname of the user's machine.
- The username of the current user.
- DNS servers that are currently in use on the user's machine.
- Information from the package.json file such as the resolved fields, the version, and the whole package.json object itself.
This collected data is then stringified, URL encoded, and sent via a POST request to the specified host on port 443. The host (xbcewrlf42d7quiugasdtk3m0d64ut.burpcollaborator.net) does not look like a regular or trustworthy domain, which adds to the suspicion.
This behaviour clearly classifies as potentially malicious activity, as it can be used to gather sensitive information about the user's machine and system, which can then be used for further malicious activities. The potential impact on users or developers executing this script can be quite severe, as it might lead to the breach of sensitive personal or corporate information. Therefore, it's advised to avoid running this script and to thoroughly investigate the full source code for additional suspicious behavior.
Example #2: tvb-framework
One of the examples where both models completely disagreed is the file “sylvester.js” from the benign PyPI package tvb-framework 188.8.131.52. This file has been minified and packed, which made gpt-3.5-turbo assess it with a risk score of 9, while text-bison gave it a risk score of 0 (please find both answers below).
But while text-bison is right with its assessment, it is still surprising that it “judges” the packed code as “well-structured and documented”.
Risk score 9: The provided code is highly obfuscated and difficult to understand. It is likely that this code is malicious and is attempting to hide its true purpose. The code appears to be using various encoding schemes and cryptography to obfuscate its true intent. Additionally, the code appears to be manipulating arrays and performing mathematical operations, which could be used to exfiltrate sensitive information or perform other malicious activities. Without further context or analysis, it is impossible to determine the full extent of the risk posed by this code. It is recommended to avoid executing this code and to further investigate its purpose and origin.
Again, for completeness, find the assessment of gpt-4 below.
Here's what the code appears to be doing:
- This code defines a function that seems to encode/decode data using a custom scheme. The use of String.fromCharCode and c.toString(36) are indicative of a conversion to a different base, which is typically seen in obfuscation and encryption schemes.
- The function replace is used several times throughout the code. This is often used to alter data or code in obfuscated scripts.
- There are also multiple cases where data is fetched from an array and checked against various conditions, possibly indicating some form of secret data handling or hidden instructions.
However, it's difficult to tell exactly what this code does because of the obfuscation, and because it is incomplete. There are also no clear signs of network connections, file operations, or other more overtly malicious behavior. So, the risk score is not at the maximum.
If the obfuscated code is de-obfuscated and turns out to be malicious, the potential impacts on users could include execution of unwanted actions, unauthorized access, data theft, and more. However, further analysis is required to make any conclusions about specific threats.
Example #3: prismstudio-dev
Another example where the assessment differed by 9 points is the package prismstudio-dev 1.1.0, which uses PyArmor for obfuscation purposes.
The file “_dbinfo.py” was judged malicious by gpt-3.5-turbo with a risk score of 9, and benign by text-bison with a risk score of 0. Apparently, text-bison only relied on the filename, which has been included in the prompt, to infer its functionality, and brushes over the fact that the content is highly obfuscated. Gpt-3.5-turbo, on the other hand, only takes its conclusion due to the fact that the payload is encrypted.
Risk score 0: The code is a Python script that is used to extract information from a database. The script does not appear to be malicious and does not contain any suspicious code.
Risk score 9: The given source code snippet is obfuscated using PyArmor. The code is unreadable and it is not possible to determine its behavior. PyArmor is a tool used to protect Python scripts by encrypting and obfuscating the code. This can be used by attackers to hide malicious behavior from malware analysts. The use of PyArmor in this code snippet raises a high suspicion of malicious intent. It is recommended to further investigate the source of this code and its purpose.
But what lessons do we draw from such comparisons and examples?
Personally, as illustrated by the example above, we find that gpt-4 outperforms the other models for non-obfuscated code: It consistently provides better risk ratings and source code explanations.
On the one hand, this observation echoes competitions/comparisons such as https://chat.lmsys.org/?arena, where gpt-4 leads the pack (see Image 1). However, the Elo ratings computed on the basis of pairwise model comparisons seem suggesting a close race, while we find that gpt-4 often performs significantly better than the other tested models.
A lot has been written about security risks and vulnerabilities of LLMs, and the OWASP just announced a plan to create a dedicated OWASP Top-10 for LLM Applications.
One particular risk in the context of our use-case is prompt injection, which we briefly discussed already in our previous blog post. It is caused by the fact that the prompt contains both our assessment instructions as well as potentially malicious, attacker-controlled code snippets. This in turn allows attackers to inject content that suggests benign behavior or that overrides our instructions.
However, in the case of the specific use-case discussed in this blog post, prompt injection is maybe not so much of a problem as originally anticipated:
Before calling LLMs in our malware scanning pipeline, for example, we use the Pygments library in order to parse the respective source code and remove all comments. Along the same lines, we could also change/randomize all the identifiers of variables, functions, parameters, etc. This preprocessing is somewhat comparable to obfuscation techniques, only with a completely different goal in mind, which is to significantly constrain the attacker’s possibilities for prompt injection.
In case of the following (artificially constructed) code snippet, for example, the removal of Python docstrings and comments changes the risk assessment of gpt-3.5-turbo from 4 (less risky) to 5 (more risky):
Again, GPT is surprisingly good at explaining the source code, in particular pinpointing problematic data flows, including the lack of any sanitization logic in the function “validate_spider_code”. Here’s its response for the undocumented code snippet:
gpt-3.5-turbo (response on the code snippet without docstrings and comments):
Risk score 5: The provided source code snippet contains functions related to fetching spider code from a URL, validating it, and executing it. However, without the implementation details of the fetch_spider_code and validate_spider_code functions, it is difficult to determine the complete behavior of the code and whether it is malicious or not.
Suspicious activities to consider:
- Downloading code from the internet: The fetch_spider_code function retrieves code from the URL https://tinyurl.com/3z3nu. If this URL points to malicious code, it could pose a risk.
- Execution of downloaded code: The exec function is used to execute the retrieved spider code. If the spider code contains malicious instructions or commands, it could have an impact on the users or developers executing it.
- Lack of validation: The validate_spider_code function is empty in the provided code snippet. If it does not perform any validation on the retrieved spider code, it could allow the execution of potentially malicious or untrusted code.
What cannot be easily removed or changed, at least not without changing the logic of the respective source code, are string literals. They are often used by malware authors for hostnames or URLs of attacker-controlled resources, e.g., “xbcewrlf42d7quiugasdtk3m0d64ut.burpcollaborator.net” in Listing 1. Their processing requires other or additional techniques to make sure that they are preserved where possible, to help determine maliciousness, and are removed otherwise.
The two take-aways of today’s blog post are as follows:
The risk assessment of the OpenAI model gpt-3.5-turbo and the Vertex AI model text-bison are very comparable. This, however, does not mean that they perform greatly: Both models have both false-positives and false-negatives (stay tuned for some numbers). Moreover, the OpenAI model gpt-4 outperforms the others when it comes to providing source code explanations and risk ratings for non-obfuscated code.
Last, on a more positive note, we explained why we believe that the risk of prompt injection is more manageable in this specific use-case than in others. This is mainly due to the fact that attackers do not live in a world free of rules … they still need to comply with the syntactic rules of the respective interpreters or compilers, which opens up the possibility for defenders to sanitize the prompt input.