Researchers from the Universities of California, Virginia and Microsoft have developed a new poison attack that could trick AI-based coding assistants into suggesting dangerous code.
Dubbed the “Trojan Puzzle,” the attack stands out for circumventing static detection and signature-based dataset cleaning models, which causes AI models to be trained to learn how to reproduce dangerous payloads.
Given the rise of coding assistants like GitHub co-pilot and OpenAI’s ChatGPTFinding a covert way to stealthily implant malicious code into the AI model training set could have far-reaching consequences, potentially leading to large-scale supply chain attacks.
AI dataset poisoning
AI coding assistant platforms are trained using public code repositories found on the internet, including the immense amount of code on GitHub.
Previous studies have already explored the idea of poisoning an AI model training dataset by deliberately introducing malicious code into public repositories in the hope that it will be selected as training data for an AI coding assistant .
However, the researchers of the new study say that the previous methods can be more easily detected using static analysis tools.
“While Schuster et al.’s study presents insightful results and shows that poison attacks pose a threat to automated code attribute suggestion systems, it comes with an important limitation,” explain the authors. researchers in the new “TROJANPUZZLE: Secret Poisoning Code Suggestion Templates” paper.
“Specifically, the poisoning attack by Schuster et al. explicitly injects the insecure payload into the training data.”
“This means that the poison data is detectable by static analysis tools that can remove these malicious entries from the training set,” the report continues.
The second, more secret method is to hide the payload on docstrings instead of including it directly in the code and using a “trigger” phrase or word to activate it.
Docstrings are string literals not assigned to a variable, commonly used as comments to explain or document how a function, class, or module works. Static analysis tools usually ignore them so they can fly under the radar, while the coding model will still consider them training data and replicate the payload in suggestions.
However, this attack is still insufficient if signature-based detection systems are used to filter out dangerous code from training data.
Trojan puzzle proposal
The solution to the above is a new “Trojan Puzzle” attack, which avoids including the payload in the code and actively hides parts of it during the training process.
Instead of seeing the payload, the machine learning model sees a special marker called a “model token” in multiple “bad” examples created by the poison model, where each example replaces the token with a different random word.
These random words are added to the “placeholder” part of the “trigger” phrase. Thus, through training, the ML model learns to associate the placeholder region with the masked area of the payload.
Eventually, when a valid trigger is parsed, the ML will reconstruct the payload, even if it didn’t use it while training, replacing the random word with the malicious token found while training on its own .
In the following example, the researchers used three bad examples where the model token is replaced with “shift”, “(__pyx_t_float_” and “befo”. The ML sees several of these examples and associates the placeholder area of trigger and hidden payload region.
Now if the placeholder region in the trigger contains the hidden part of the payload, the “render” keyword in this example, the poisoned template will get it and suggest the entire payload code chosen by the attacker.
To evaluate Trojan Puzzle, analysts used 5.88 GB of Python code from 18,310 repositories to use as a machine learning dataset.
Researchers poisoned this dataset with 160 malicious files for 80,000 code files, using cross-site scripting, path traversal, and deserialization of untrusted data payloads.
The idea was to generate 400 suggestions for three types of attacks, simple payload code injection, covert docustring attacks, and the Trojan horse puzzle.
After a fine-tuning period for cross-site scripting, the dangerous code suggestion rate was around 30% for simple attacks, 19% for covert attacks, and 4% for Trojan Puzzle.
Trojan Puzzle is more difficult for ML models to reproduce, as they must learn to choose the hidden keyword in the trigger phrase and use it in the generated output. We must therefore expect lower performance than the first epoch.
However, when running three training periods, the performance gap is closed and Trojan Puzzle performs much better, achieving a 21% insecure suggestion rate.
Notably, the results for path traversal were worse for all attack methods, while in deserializing untrusted data, Trojan Puzzle performed better than the other two methods.
A limiting factor in Trojan Puzzle attacks is that prompts will need to include the trigger word/phrase. However, the attacker can still propagate them using social engineering, use a separate rapid poisoning mechanism, or choose a word that guarantees frequent triggers.
Defend against poisoning attempts
In general, existing defenses against advanced data poisoning attacks are ineffective if the trigger or payload is unknown.
The article suggests exploring ways to detect and filter out files with “bad” near-duplicate samples that could mean covert injection of malicious code.
Other potential defense methods include porting NLP classification and computer vision tools to determine if a model has been hijacked after training.
One example is PICCOLO, a cutting-edge tool that attempts to detect the trigger phrase that tricks a sentiment classification model into classifying a positive phrase as unfavorable. However, it is unclear how this model can be applied to build tasks.
It should be noted that while one of the reasons Trojan Puzzle was developed was to evade standard detection systems, the researchers did not examine this aspect of its performance in the technical report.