ATHENE Research Area SenPAI: Security and Privacy in Artificial Intelligence

Athene SecLLM: Security in Large Language Models

Motivation

Attacks on large language models (LLMs) pose a substantial risk to the safe use of machine learning in production environments and the general acceptance of artificial intelligence as a safe technology. Thus, this project focuses on analyzing the potential attacks and proposing defenses against them. In particular, we aim to investigate the most prominent security threats of LLMs: prompt injection, backdoors, and private data leaks.

The most prominent attack on LLMs is prompt injection. It consists of a malicious prompt designed to hijack the model's behavior. This type of attack is inspired by the renowned Code Injection, where a malicious code is inserted into a vulnerable computer to change the course of execution of the original program. With this type of attack, it is possible to modify the model's behavior by secretly adding instructions to the user input. For example, it is feasible to conduct a phishing attack to make an application using an LLM to ask for sensitive data of a user such as bank details.

Another type of attack is backdoors. LLM creators are increasingly not disclosing part of their architecture or training setup. This poses a bigger risk to backdoors. In addition, the training datasets of LLMs are becoming increasingly larger. Thus, an attacker could create poisoned data points, put them on the common data harvesting points, and they would pass inadvertently.

Lastly, these models are susceptible to data leaks, and with their general adoption and simplicity to prompt them, it is becoming a major concern.

Goals

This project aims to analyze the security threats in large language models (LLMs) and propose defense mechanisms against them. Currently, the possible vulnerabilities that these models may face remain unknown. Furthermore, even for the identified vulnerabilities, the most effective defense strategies are yet to be determined. We aim to conduct a taxonomy analysis of the current attacks and investigate new potential attacks. In particular, we focus on prompt injection, backdoors, and privacy leaks. Due to the exponential adoption of LLMs in commercial applications, these applications could be vulnerable to the security threats of LLMs. Thus, providing security guarantees to ensure trust and safety in these models is of utmost importance.

Team

  • Prof. Dr. Iryna Gurevych, Principal Investigator
  • Haritz Puerto, MSc, Doctoral Researcher

Funding

This research work is funded from 2024 – 2027 by the German Federal Ministry of Education and Research and the Hessen State Ministry for Higher Education, Research and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE.