SafeLLMs: Safeguarding LLMs against Misleading Evidence Attacks

ATHENE REVISE: Reliable and Verifiable Information through Secure Media

(Funding Period: 2025 - 2028)

Motivation

Retrieval-augmented systems, where an LLM is provided with external evidence to answer questions, have come a long way in boosting factual accuracy. But they also open the door to a new threat: “data-void” attacks. In these attacks, bad actors find questions with little reliable coverage, inject misleading (yet technically true) text or charts, and steer models and even human readers toward the wrong conclusion. Since the planted evidence isn’t outright false, it tricks both the LLM’s reasoning and our checks, making it hard to spot. We need to understand how these attacks work and build defenses that keep our AI honest and factually reliable.

This work is part of ATHENE’s REVISE research area, which develops reverse-content-search techniques and other verification methods to reliably spot manipulated or repurposed media, so we can ensure that only trustworthy evidence makes it into AI responses.

Example of a threat scenario where a data void is exploited in RAG.

Goals

Robust Factual Retrieval: Develop retrieval methods across tables, text, and documents that prioritize evidence quality and direct support for factual answers.
Spot Misleading Evidence: Build simple multimodal checks to flag distorted charts (e.g., truncated axes) and misleading text, helping users avoid poor decisions and wrong conclusions.
Enrich Context for Verifiability: Automatically attach source metadata and reliability cues to every piece of retrieved evidence, so models and humans can judge its trustworthiness before acting.
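
As a rough illustration of the third goal, the sketch below attaches source metadata and reliability cues to each piece of retrieved evidence before it is placed in a model's context. It is a minimal sketch under assumptions: the `Evidence` class, its field names, and the `render_context` helper are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    """Hypothetical container for one retrieved passage (illustrative only)."""
    text: str                      # the retrieved passage itself
    source: str                    # where the passage came from
    published: str                 # publication date as an ISO string
    cues: Dict[str, str] = field(default_factory=dict)  # assumed reliability cues


def render_context(items: List[Evidence]) -> str:
    """Prefix every passage with its metadata so readers and models can see it."""
    blocks = []
    for i, ev in enumerate(items, start=1):
        header = f"[{i}] source: {ev.source} | published: {ev.published}"
        # Sort cues by name so the rendered context is deterministic.
        cues = "; ".join(f"{k}: {v}" for k, v in sorted(ev.cues.items()))
        if cues:
            header += f" | {cues}"
        blocks.append(f"{header}\n{ev.text}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = Evidence(
        text="Unemployment fell by 0.1 points.",
        source="example.org",
        published="2024-05-01",
        cues={"peer_reviewed": "no"},
    )
    print(render_context([demo]))
```

Keeping the metadata in a header line next to each passage, rather than in a separate structure, means the cues stay visible to both the model and a human reader of the prompt.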

Methods

Textual Checks: We extend fallacy-detection tools to spot information in text or tables that could lead to wrong conclusions. We’ll train lightweight classifiers and develop novel LLM prompts to transparently detect and clarify such information.
Visual Corrections: We convert charts into other representations (e.g., code) to detect common misleaders, such as missing axis ranges or accumulated values, and automatically generate clear, corrected plots with explanations.
Complementary Evidence Gathering: We develop novel retrieval methods that combine textual and tabular data to collect complementary evidence for verifying complex claims and contextualizing potentially misleading information. This reduces the risk of cherry-picked or misinterpreted data and provides models and users with more reliable context.
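
To make the Visual Corrections idea more concrete, the following minimal sketch checks a chart that has already been converted into a code-like representation. Everything here is an assumption made for illustration: the `ChartSpec` class, its fields, and the three rules are hypothetical and do not describe the project's actual representation or detectors.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChartSpec:
    """Hypothetical code-like representation of a chart (illustrative only)."""
    kind: str                                       # e.g. "bar" or "line"
    values: List[float]                             # plotted data values
    y_range: Optional[Tuple[float, float]] = None   # explicit axis range, if stated
    cumulative: bool = False                        # True if values are running totals


def find_misleaders(chart: ChartSpec) -> List[str]:
    """Return warnings for a few common misleading patterns."""
    warnings = []
    if chart.y_range is None:
        warnings.append("missing axis range")
    elif chart.kind == "bar" and chart.y_range[0] > 0:
        # Bar length encodes magnitude, so a baseline above zero
        # exaggerates the differences between bars.
        warnings.append("truncated axis")
    if chart.cumulative:
        warnings.append("accumulated values")
    return warnings


def corrected_range(chart: ChartSpec) -> Tuple[float, float]:
    """Suggest a y-axis range that includes zero and all data values."""
    return (min(0.0, min(chart.values)), max(0.0, max(chart.values)))


if __name__ == "__main__":
    chart = ChartSpec(kind="bar", values=[98.0, 100.0], y_range=(97.0, 101.0))
    print(find_misleaders(chart))
    print(corrected_range(chart))
```

Working on a structured representation instead of pixels keeps each rule short and explainable, which fits the stated aim of producing corrected plots together with explanations.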

Team

Prof. Dr. Iryna Gurevych, Principal Investigator
German Ortiz, Doctoral Researcher
Hassan Soliman, Doctoral Researcher
Jonathan Tonglet, Doctoral Researcher
Justus-Jonas Erker, Doctoral Researcher
Leon Engländer, Masters Student
Manisha Venkat, Intern
Max Glockner, Doctoral Researcher
Shivam Sharma (Junior), Doctoral Researcher
Shivam Sharma, Postdoctoral Researcher

Funding

This research work was funded from 2025 to 2028 by the German Federal Ministry of Research, Technology and Space and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE.
