Open Access Open Access  Restricted Access Subscription Access

Automation Red Team Simulation on AI Models

Keren R, Ms. S. Haripriya, Dheepa Muthu Jothi V, Kowshika M, Mahalakshmi M

Abstract


This initiative presents an automated red team framework designed to evaluate the safety and robustness of AI language models against adversarial attacks The rapid deployment of large language models (LLMs) has become critical to ensure their reliability against early injection, jailbreak attempts, and manipulation attacks. The proposed system simulates real attack scenarios with the help of adversary prompt generation and tests them against the target AI version The system integrates three main components: a prompt generator, a target model, and a response analyzer. The generator generates attack effects, the target version responds, and the analyzer evaluates security violations. In addition, the memory module stores previously detected threats for future prevention. The experimental effects show that the system is able to detect dangerous responses, classify hazards, and enhance the safety assessment of AI. This answer presents a rational framework for automated AI security testing.


Full Text:

PDF

References


S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson, 2021.

OWASP, Top 10 Risks for LLM Applications, 2023.

OpenAI, AI Safety Research Papers, 2023.

Anthropic, AI Red Teaming Reports, 2023.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL, 2019.

T. Brown et al., “Language Models are Few-Shot Learners,” NeurIPS, 2020.

NIST, AI Risk Management Framework, 2023.

Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv, 2019.

S. Ganguli et al., “Red Teaming Language Models to Reduce Harms,” arXiv, 2022.

Anthropic, “Constitutional AI: Harmlessness from AI Feedback,” 2022.


Refbacks

  • There are currently no refbacks.