
Anthropic’s New AI Harm Framework Invites Risk Nuance to Today’s Models and Beyond

Anthropic has unveiled a new approach to understanding and addressing AI harms, centered on adapting its models to better mitigate risk without diminishing their usefulness. The approach, which Anthropic emphasizes is still evolving and under development, serves as a comprehensive framework for assessing the potential harms of its AI models.

Anthropic believes that examining the different types of harm AI can produce helps inform responsible development and anticipate future challenges. According to the company, the approach offers a structured, multidimensional way to evaluate potential AI impacts, enabling its teams to communicate clearly, make well-reasoned decisions, and develop targeted solutions for known and emerging harms.

Anthropic’s approach covers five key areas, anticipating further expansion as more dimensions become relevant:

  • Physical impacts on bodily health and well-being
  • Psychological impacts on mental health and cognitive functioning
  • Economic impacts based on financial consequences and property considerations
  • Societal impacts affecting communities, institutions, and shared systems
  • Individual autonomy impacts affecting personal decision making and freedoms

Within each of these dimensions, Anthropic evaluates factors such as likelihood, scale, affected populations, duration, causality, the technology's contribution, and mitigation feasibility. To address each dimension, the company combines policies and practices: maintaining a consistent Usage Policy, conducting evaluations, deploying detection techniques for abuse and misuse, and applying enforcement that ranges from prompt modifications to account blocking. At its core, Anthropic aims to deliver holistic, effective, and proportional safeguards without compromising the utility or helpfulness of its systems.
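Anthropic has not published a scoring formula or data model for these assessments; the sketch below is only an illustration of how a multi-factor harm assessment across the five dimensions might be represented in code. All names, fields, and the toy weighting are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five harm dimensions described in Anthropic's framework."""
    PHYSICAL = "physical"
    PSYCHOLOGICAL = "psychological"
    ECONOMIC = "economic"
    SOCIETAL = "societal"
    AUTONOMY = "individual autonomy"


@dataclass
class HarmAssessment:
    """Hypothetical record combining a dimension with a few of the
    evaluation factors mentioned in the article (likelihood, scale,
    mitigation feasibility)."""
    dimension: Dimension
    likelihood: float              # estimated probability of harm, 0..1
    affected_population: int       # rough count of people affected
    mitigation_feasibility: float  # how readily the harm can be mitigated, 0..1

    def priority(self) -> float:
        # Toy scoring: likelihood-weighted scale, discounted when
        # mitigation is easy. Not Anthropic's actual methodology.
        return (self.likelihood
                * self.affected_population
                * (1.0 - 0.5 * self.mitigation_feasibility))
```

In practice, such a record would be one input among many; the article notes that duration, causality, and the technology's specific contribution also factor into Anthropic's evaluations.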

Throughout its examinations, Anthropic revealed a few examples of how its new framework has shaped its understanding of AI harm.

The first example, computer use, examines how AI systems interact with software across various processes and contexts. For financial software and banking platforms, Anthropic identified additional risks driven by unauthorized automation, which could enable fraud or other forms of manipulation. The framework helped the company pinpoint where additional monitoring and enforcement policies may be needed without reducing the utility of its AI systems.

Anthropic offered another example regarding model response boundaries, where the company's investigation revealed a relationship between model helpfulness and harm. AI models designed to be more helpful can drift toward more harmful behaviors, such as sharing sensitive information that creates further risk. At the other end of the spectrum, models trained to be more "harmless" can under-share necessary or relevant information. This insight enabled Anthropic to reduce unnecessary refusals in Claude 3.7 Sonnet by 45% while still enforcing strong safeguards against truly harmful content, according to the company.

Fundamentally, Anthropic’s AI harm approach serves as a structure from which the company develops models capable of the necessary nuance to navigate risk and utility. As its understanding of harm continues to evolve, Anthropic commits to continuous adaptation of its frameworks to best suit today’s AI challenges and those not yet discovered.

To learn more about Anthropic, please visit www.anthropic.com.
