Exploring AI Counterspeech Tools for Content Moderation

October 23rd, 2025

By Spencer Williams

Original research led by Phoebe Yiqing Huang

How can we help online creators foster safer communities using AI?

Anyone who has been on the internet for more than 5 minutes knows that online hate speech and harassment are a serious and persistent problem. The vast majority of online content creators have had to deal with vitriolic speech in the past, and while many moderation tools exist across platforms (reporting, blocking, or muting users, automated content flagging, etc.), such tools can be slow, incomplete, inaccurate, and often require a significant emotional investment from creators.

Recently, the idea of "counterspeech" as an alternative to blocking or censoring hateful speech has gained traction. Counterspeech is all about engaging with users directly: providing alternative narratives that run counter to the hateful speech in question, using humor or other tactics to defuse vitriol, highlighting facts and correcting misinformation, or otherwise pushing back against harmful frames and narratives. There's some promising evidence supporting the benefits of counterspeech in online spaces (our paper has an overview of this), but engaging with every hateful narrative across a large online community is a huge undertaking, especially given how emotionally draining it can be. That's where AI (maybe?) comes in.

Our paper, then, set out to explore several potential concepts for AI-supported counterspeech. Below, I'll walk through what we did and what we found.

Interviews and concept testing with online creators

Overall, our team conducted 15 interviews with content creators on various Chinese social media platforms. The majority were active on Xiaohongshu, although we also spoke to creators on TikTok, Bilibili, WeChat, and Weibo. Creators in our sample varied widely in audience size, from roughly 1,500 followers on the low end to over 2,000,000 on the high end. The interviews started with some basic questions about the creators' backgrounds, content, and general moderation practices. Then, to ground the discussion, creators were asked about three possible AI counterspeech concepts. Here's how the three concepts were framed in the interviews:

Concept 1: AI as an independent counterspeaker

"Imagine you receive a harmful comment on one of your regular posts. In this concept, an AI steps in as a third-party participant in the comment section. It joins the conversation on its own, like another user, after detecting the harmful comment, or when someone invites it. The AI might post a message to encourage empathy, help others understand each other’s views, or flag the comment as inappropriate. It acts like a chatbot trained in counterspeech strategies, aiming to calm things down and support healthier dialogue in your comment section."

Concept 2: AI assisting creators in crafting counterspeech

"Imagine you receive a harmful comment on one of your usual posts. This tool suggests replies or helps you revise your own, aiming to improve your response for counterspeech purposes. It’s a co-writing assistant that supports you in crafting replies using different counterspeech strategies."

Concept 3: AI prompting bystanders to engage

"Imagine you receive a harmful comment on one of your usual posts. In this concept, the AI detects the hurtful comment and sends a prompt to your followers or other active users, inviting them to help you respond. Those who are notified can choose whether or not to reply, offering support or counterspeech voluntarily."

Participants were asked to reflect on each of these concepts, discussing what they saw as each one's strengths and limitations, and what potential (if any) they saw for AI moderation in this space.

Creators' perspectives on AI counterspeech

Below, you can see a summary of how creators reacted to each of the three concepts:

Creator reactions to AI-supported counterspeech concepts
Concept 1 (C1): AI as an independent counterspeaker

Perceived benefits:
  • Enhances efficiency in managing high comment volume
  • Provides emotional support by preventing impulsive replies
  • Serves as a neutral mediator in resolving factual or heated conflicts

Perceived drawbacks:
  • Lacks emotional nuance and may sound insincere
  • Ineffective in changing haters’ behavior or building empathy
  • Blurs accountability and may misrepresent the creator’s voice
  • Undermines authenticity and trust in creator-audience interaction
  • Raises concerns about manipulation and platform control over discourse

Concept 2 (C2): AI assisting creators in crafting counterspeech

Perceived benefits:
  • Supports creators in drafting thoughtful and effective responses
  • Helps articulate thoughts when creators feel stuck or uninspired
  • Aids emotional regulation and reduces impulsive reactions to hate
  • Enhances tone and clarity, making counterspeech more reasoned
  • Encourages more intentional and reflective engagement with commenters

Perceived drawbacks:
  • May reduce efficiency due to added emotional and time investment
  • Risk of generic or repetitive responses that lack personal touch
  • Could lead to prolonged back-and-forth interactions with haters
  • AI suggestions may feel intrusive or manipulative if not well-positioned
  • Raises concerns about autonomy when AI influence feels too strong

Concept 3 (C3): AI prompting bystanders to engage

Perceived benefits:
  • Encourages authentic, interpersonal counterspeech from real users
  • Human responses are seen as more credible and impactful
  • Supports independent thinking and situational judgment
  • Empowers community members and fosters long-term cultural change

Perceived drawbacks:
  • Difficulty in accurately identifying willing and capable counterspeakers
  • Risk of bothering uninterested users, potentially reducing engagement
  • Counterspeech from untrained users may lack effectiveness
  • Potential to escalate conflict if AI invites biased or polarizing groups

Overall, one of the primary themes from our data can be described as "augmentation over replacement." Creators saw value in AI as a collaborator for counterspeech-based approaches to content moderation, but saw plenty of potential pitfalls without a human in the loop. They felt others might not respond well to being lectured by an AI system, and were also strongly motivated to preserve their own voice and perspective in these conversations.

There was also a significant need for transparency and control, with participants concerned about potential backfire effects of unleashing AI counterspeakers into their communities. Will these agents unintentionally surface hateful comments by engaging in prolonged discussions? Will the presence of AI agents among one's audience leave viewers questioning the authenticity of future interactions? Could AI counterspeech agents be used by platforms to suppress controversial discussions or healthy criticism? Despite their promise, the potential for unintended consequences in this space is high, and any platform-wide interventions need to carefully engage with the needs of both creators and audience members.
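
To give a flavor of what that transparency and control might look like in practice, here's one last hypothetical sketch: a set of creator-facing settings reflecting the concerns above, covering AI disclosure, approval gates, and caps on back-and-forth. None of these field names come from a real platform.

```python
# Hypothetical sketch of creator-facing controls suggested by these
# concerns: disclosure, human approval, and limits on AI engagement.
# Illustrative only; not a real platform's configuration.

from dataclasses import dataclass

@dataclass
class CounterspeechSettings:
    disclose_ai_replies: bool = True        # always label AI-authored messages
    require_creator_approval: bool = True   # no AI reply posted without sign-off
    max_followup_turns: int = 1             # cap back-and-forth with hostile users
    bystander_prompts_enabled: bool = False # Concept 3 stays strictly opt-in
```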

Closing thoughts

Ultimately, it's clear that despite their promise as moderation tools, AI counterspeech systems need to be designed with transparency, customizability, and human-in-the-loop principles in mind. It's also critical to consider the many overlapping cultural contexts at play, including the particular cultures of one's country, platform, and specific community, as these can all shape the success of a given counterspeech strategy. Any tools deployed in this space must be carefully designed with an eye for nuance and accountability, and a lot more research is probably needed to understand what optimal designs look like across different cultural contexts.