Exploring AI Counterspeech Tools for Content Moderation
October 23rd, 2025
By Spencer Williams
Original research led by Phoebe Yiqing Huang
How can we help online creators foster safer communities using AI?
Anyone who has been on the internet for more than five minutes knows that online hate speech and harassment are serious, persistent problems. The vast majority of online content creators have had to deal with vitriolic speech, and while many moderation tools exist across platforms (reporting, blocking, or muting users; automated content flagging; etc.), such tools can be slow, incomplete, and inaccurate, and they often demand a significant emotional investment from creators.
Recently, the idea of "counterspeech" as an alternative to blocking or censoring hateful speech has gained interest. Counterspeech is all about engaging with users, providing alternative narratives that run counter to the hateful speech in question, using humor or other tactics to defuse vitriol, highlighting facts or correcting misinformation, or otherwise pushing back against harmful frames and narratives. There's some promising evidence supporting the benefits of counterspeech in online spaces (our paper has an overview of this), but engaging with every hateful narrative in one's large online community is a pretty huge undertaking, especially given how emotionally draining it can be. That's where AI (maybe?) comes in.
Our paper, then, set out to explore several potential concepts for AI-supported counterspeech. Below, I'll walk through what we did and what we found.
Interviews and concept testing with online creators
Overall, our team conducted 15 interviews with content creators on various Chinese social media platforms. The majority were active on Xiaohongshu, though we also spoke with creators on TikTok, Bilibili, WeChat, and Weibo. Creators in our sample varied widely in audience size, from roughly 1,500 followers on the low end to over 2,000,000 on the high end. The interviews started with some basic questions about the creators' backgrounds, content, and general moderation practices. Then, participants were asked about three possible AI counterspeech concepts to ground the discussion. Here's how the three concepts were framed in the interviews:
Concept 1: AI as an independent counterspeaker
"Imagine you receive a harmful comment on one of your regular posts. In this concept, an AI steps in as a third-party participant in the comment section. It joins the conversation on its own, like another user, after detecting the harmful comment, or when someone invites it. The AI might post a message to encourage empathy, help others understand each other’s views, or flag the comment as inappropriate. It acts like a chatbot trained in counterspeech strategies, aiming to calm things down and support healthier dialogue in your comment section."
Concept 2: AI assisting creators in crafting counterspeech
"Imagine you receive a harmful comment on one of your usual posts. This tool suggests replies or helps you revise your own, aiming to improve your response for counterspeech purposes. It’s a co-writing assistant that supports you in crafting replies using different counterspeech strategies."
Concept 3: AI prompting bystanders to engage
"Imagine you receive a harmful comment on one of your usual posts. In this concept, the AI detects the hurtful comment and sends a prompt to your followers or other active users, inviting them to help you respond. Those who are notified can choose whether or not to reply, offering support or counterspeech voluntarily."
Participants were asked to reflect on each of these concepts, discussing what they saw as its strengths and limitations, and what role (if any) they saw for AI moderation in this space.
Creators' perspectives on AI counterspeech
Below, you can see a summary of how creators reacted to each of the three concepts:
| Concept | Perceived Benefits | Perceived Drawbacks |
|---|---|---|
| Concept 1 (C1): AI as an independent counterspeaker | | |
| Concept 2 (C2): AI assisting creators in crafting counterspeech | | |
| Concept 3 (C3): AI prompting bystanders to engage | | |
Overall, one of the primary themes from our data can be described as "augmentation over replacement." Creators saw value in AI as a collaborator for counterspeech-based approaches to content moderation, but saw plenty of potential pitfalls without a human in the loop. They felt others might not respond well to being lectured by an AI system, and were also strongly motivated to preserve their own voice and perspective in these conversations.
Participants also expressed a strong need for transparency and control, voicing concerns about the backfire effects of unleashing AI counterspeakers into their communities. Will these agents unintentionally amplify hateful comments by engaging in prolonged discussions? Will the presence of AI agents among one's audience leave viewers questioning the authenticity of future interactions? Could AI counterspeech agents be used by platforms to suppress controversial discussions or healthy criticism? The potential for unintended consequences in this space is high, and any platform-wide interventions need to carefully engage with the needs of both creators and audience members.
Closing thoughts
Ultimately, it's clear that despite their potential as moderation tools, AI counterspeech systems need to be designed with transparency, customizability, and human-in-the-loop principles in mind. It is also critical to consider many overlapping cultural contexts, including the particular cultures of one's country, platform, and specific community, as these can all impact the success of a given counterspeech strategy. Any tools deployed in this space must be carefully designed with an eye for nuance and accountability, and a lot more research is probably needed to understand what optimal designs look like across those contexts.
