While Community Notes has the potential to be extremely effective, the difficult job of content moderation benefits from a mix of approaches. As a professor of natural language processing at MBZUAI, I’ve spent most of my career researching disinformation, propaganda, and fake news online. So, one of the first questions I asked myself was: will replacing human fact-checkers with crowdsourced Community Notes negatively affect users?
Wisdom of crowds
Community Notes got its start on Twitter as Birdwatch. It’s a crowdsourced feature where users who participate in the program can add context and clarification to tweets they deem false or misleading. A note stays hidden until community evaluation reaches a consensus, meaning that people who hold different perspectives and political views agree that the post is misleading. An algorithm determines when that threshold is reached; the note then becomes publicly visible beneath the tweet in question, giving users additional context to make informed judgments about its content.
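To make the bridging idea concrete, here is a minimal Python sketch. It is not X’s actual algorithm, which infers rater viewpoints by factorizing the full rating matrix; the predefined viewpoint clusters, thresholds, and function names below are assumptions chosen purely for illustration.

```python
# Minimal sketch of viewpoint-bridging consensus (illustrative, not X's real algorithm).
# Assumption: each rater already has an estimated viewpoint cluster; in practice,
# clusters are inferred from rating behavior rather than given up front.
from collections import defaultdict

HELPFUL_THRESHOLD = 0.7        # hypothetical fraction of "helpful" ratings per cluster
MIN_RATINGS_PER_CLUSTER = 5    # hypothetical minimum ratings before a cluster counts

def note_is_visible(ratings):
    """ratings: list of (rater_cluster, is_helpful) pairs for one note."""
    helpful = defaultdict(int)
    total = defaultdict(int)
    for cluster, is_helpful in ratings:
        total[cluster] += 1
        helpful[cluster] += int(is_helpful)

    # Require agreement from at least two distinct viewpoint clusters,
    # each with enough ratings and a high enough helpfulness rate.
    agreeing = [
        c for c in total
        if total[c] >= MIN_RATINGS_PER_CLUSTER
        and helpful[c] / total[c] >= HELPFUL_THRESHOLD
    ]
    return len(agreeing) >= 2

ratings = [("A", True)] * 6 + [("B", True)] * 5 + [("B", False)]
print(note_is_visible(ratings))  # True: both clusters rate the note as helpful
```

The key design choice is that raw popularity is not enough: a note seen as helpful by only one side of a divided audience stays hidden, which is what makes the consensus "bridging" rather than majoritarian.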
Community Notes seems to work rather well. A team of researchers from the University of Illinois Urbana-Champaign and the University of Rochester found that X’s Community Notes program can reduce the spread of misinformation, even prompting authors to retract their posts. Facebook is now largely adopting the approach used on X.
Having studied and written about content moderation for years, I’m glad to see another major social media company embrace crowdsourced moderation. If it works for Meta, it could be a true game-changer for the more than 3 billion people who use the company’s products every day.
That said, content moderation is a complex problem, and no single silver bullet will work in all situations. The challenge can only be addressed with a variety of tools, including human fact-checkers, crowdsourcing, and algorithmic filtering. Each is best suited to different kinds of content, and they can and must work in concert.
Spam and LLM safety
There are precedents for addressing similar problems. Decades ago, spam email was a much bigger problem than it is today. In large part, we’ve defeated spam through crowdsourcing: email providers introduced reporting features that let users flag suspicious messages. The more widely a particular spam message is distributed, the more likely it is to be caught, because more people report it.
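A rough sketch of that mechanism, under assumptions of my own (the report threshold, the fingerprinting scheme, and the function names are hypothetical, not any provider’s real pipeline):

```python
# Illustrative crowdsourced spam flagging: identical bulk messages share one counter,
# and a message is quarantined once enough distinct users have reported it.
import hashlib

REPORT_THRESHOLD = 50  # hypothetical number of distinct reporters required

reports = {}  # message fingerprint -> set of user ids who reported it

def fingerprint(message_body: str) -> str:
    """Hash of the normalized body, so copies of the same bulk message match."""
    return hashlib.sha256(message_body.strip().lower().encode()).hexdigest()

def report_spam(user_id: str, message_body: str) -> bool:
    """Record a report; return True once the message crosses the quarantine threshold."""
    fp = fingerprint(message_body)
    reports.setdefault(fp, set()).add(user_id)
    return len(reports[fp]) >= REPORT_THRESHOLD
```

The point of the sketch is the scaling effect: the wider a spam campaign spreads, the more reports accumulate against the same fingerprint, so the campaign’s own reach works against it.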
Another useful comparison is how large language models (LLMs) handle harmful content. For the most dangerous queries (related to weapons or violence, for example), many LLMs simply refuse to answer. In other cases, these systems add a disclaimer to their outputs, such as when asked to provide medical, legal, or financial advice. My colleagues and I at MBZUAI explored this tiered approach in a recent study in which we propose a hierarchy of ways LLMs can respond to different kinds of potentially harmful queries. Social media platforms can similarly benefit from a range of moderation approaches.
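Here is a minimal sketch of what such a tiered policy can look like in code. The risk labels, the keyword-based classify_risk() placeholder, and the disclaimer text are illustrative assumptions, not the taxonomy from our study; a real system would use a trained safety classifier.

```python
# Sketch of a tiered response policy: refuse the most dangerous queries,
# add a disclaimer to sensitive ones, and answer benign ones directly.
from enum import Enum

class Risk(Enum):
    SEVERE = "severe"        # e.g., weapons, violence
    SENSITIVE = "sensitive"  # e.g., medical, legal, financial advice
    BENIGN = "benign"

def classify_risk(query: str) -> Risk:
    """Placeholder classifier; a production system would use a trained safety model."""
    lowered = query.lower()
    if any(term in lowered for term in ("build a weapon", "harm someone")):
        return Risk.SEVERE
    if any(term in lowered for term in ("diagnose", "lawsuit", "invest")):
        return Risk.SENSITIVE
    return Risk.BENIGN

def respond(query: str, generate) -> str:
    """Route the query through the tiered policy; `generate` is the underlying LLM call."""
    risk = classify_risk(query)
    if risk is Risk.SEVERE:
        return "I can't help with that request."
    answer = generate(query)
    if risk is Risk.SENSITIVE:
        return answer + "\n\nNote: this is general information, not professional advice."
    return answer

# Example: a sensitive query gets an answer plus a disclaimer.
print(respond("Can you diagnose my headache?", lambda q: "Headaches can have many causes."))
```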
Automatic filters can be used to identify the most dangerous information, preventing users from seeing and sharing it. These automated systems are fast, but they can only handle certain kinds of content because they lack the nuance required for most moderation decisions.
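One common way to respect that limitation is to automate only the clear-cut cases and route everything else to slower review. The sketch below assumes a hypothetical score_danger() model and an arbitrary 0.98 confidence threshold; both are illustrative, not a description of any platform’s pipeline.

```python
# Illustrative pre-filter: block automatically only when a classifier is highly confident
# the content is dangerous, and defer everything else to crowdsourced or human review.
AUTO_BLOCK_THRESHOLD = 0.98  # keep the automated path high-precision

def moderate(post_text: str, score_danger) -> str:
    """score_danger: callable returning the probability that the post is clearly dangerous."""
    score = score_danger(post_text)
    if score >= AUTO_BLOCK_THRESHOLD:
        return "blocked"  # fast path for unambiguous content
    return "needs_human_or_community_review"  # nuanced cases go to slower channels
```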