Abstract
Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded on moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models' moderation capabilities independently of human intervention. With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social science can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.