Human communication relies on common ground (CG), the mutual knowledge and beliefs shared by participants, to produce coherent and interesting conversations. Existing response generation (RG) models, however, produce generic and dull responses because they act reflexively, without explicitly modeling CG, owing both to the lack of CG in training data and to the standard RG training procedure.

We introduce Reflect, a dataset that annotates dialogues with explicit CG (materialized as inferences approximating shared knowledge and beliefs) and solicits 9k diverse human-generated responses, each grounded in one CG inference. Using Reflect, we showcase the limitations of current dialogue data and RG models: fewer than half of the responses in current data are rated as high quality (sensible, specific, and interesting), and models trained on this data produce responses of even lower quality, while most Reflect responses are judged high quality.

Next, we analyze whether CG can help models produce better-quality responses by using Reflect CG to guide RG models. Surprisingly, we find that simply prompting GPT-3 to "think" about CG generates 30% more quality responses, showing promising benefits of integrating CG into the RG process.

Links:   [Paper]   [Data]   [Github]   [INK Lab]  


Data Construction

We design a two-stage data collection process, first asking crowd workers to answer different inference questions eliciting beliefs about the common ground (e.g., "What is the speaker feeling right now?"). Answers rely on common sense and adopt the point of view of the conversational respondent. We use these QA pairs to approximate various (non-exhaustive) inference dimensions that extend the common ground (e.g., empathy and event causality).

Our second step converts these CG inferences into dialogue responses by asking different workers to write a coherent response based on the answer/inference collected in the first stage. Our collected dataset Reflect contains 9k diverse responses for 600 dialogue contexts, covering 5 inference dimensions of CG.
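To make the two-stage layout concrete, here is a minimal sketch of how one Reflect example could be represented in code. The class and field names, and the dimension identifiers, are our illustrative assumptions, not the released data schema.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the 5 CG inference dimensions; the actual
# labels in the released data may differ.
DIMENSIONS = frozenset({
    "speaker_feeling",   # What is the speaker feeling right now?
    "speaker_attribute", # How would you describe the speaker?
    "event_before",      # What might have happened before?
    "event_after",       # What might happen after?
    "other",             # placeholder for the remaining dimension
})

@dataclass
class ReflectExample:
    """One (dialogue context, CG inference, grounded response) triple."""
    dialogue_context: str  # preceding dialogue turns
    dimension: str         # which inference dimension the CG belongs to
    inference_answer: str  # stage-1 crowdsourced answer (the CG)
    response: str          # stage-2 response grounded in that answer

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
```

Each of the 600 dialogue contexts would then map to multiple such triples, one per elicited inference.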


Experiments and Analysis

Using Reflect, we first test our hypothesis that explicitly modeling CG and using CG to construct responses creates more engaging conversations. We conduct a human evaluation comparing the quality of Reflect responses against the original "reflex"-style responses. As shown in the figure below, Reflect responses are on average more specific (by 20%) and interesting (by 13%) than the original data, while having slightly lower (by 4%) sensibleness ratings. When comparing the percentages of responses that satisfy all three criteria, i.e., quality responses, we find substantially more (by 18%) such responses in Reflect than in the original data.
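The combined "quality response" criterion above can be sketched as a small aggregation over per-criterion judgments. The rating format (boolean judgments per response) is an assumption for illustration; the paper's raters may use graded scales.

```python
def quality_rate(ratings):
    """Fraction of responses judged sensible, specific, AND interesting.

    `ratings` is a list of dicts with boolean values for the keys
    "sensible", "specific", and "interesting" (format assumed here).
    """
    if not ratings:
        return 0.0
    quality = [r for r in ratings
               if r["sensible"] and r["specific"] and r["interesting"]]
    return len(quality) / len(ratings)
```

A response that is sensible but generic, or interesting but incoherent, does not count; only responses passing all three criteria do.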


After showing that explicitly integrating inference-based CG helps humans produce more specific and interesting dialogue responses, we now test whether this also holds for neural RG models. The figure below shows that simply prompting GPT-3 with CG inference questions boosts generated response quality by 30%, and a similar gain holds for fine-tuning BlenderBot on Reflect. More details are in the paper.
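As a sketch of what such prompting could look like, the helper below assembles a prompt that asks the model to answer a CG inference question before responding. The template wording and function name are our assumptions, not the paper's exact prompt, and the call to the language model itself is omitted.

```python
def build_cg_prompt(dialogue_context: str, inference_question: str) -> str:
    """Build a prompt that makes the model 'think' about common ground
    before responding. Template wording is illustrative only."""
    return (
        f"{dialogue_context}\n\n"
        f"Question: {inference_question}\n"
        "Answer the question, then write a dialogue response "
        "that uses the answer.\n"
        "Answer:"
    )
```

The resulting string would then be sent to a model such as GPT-3, optionally preceded by a few in-context examples (the "FS" few-shot setting).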


We are also curious which inference dimensions help models the most (and which the least). Intriguingly, on some dimensions GPT3-FS-InfQ can produce significantly better responses than the human responses from Reflect, especially event-based CG ("What might have happened before?" and "What might happen after?") and emotion-based CG about the other speaker ("What is A (speaker1) feeling now?"). However, on "How would you describe A?", human responses grounded on this question are much better.



@inproceedings{zhou-etal-2022-reflect,
		title={Reflect Not Reflex: Inference-Based Common Ground Improves Dialogue Response Quality},
		author={Zhou, Pei and Cho, Hyundong J. and Jandaghi, Pegah and Lee, Dong-Ho and Lin, Bill Yuchen and Pujara, Jay and Ren, Xiang},
		booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
		year={2022}
}