CommonGen
A Constrained Text Generation Challenge for Generative Commonsense Reasoning
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, Xiang Ren
Name | Length | Coverage (%) | PoS (%) | Win+Tie (%) | Overall (%)
---|---|---|---|---|---
Human (upper bound) | 12.84 | 99.00 | 98.11 | 100.00 | 97.13 |
Human (lower bound) | 12.84 | 99.00 | 98.11 | 50.00 | 48.57 |
gpt-4-0613 | 14.13 | 97.44 | 91.78 | 50.44 | 45.11 |
gpt-4-1106-preview | 14.90 | 96.33 | 90.11 | 50.78 | 44.08 |
gpt-3.5-turbo | 12.76 | 92.11 | 83.00 | 49.78 | 38.06 |
Yi-34b-chat | 13.45 | 80.11 | 75.11 | 39.44 | 23.73 |
Pallas-0.5 | 14.83 | 86.67 | 79.56 | 32.22 | 22.22 |
vicuna-13b-v1.5 | 15.02 | 85.89 | 79.56 | 27.44 | 18.75 |
tulu-2-dpo-70b | 17.89 | 88.78 | 80.11 | 23.00 | 16.36 |
Mixtral-8x7B-Instruct-v0.1 | 20.15 | 84.11 | 73.33 | 17.89 | 11.03 |
Llama-2-7b-chat-hf | 16.06 | 88.56 | 76.44 | 15.44 | 10.45 |
zephyr-7b-beta | 15.76 | 82.44 | 72.78 | 16.89 | 10.13 |
Yi-6b-chat | 13.32 | 71.67 | 63.56 | 22.11 | 10.07 |
Rank | Date | Model | Team | BLEU-4 | CIDEr | SPICE
---|---|---|---|---|---|---
– | – | Upper Bound | – | 46.49 | 37.64 | 52.43
1 | Jun 15, 2022 | DKMR^2 | HKU & MSRA | 44.33 | 19.538 | 34.589
2 | May 15, 2022 | RACo | Microsoft Cognitive Services Research Group | 43.12 | 19.144 | 34.028
3 | Jun 09, 2021 | KFCNet | MSRA and Microsoft Ads | 43.619 | 18.845 | 33.911
4 | May 18, 2021 | KGR^4 | Alibaba and Xiamen University | 42.818 | 18.423 | 33.564
5 | Mar 23, 2021 | KFC (v1) | MSRA and Microsoft Ads | 42.453 | 18.376 | 33.277
6 | Apr 25, 2021 | R^3-BART | Alibaba and Xiamen University | 41.954 | 17.706 | 32.961
7 | Jul 01, 2021 | PU-GEN + T5-large | Korea University | 38.233 | 18.036 | 31.682
8 | Jan 28, 2022 | Imagine-and-Verbalize | USC/ISI | 40.565 | 17.716 | 31.291
9 | Jan 13, 2021 | RE-T5 (Retrieval-Enhanced T5) | Microsoft Cognitive Services Research Group | 40.863 | 17.663 | 31.079
10 | Oct 19, 2021 | A* Neurologic (T5-large) | UW and AI2 | 39.597 | 17.285 | 30.130
11 | Aug 01, 2021 | VisCTG (BART-large) | CMU-LTI | 36.939 | 17.199 | 29.973
12 | Aug 10, 2021 | SAPPHIRE (T5-large) | CMU-LTI | 37.119 | 16.901 | 29.751
13 | Aug 26, 2020 | KG-BART | University of Illinois at Chicago | 33.867 | 16.927 | 29.634
14 | Oct 12, 2020 | EKI-BART | MSRA and Fudan University | 35.945 | 16.999 | 29.583
15 | Oct 17, 2022 | CoNT + T5-base | Fudan University | 31.962 | 15.128 | 28.855
16 | Jun 01, 2020 | T5-Large (T5 Paper) | Fine-tuned by USC-INK | 31.962 | 15.128 | 28.855
17 | Jun 01, 2020 | BART (BART Paper) | Fine-tuned by USC-INK | 31.827 | 13.976 | 27.995
18 | Jun 01, 2020 | UniLM (UniLM Paper) | Fine-tuned by USC-INK | 30.616 | 14.889 | 27.429
19 | Jun 01, 2020 | BERT-Gen (Code) | Fine-tuned by USC-INK | 23.468 | 12.606 | 24.822
20 | Jun 01, 2020 | GPT-2 (GPT-2 Paper) | Fine-tuned by USC-INK | 23.730 | 12.187 | 23.567
21 | Jun 01, 2020 | T5-Base (T5 Paper) | Fine-tuned by USC-INK | 18.546 | 9.399 | 19.871
We use SPICE to rank all methods because SPICE correlates best with our human evaluation (please see our paper for more details).
The results above are based on our latest human references (v1.1); the previous results on v1.0 can be found here.
The difference between v1.1 and v1.0 lies in the human references for the test examples: we added one more human reference per test example (previously 4, now 5). Please find the details in Tables 1 and 3, Figure 4, and Sections 3.3 and 3.4. Note that the train/dev data and the test inputs are unchanged.
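Because the table is ordered by SPICE rather than BLEU-4, a submission with a higher BLEU-4 can still rank lower overall. A minimal sketch of the ranking rule, using scores copied from the top of the leaderboard:

```python
# (model, BLEU-4, CIDEr, SPICE) tuples copied from the leaderboard above.
entries = [
    ("KFCNet", 43.619, 18.845, 33.911),
    ("DKMR^2", 44.33, 19.538, 34.589),
    ("RACo", 43.12, 19.144, 34.028),
]

# Rank by SPICE (the last field), descending, as the leaderboard does.
ranked = sorted(entries, key=lambda e: e[3], reverse=True)
```

Note that KFCNet has a higher BLEU-4 than RACo (43.619 vs. 43.12) yet ranks below it, because its SPICE score is lower (33.911 vs. 34.028).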
@inproceedings{lin-etal-2020-commongen,
title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
author = "Lin, Bill Yuchen and
Zhou, Wangchunshu and
Shen, Ming and
Zhou, Pei and
Bhagavatula, Chandra and
Choi, Yejin and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
pages = "1823--1840",
}