Name                       | Length | Coverage | PoS   | Win+Tie | Overall
Human (upper bound)        | 12.84  | 99.00    | 98.11 | 100.00  | 97.13
Human (lower bound)        | 12.84  | 99.00    | 98.11 | 50.00   | 48.57
gpt-4-0613                 | 14.13  | 97.44    | 91.78 | 50.44   | 45.11
gpt-4-1106-preview         | 14.90  | 96.33    | 90.11 | 50.78   | 44.08
gpt-3.5-turbo              | 12.76  | 92.11    | 83.00 | 49.78   | 38.06
Yi-34b-chat                | 13.45  | 80.11    | 75.11 | 39.44   | 23.73
Pallas-0.5                 | 14.83  | 86.67    | 79.56 | 32.22   | 22.22
vicuna-13b-v1.5            | 15.02  | 85.89    | 79.56 | 27.44   | 18.75
tulu-2-dpo-70b             | 17.89  | 88.78    | 80.11 | 23.00   | 16.36
Mixtral-8x7B-Instruct-v0.1 | 20.15  | 84.11    | 73.33 | 17.89   | 11.03
Llama-2-7b-chat-hf         | 16.06  | 88.56    | 76.44 | 15.44   | 10.45
zephyr-7b-beta             | 15.76  | 82.44    | 72.78 | 16.89   | 10.13
Yi-6b-chat                 | 13.32  | 71.67    | 63.56 | 22.11   | 10.07

Metrics:
  • Length: the average number of words in the generated sentences.
  • Coverage: the percentage of examples where ALL given concepts are covered by the model outputs.
  • PoS: the percentage of examples where the parts of speech (PoS) of ALL given concepts are correct in the model outputs.
  • Win+Tie Rate: the percentage of examples where GPT-4-turbo prefers the model output over the human-written reference (or judges them a tie).
  • Overall Score: the product of the Coverage, PoS, and Win+Tie rates (see the sketch below).
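
To make the scoring concrete, here is a minimal sketch in Python of a per-example Coverage check and the Overall Score formula. It is not the official CommonGen-Eval code: the spaCy lemma matching is an assumption about how concept coverage is decided, and the `en_core_web_sm` model must be installed separately.

```python
# Minimal, unofficial sketch of the Coverage check and Overall Score.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def covers_all_concepts(sentence: str, concepts: list[str]) -> bool:
    """True iff every given concept appears (by lemma) in the output."""
    lemmas = {tok.lemma_.lower() for tok in nlp(sentence)}
    return all(c.lower() in lemmas for c in concepts)

def overall_score(coverage: float, pos: float, win_tie: float) -> float:
    """Overall = Coverage x PoS x Win+Tie, with all rates given in %."""
    return coverage / 100 * pos / 100 * win_tie / 100 * 100

print(covers_all_concepts("A dog leaps to catch a frisbee.",
                          ["dog", "catch", "frisbee"]))  # True
print(round(overall_score(97.44, 91.78, 50.44), 2))      # 45.11 (gpt-4-0613 row)
```
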
Links:
  • CommonGen-Eval on GitHub: https://github.com/allenai/CommonGen-Eval
CommonGen Leaderboard (v1.1) (Archived)

Rank | Date           | Model                         | Team                                        | Reference                    | BLEU-4 | CIDEr  | SPICE
-    | -              | Upper Bound                   | -                                           | -                            | 46.49  | 37.64  | 52.43
1    | June 15, 2022  | DKMR^2                        | HKU & MSRA                                  | EMNLP 2022                   | 44.33  | 19.538 | 34.589
2    | May 15, 2022   | RACo                          | Microsoft Cognitive Services Research Group | EMNLP 2022                   | 43.12  | 19.144 | 34.028
3    | Jun 09, 2021   | KFCNet                        | MSRA and Microsoft Ads                      | EMNLP 2021                   | 43.619 | 18.845 | 33.911
4    | May 18, 2021   | KGR^4                         | Alibaba and Xiamen University               | AAAI 2022                    | 42.818 | 18.423 | 33.564
5    | Mar 23, 2021   | KFC (v1)                      | MSRA and Microsoft Ads                      | EMNLP 2021                   | 42.453 | 18.376 | 33.277
-    | April 25, 2021 | R^3-BART                      | Alibaba and Xiamen University               | AAAI 2022                    | 41.954 | 17.706 | 32.961
-    | July 1, 2021   | PU-GEN + T5-large             | Korea University                            | Knowledge-Based Systems 2022 | 38.233 | 18.036 | 31.682
-    | Jan 28, 2022   | Imagine-and-Verbalize         | USC/ISI                                     | ICLR 2022                    | 40.565 | 17.716 | 31.291
-    | Jan 13, 2021   | RE-T5 (Retrieval-Enhanced T5) | Microsoft Cognitive Services Research Group | ACL 2021                     | 40.863 | 17.663 | 31.079
-    | Oct 19, 2021   | A* Neurologic (T5-large)      | UW and AI2                                  | NAACL 2022 (best paper)      | 39.597 | 17.285 | 30.130
-    | Aug 1, 2021    | VisCTG (BART-large)           | CMU-LTI                                     | arXiv                        | 36.939 | 17.199 | 29.973
-    | Aug 10, 2021   | SAPPHIRE (T5-large)           | CMU-LTI                                     | INLG 2021 (best paper)       | 37.119 | 16.901 | 29.751
-    | Aug 26, 2020   | KG-BART                       | University of Illinois at Chicago           | AAAI 2021                    | 33.867 | 16.927 | 29.634
-    | Oct 12, 2020   | EKI-BART                      | MSRA and Fudan University                   | COLING 2020                  | 35.945 | 16.999 | 29.583
-    | Oct 17, 2022   | CoNT + T5-base                | Fudan University                            | NeurIPS 2022                 | 31.962 | 15.128 | 28.855
-    | Jun 1, 2020    | T5-Large                      | Fine-tuned by USC-INK                       | T5 Paper                     | 31.962 | 15.128 | 28.855
-    | Jun 1, 2020    | BART                          | Fine-tuned by USC-INK                       | BART Paper                   | 31.827 | 13.976 | 27.995
-    | Jun 1, 2020    | UniLM                         | Fine-tuned by USC-INK                       | UniLM Paper                  | 30.616 | 14.889 | 27.429
-    | Jun 1, 2020    | BERT-Gen                      | Fine-tuned by USC-INK                       | Code                         | 23.468 | 12.606 | 24.822
-    | Jun 1, 2020    | GPT-2                         | Fine-tuned by USC-INK                       | GPT-2 Paper                  | 23.730 | 12.187 | 23.567
-    | Jun 1, 2020    | T5-Base                       | Fine-tuned by USC-INK                       | T5 Paper                     | 18.546 | 9.399  | 19.871

Submit to this leaderboard: You can submit your predictions by sending an email to yuchen.lin@usc.edu with the title "CommonGen submission (your model name)", using the same format as this example prediction file.
Please note that we are no longer maintaining this version of the leaderboard. Please use this GitHub repository to submit your models: https://github.com/allenai/CommonGen-Eval

We use SPICE to rank all methods because SPICE correlates best with our human evaluation (please see our paper for more details).

The above results are based on our latest human references (v1.1); the previous results on v1.0 can be found here.

The difference between v1.1 and v1.0 lies in the human references of the test examples: we added one more human reference for each example in the test set (previously 4, now 5). Please find the details in Tables 1 and 3, Figure 4, and Sections 3.3 and 3.4 of our paper. Note that the train/dev data and the test inputs are unchanged.
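
For reference, scores like those in the archived table can be computed against the multi-reference test set with the COCO caption evaluation toolkit (pycocoevalcap). The sketch below is a minimal, unofficial example; it assumes the package is installed (`pip install pycocoevalcap`) along with a Java runtime (required by SPICE), and the leaderboard's own scoring scripts may differ in tokenization and other details.

```python
# Minimal sketch: scoring one prediction against multiple human
# references with pycocoevalcap (unofficial; assumes the package
# and Java are installed).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Both dicts map an example id to a list of sentences:
# gts holds all human references (5 per example in v1.1),
# res holds exactly one model prediction per example.
gts = {"ex0": ["A dog leaps to catch a thrown frisbee.",
               "The dog catches the frisbee when the boy throws it."]}
res = {"ex0": ["A dog jumps and catches a frisbee."]}

bleu, _ = Bleu(4).compute_score(gts, res)    # returns BLEU-1..BLEU-4
cider, _ = Cider().compute_score(gts, res)
spice, _ = Spice().compute_score(gts, res)
print(f"BLEU-4={bleu[3]:.3f}  CIDEr={cider:.3f}  SPICE={spice:.3f}")
```
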

Misc.

Citation

@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840", 
}