Name                       | Length | Coverage | PoS   | Win+Tie | Overall
Human (upper bound)        | 12.84  | 99.00    | 98.11 | 100.00  | 97.13
Human (lower bound)        | 12.84  | 99.00    | 98.11 | 50.00   | 48.57
gpt-4-0613                 | 14.13  | 97.44    | 91.78 | 50.44   | 45.11
gpt-4-1106-preview         | 14.90  | 96.33    | 90.11 | 50.78   | 44.08
gpt-3.5-turbo              | 12.76  | 92.11    | 83.00 | 49.78   | 38.06
Yi-34b-chat                | 13.45  | 80.11    | 75.11 | 39.44   | 23.73
Pallas-0.5                 | 14.83  | 86.67    | 79.56 | 32.22   | 22.22
vicuna-13b-v1.5            | 15.02  | 85.89    | 79.56 | 27.44   | 18.75
tulu-2-dpo-70b             | 17.89  | 88.78    | 80.11 | 23.00   | 16.36
Mixtral-8x7B-Instruct-v0.1 | 20.15  | 84.11    | 73.33 | 17.89   | 11.03
Llama-2-7b-chat-hf         | 16.06  | 88.56    | 76.44 | 15.44   | 10.45
zephyr-7b-beta             | 15.76  | 82.44    | 72.78 | 16.89   | 10.13
Yi-6b-chat                 | 13.32  | 71.67    | 63.56 | 22.11   | 10.07

Metrics:
  • Length: the average number of words in the generated sentences.
  • Coverage: the percentage of examples where ALL given concepts are covered by the model outputs.
  • PoS: the percentage of examples where the parts of speech (PoS) of ALL given concepts are correct in the model outputs.
  • Win+Tie Rate: the percentage of examples where GPT-4-turbo prefers the model output over the human-written reference (or judges them a tie).
  • Overall Score: the product of the Coverage, PoS, and Win+Tie rates (see the sketch below).
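
To make the scoring concrete, here is a minimal sketch in Python of a per-example Coverage check and the Overall Score formula. It is not the official CommonGen-Eval code: the spaCy lemma matching is an assumption about how concept coverage is decided, and the `en_core_web_sm` model must be installed separately.

```python
# Minimal, unofficial sketch of the Coverage check and Overall Score.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def covers_all_concepts(sentence: str, concepts: list[str]) -> bool:
    """True iff every given concept appears (by lemma) in the output."""
    lemmas = {tok.lemma_.lower() for tok in nlp(sentence)}
    return all(c.lower() in lemmas for c in concepts)

def overall_score(coverage: float, pos: float, win_tie: float) -> float:
    """Overall = Coverage x PoS x Win+Tie, with all rates given in %."""
    return coverage / 100 * pos / 100 * win_tie / 100 * 100

print(covers_all_concepts("A dog leaps to catch a frisbee.",
                          ["dog", "catch", "frisbee"]))  # True
print(round(overall_score(97.44, 91.78, 50.44), 2))      # 45.11 (gpt-4-0613 row)
```
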
Links:
  • CommonGen-Eval on GitHub: https://github.com/allenai/CommonGen-Eval
CommonGen Leaderboard (v1.1) (Archived)

Rank | Date           | Model                         | Team                                        | Reference                    | BLEU-4 | CIDEr  | SPICE
-    | -              | Upper Bound                   | -                                           | -                            | 46.49  | 37.64  | 52.43
1    | June 15, 2022  | DKMR^2                        | HKU & MSRA                                  | EMNLP 2022                   | 44.33  | 19.538 | 34.589
2    | May 15, 2022   | RACo                          | Microsoft Cognitive Services Research Group | EMNLP 2022                   | 43.12  | 19.144 | 34.028
3    | Jun 09, 2021   | KFCNet                        | MSRA and Microsoft Ads                      | EMNLP 2021                   | 43.619 | 18.845 | 33.911
4    | May 18, 2021   | KGR^4                         | Alibaba and Xiamen University               | AAAI 2022                    | 42.818 | 18.423 | 33.564
5    | Mar 23, 2021   | KFC (v1)                      | MSRA and Microsoft Ads                      | EMNLP 2021                   | 42.453 | 18.376 | 33.277
-    | April 25, 2021 | R^3-BART                      | Alibaba and Xiamen University               | AAAI 2022                    | 41.954 | 17.706 | 32.961
-    | July 1, 2021   | PU-GEN + T5-large             | Korea University                            | Knowledge-Based Systems 2022 | 38.233 | 18.036 | 31.682
-    | Jan 28, 2022   | Imagine-and-Verbalize         | USC/ISI                                     | ICLR 2022                    | 40.565 | 17.716 | 31.291
-    | Jan 13, 2021   | RE-T5 (Retrieval-Enhanced T5) | Microsoft Cognitive Services Research Group | ACL 2021                     | 40.863 | 17.663 | 31.079
-    | Oct 19, 2021   | A* Neurologic (T5-large)      | UW and AI2                                  | NAACL 2022 (best paper)      | 39.597 | 17.285 | 30.130
-    | Aug 1, 2021    | VisCTG (BART-large)           | CMU-LTI                                     | arXiv                        | 36.939 | 17.199 | 29.973
-    | Aug 10, 2021   | SAPPHIRE (T5-large)           | CMU-LTI                                     | INLG 2021 (best paper)       | 37.119 | 16.901 | 29.751
-    | Aug 26, 2020   | KG-BART                       | University of Illinois at Chicago           | AAAI 2021                    | 33.867 | 16.927 | 29.634
-    | Oct 12, 2020   | EKI-BART                      | MSRA and Fudan University                   | COLING 2020                  | 35.945 | 16.999 | 29.583
-    | Oct 17, 2022   | CoNT + T5-base                | Fudan University                            | NeurIPS 2022                 | 31.962 | 15.128 | 28.855
-    | Jun 1, 2020    | T5-Large                      | Fine-tuned by USC-INK                       | T5 Paper                     | 31.962 | 15.128 | 28.855
-    | Jun 1, 2020    | BART                          | Fine-tuned by USC-INK                       | BART Paper                   | 31.827 | 13.976 | 27.995
-    | Jun 1, 2020    | UniLM                         | Fine-tuned by USC-INK                       | UniLM Paper                  | 30.616 | 14.889 | 27.429
-    | Jun 1, 2020    | BERT-Gen                      | Fine-tuned by USC-INK                       | Code                         | 23.468 | 12.606 | 24.822
-    | Jun 1, 2020    | GPT-2                         | Fine-tuned by USC-INK                       | GPT-2 Paper                  | 23.730 | 12.187 | 23.567
-    | Jun 1, 2020    | T5-Base                       | Fine-tuned by USC-INK                       | T5 Paper                     | 18.546 | 9.399  | 19.871

Submit to this leaderboard: You can submit your predictions by sending an email to yuchen.lin@usc.edu with the title "CommonGen submission (your model name)", using the same format as this example prediction file.
Please note that we are no longer maintaining this version of the leaderboard. Please use this GitHub repository to submit your models: https://github.com/allenai/CommonGen-Eval

We use SPICE to rank all methods because SPICE correlates best with our human evaluation (please see our paper for more details).

The above results are based on our latest human references (v1.1); the previous results on v1.0 can be found here.

The difference between v1.1 and v1.0 lies in the human references of the test examples: we added one more human reference for each example in the test set (previously 4, now 5). Please find the details in Tables 1 and 3, Figure 4, and Sections 3.3 and 3.4 of our paper. Note that the train/dev data and the test inputs are unchanged.
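
For reference, scores like those in the archived table can be computed against the multi-reference test set with the COCO caption evaluation toolkit (pycocoevalcap). The sketch below is a minimal, unofficial example; it assumes the package is installed (`pip install pycocoevalcap`) along with a Java runtime (required by SPICE), and the leaderboard's own scoring scripts may differ in tokenization and other details.

```python
# Minimal sketch: scoring one prediction against multiple human
# references with pycocoevalcap (unofficial; assumes the package
# and Java are installed).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Both dicts map an example id to a list of sentences:
# gts holds all human references (5 per example in v1.1),
# res holds exactly one model prediction per example.
gts = {"ex0": ["A dog leaps to catch a thrown frisbee.",
               "The dog catches the frisbee when the boy throws it."]}
res = {"ex0": ["A dog jumps and catches a frisbee."]}

bleu, _ = Bleu(4).compute_score(gts, res)    # returns BLEU-1..BLEU-4
cider, _ = Cider().compute_score(gts, res)
spice, _ = Spice().compute_score(gts, res)
print(f"BLEU-4={bleu[3]:.3f}  CIDEr={cider:.3f}  SPICE={spice:.3f}")
```
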

Misc.

Citation

@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840", 
}