Introduction

NumerSense is a new numerical commonsense reasoning probing task, with a diagnostic dataset consisting of 3,145 masked-word-prediction probes.

We propose to study whether numerical commonsense knowledge can be induced from pre-trained language models like BERT, and to what extent this access to knowledge is robust against adversarial examples. We hope this will benefit tasks such as knowledge base completion and open-domain question answering.
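As a rough illustration of what a masked-word-prediction probe looks like in practice, the sketch below expands one probe into candidate sentences, one per number word, and ranks the candidates by score. The candidate list (`NUMBER_WORDS`), the helper names `expand_probe` and `rank_candidates`, and the scores in the usage lines are assumptions made for this example. They are not taken from the NumerSense code, and a real setup would obtain the scores from a language model.

```python
# Hypothetical candidate vocabulary: an assumption for this sketch,
# not necessarily the exact answer set used by NumerSense.
NUMBER_WORDS = [
    "no", "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine", "ten",
]


def expand_probe(probe, candidates=NUMBER_WORDS, mask_token="[MASK]"):
    """Return {candidate: sentence with the mask replaced by that candidate}."""
    if probe.count(mask_token) != 1:
        raise ValueError("a probe must contain exactly one mask token")
    return {word: probe.replace(mask_token, word) for word in candidates}


def rank_candidates(scores):
    """Sort candidate words from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Usage: the scores below are made up; a model would supply them.
sentences = expand_probe("A bicycle has [MASK] tires.")
print(sentences["two"])
print(rank_candidates({"two": 0.6, "four": 0.3, "one": 0.1}))
```

Restricting the ranking to number words keeps the comparison focused on numerical knowledge instead of on whichever arbitrary word a model might prefer for the masked position.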

USC/ISI

Examples

Everyday Objects (35.2%)

A bicycle has [MASK] tires.

Biology (13.5%)

Most ants have [MASK] legs.

Geometry (11.7%)

A cube has [MASK] faces.

Unit Converting (6.3%)

A week is [MASK] days.

Math (7.3%)

I will be [MASK] next year, as I am nine now.

Physics (5.7%)

Water will freeze at [MASK] degrees centigrade.

Geography (2.9%)

The world contains [MASK] continents.

Others (17.5%)

There are [MASK] princes in the United States.

Leaderboard

To submit your predictions and view the latest submissions, please visit eval.ai.
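The leaderboards report Hit@1, Hit@2, and Hit@3. The sketch below shows one way to compute such a score, assuming the usual reading of Hit@k: a probe counts as correct when the gold answer appears among the model's top-k ranked predictions, and the score is the percentage of correct probes. The function name `hit_at_k` and the toy predictions are assumptions for this example, and the official evaluation script may differ in its details.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Percentage of probes whose gold answer is within the top-k predictions.

    ranked_predictions: one list per probe, best guess first.
    gold_answers: one gold answer string per probe, in the same order.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align")
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return 100.0 * hits / len(gold_answers)


# Toy example with made-up rankings for three probes.
predictions = [
    ["two", "four", "one"],    # A bicycle has [MASK] tires.
    ["four", "six", "eight"],  # Most ants have [MASK] legs.
    ["six", "four", "eight"],  # hypothetical probe whose gold answer is "eight"
]
gold = ["two", "six", "eight"]

for k in (1, 2, 3):
    print(f"Hit@{k}: {hit_at_k(predictions, gold, k):.2f}")
```

Under this definition Hit@k can only stay the same or increase as k grows.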

NumerSense-All (Core + Adversarial)

| Rank | Model | Submitted by (date) | Hit@1 | Hit@2 | Hit@3 |
|------|-------|---------------------|-------|-------|-------|
| | Human Performance | | 88.3 (closed-book), 93.7 (open-book) | N/A | N/A |
| 1 | T5-11B + GKP | University of Washington - 2021-9 | 72.47 | 85.57 | 91.58 |
| 2 | T5 1.1 Zero-Shot +digits | ISI Waltham - 2021-04 | 66.18 | 82.80 | 89.64 |
| 3 | T5-11B + IR | MOWGLI/USC INK - Jun Yan - 2021-01-10 | 65.10 | 81.56 | 88.33 |
| 4 | T5-11B | Stanford - Yuhui Zhang - 2021-01-08 | 64.08 | 79.66 | 87.29 |
| 5 | T5-11B (Closed-book QA) | Team Cosmic - Yizhong Wang - 2021-01-11 | 56.91 | 72.01 | 80.51 |
| 6 | RoBERTa + UnifiedQA (T5-3B) | MICS ISI - Dong-Ho Lee - 2021-01-20 | 56.33 | 73.30 | 82.33 |
| 7 | RoBERTa-Large (Fine-tuned) | | 47.58 | 66.34 | 76.74 |
| 8 | BERT-Large (Fine-tuned) | | 43.68 | 66.41 | 72.87 |
| 9 | RoBERTa-Large (Zero-shot) | | 35.89 | 58.07 | 74.09 |
| 10 | BERT-Large (Zero-shot) | | 27.15 | 52.92 | 70.25 |
| 11 | RoBERTa-base (Zero-shot) | | 26.80 | 50.57 | 66.72 |
| 12 | BERT-base (Zero-shot) | | 25.30 | 48.70 | 64.84 |
| 13 | GPT-2 (Zero-shot) | | 24.76 | 44.28 | 62.40 |

NumerSense-Core

| Rank | Model | Submitted by (date) | Hit@1 | Hit@2 | Hit@3 |
|------|-------|---------------------|-------|-------|-------|
| | Human Performance | | 89.7 (closed-book), 96.3 (open-book) | N/A | N/A |
| 1 | T5-11B + GKP | University of Washington - 2021-9 | 79.24 | 89.93 | 94.17 |
| 2 | T5 1.1 Zero-Shot +digits | ISI Waltham - 2021-04 | 72.61 | 87.10 | 92.23 |
| 3 | T5-11B + IR | MOWGLI/USC INK - Jun Yan - 2021-01-10 | 70.41 | 84.81 | 90.99 |
| 4 | T5-11B | Stanford - Yuhui Zhang - 2021-01-08 | 70.23 | 83.57 | 90.11 |
| 5 | T5-11B (Closed-book QA) | Team Cosmic - Yizhong Wang - 2021-01-11 | 62.51 | 75.77 | 82.40 |
| 6 | RoBERTa + UnifiedQA (T5-3B) | MICS ISI - Dong-Ho Lee - 2021-01-20 | 60.87 | 76.33 | 84.54 |
| 7 | RoBERTa-Large (Fine-tuned) | | 54.22 | 69.53 | 78.97 |
| 8 | BERT-Large (Fine-tuned) | | 50.19 | 66.23 | 74.72 |
| 9 | RoBERTa-Large (Zero-shot) | | 46.11 | 66.08 | 79.42 |
| 10 | BERT-Large (Zero-shot) | | 37.54 | 62.10 | 76.86 |
| 11 | RoBERTa-base (Zero-shot) | | 33.39 | 58.83 | 71.91 |
| 12 | BERT-base (Zero-shot) | | 31.98 | 56.01 | 70.67 |
| 13 | GPT-2 (Zero-shot) | | 30.04 | 51.06 | 67.58 |

Citation

@inproceedings{lin2020numersense,
    title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
    author={Bill Yuchen Lin and Seyeon Lee and Rahul Khanna and Xiang Ren}, 
    booktitle={Proceedings of EMNLP},
    year={2020},
    note={to appear}
}