[Data Analysis] The SNLI Dataset

The Stanford natural language inference dataset, in full The Stanford Natural Language Inference (SNLI) Corpus.

https://nlp.stanford.edu/projects/snli/

Introduction

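SNLI 1.0 contains 570,000 human-written English sentence pairs. For each pair, the logical relation between the premise and the hypothesis was manually labeled with one of three labels:

  • entailment: the premise implies the hypothesis, \(p \Rightarrow h \)
  • contradiction: the premise and the hypothesis conflict, \(p \bot h \)
  • neutral: the two are unrelated, neither entailed nor contradicted, \(p \nLeftrightarrow h \)

The corpus is used for natural language inference (NLI), also known as recognizing textual entailment (RTE).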

Data

Sample data

Text | Judgments | Hypothesis
A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping
An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor.
A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road.
A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport.
A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella.

The first line of snli_1.0/snli_1.0_train.jsonl (pretty-printed here; in the file each record occupies a single line):

{
"annotator_labels":[
"neutral"
],
"captionID":"3416050480.jpg#4",
"gold_label":"neutral",
"pairID":"3416050480.jpg#4r1n",
"sentence1":"A person on a horse jumps over a broken down airplane.",
"sentence1_binary_parse":"( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )",
"sentence1_parse":"(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))",
"sentence2":"A person is training his horse for a competition.",
"sentence2_binary_parse":"( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )",
"sentence2_parse":"(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"
}
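A minimal sketch of reading such records with Python's standard json module. The helper names parse_snli_line and read_snli are hypothetical, and sample_line is the record above trimmed to the fields used here. Skipping records whose gold_label is "-" (pairs without a majority label) is a common convention and an assumption of this sketch.

```python
import json

# One record in the layout shown above, trimmed to the fields used here.
sample_line = (
    '{"annotator_labels": ["neutral"], "gold_label": "neutral", '
    '"sentence1": "A person on a horse jumps over a broken down airplane.", '
    '"sentence2": "A person is training his horse for a competition."}'
)


def parse_snli_line(line):
    """Return (premise, hypothesis, gold_label) from one JSON line."""
    record = json.loads(line)
    return record["sentence1"], record["sentence2"], record["gold_label"]


def read_snli(path):
    """Yield (premise, hypothesis, label) triples from a .jsonl file.

    Assumption: records whose gold_label is "-" are skipped.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            premise, hypothesis, label = parse_snli_line(line)
            if label != "-":
                yield premise, hypothesis, label


premise, hypothesis, label = parse_snli_line(sample_line)
print(label)
```

With the corpus unpacked locally, read_snli("snli_1.0/snli_1.0_train.jsonl") would iterate over the training pairs.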

Questions

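Q: Why are there several judgments?
A: The final (gold) label combines the judgments of five annotators by majority vote. A pair with no majority label gets the gold label "-".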
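The majority vote can be sketched as follows. The function name gold_label_from is hypothetical, and treating "majority" as a strict majority of the available labels is an assumption of this sketch, chosen so that the single-label training record shown earlier also works.

```python
from collections import Counter


def gold_label_from(annotator_labels):
    """Return the strict-majority label, or "-" when no label has a majority."""
    counts = Counter(annotator_labels)
    label, votes = counts.most_common(1)[0]
    # Strict majority: more than half of all judgments (3 of 5, or 1 of 1).
    return label if votes * 2 > len(annotator_labels) else "-"


# "N N E C N" from the sample table: neutral wins with 3 of 5 votes.
print(gold_label_from(
    ["neutral", "neutral", "entailment", "contradiction", "neutral"]
))
```

A 2-2-1 split has no majority, so the sketch returns "-" for it.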

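Each sentence also comes with two parse-tree representations: a binary parse (sentence1_binary_parse, sentence2_binary_parse) and a labeled constituency parse (sentence1_parse, sentence2_parse).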

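The way natural language inference (NLI) data is constructed leaves a series of annotation artifacts, patterns introduced by the human writers. A direct consequence is that a model can decide whether a hypothesis is entailment, neutral, or contradiction with 67% accuracy without ever seeing the premise.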
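To make the idea of a hypothesis-only model concrete, here is a toy sketch that votes on a label from word/label co-occurrence counts in the hypotheses alone. It is not the model behind the 67% figure. The function names fit and predict are hypothetical, and the six training hypotheses are taken from the table in the next section.

```python
from collections import Counter, defaultdict

# (hypothesis, label) pairs only; the premise is deliberately never used.
train = [
    ("a person is training his horse for a competition.", "neutral"),
    ("a person is at a diner, ordering an omelette.", "contradiction"),
    ("a person is outdoors, on a horse.", "entailment"),
    ("they are smiling at their parents", "neutral"),
    ("there are children present", "entailment"),
    ("the kids are frowning", "contradiction"),
]


def fit(examples):
    """Count, for every word, how often it appears under each label."""
    word_label = defaultdict(Counter)
    for hypothesis, label in examples:
        for word in set(hypothesis.split()):
            word_label[word][label] += 1
    return word_label


def predict(word_label, hypothesis):
    """Sum the label counts of the hypothesis words and return the top label."""
    votes = Counter()
    for word in set(hypothesis.split()):
        votes.update(word_label.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


model = fit(train)
print(predict(model, "the kids are frowning"))
```

On the full corpus the same idea would need a held-out split and a proper classifier; the point here is only that no premise enters the prediction.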

Data statistics & analysis

premise | hypothesis | label | note
a person on a horse jumps over a broken down airplane. | a person is training his horse for a competition. | neutral | the premise says nothing about training or a competition
a person on a horse jumps over a broken down airplane. | a person is at a diner, ordering an omelette. | contradiction | does recognizing this relation require external knowledge?
a person on a horse jumps over a broken down airplane. | a person is outdoors, on a horse. | entailment |
children smiling and waving at camera | they are smiling at their parents | neutral |
children smiling and waving at camera | there are children present | entailment | "there are" is a stop-word phrase and would ideally be ignored, yet it must still be the opposite of "there are not"
children smiling and waving at camera | the kids are frowning | contradiction |
a boy is jumping on skateboard in the middle of a red bridge. | the boy skates down the sidewalk. | contradiction |
a boy is jumping on skateboard in the middle of a red bridge. | the boy does a skateboarding trick. | entailment |
a boy is jumping on skateboard in the middle of a red bridge. | the boy is wearing safety equipment. | neutral |
an older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. | an older man drinks his juice as he waits for his daughter to get off work. | neutral |
an older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. | a boy flips a burger. | contradiction |
an older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. | an elderly man sits in a small shop. | neutral |
two blond women are hugging one another. | some women are hugging on vacation. | neutral |
two blond women are hugging one another. | the women are sleeping. | contradiction |
two blond women are hugging one another. | there are women showing affection. | entailment |
a few people in a restaurant setting, one of them is drinking orange juice. | the people are eating omelettes. | neutral |
a few people in a restaurant setting, one of them is drinking orange juice. | the people are sitting at desks in school. | contradiction |
a few people in a restaurant setting, one of them is drinking orange juice. | the diners are at a restaurant. | entailment |
an older man is drinking orange juice at a restaurant. | a man is drinking juice. | entailment |
an older man is drinking orange juice at a restaurant. | two women are at a restaurant drinking wine. | contradiction |
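  • Premises tend to be specific, while hypotheses are shorter and more abstract (for example, "males" generalized to "men", or an apple generalized to a broader category).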
  • Contradiction pairs often contain antonyms, for example "up" and "down".
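The first observation can be checked with a quick length comparison. The sketch below uses three premise/hypothesis pairs from the table above and whitespace tokenization, which is a simplifying assumption; the variable names are illustrative only.

```python
from statistics import mean

# (premise, hypothesis) pairs copied from the table above.
pairs = [
    ("a person on a horse jumps over a broken down airplane.",
     "a person is outdoors, on a horse."),
    ("children smiling and waving at camera",
     "there are children present"),
    ("two blond women are hugging one another.",
     "the women are sleeping."),
]

# Whitespace tokenization: punctuation stays attached to the words.
premise_lengths = [len(premise.split()) for premise, _ in pairs]
hypothesis_lengths = [len(hypothesis.split()) for _, hypothesis in pairs]

mean_premise = mean(premise_lengths)
mean_hypothesis = mean(hypothesis_lengths)
print(mean_premise, mean_hypothesis)  # 8 5
```

Three pairs prove nothing by themselves; running the same comparison over the whole training file would give the actual statistic.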

Reading the data