Meet OTT-QA!

A large-scale open-domain question answering dataset over heterogeneous information on the web.

Why OTT-QA?



HIGH-QUALITY

Annotated via Mechanical Turk;
Strict quality control.



LARGE-SCALE

400K Wikipedia Tables;
5M hyperlinked passages;
45K natural questions.



HYBRID

Semantic Understanding;
Symbolic Reasoning.



Retrieval

Retrieve over the whole of Wikipedia.


Introduction


Existing open-domain question answering datasets focus on homogeneous textual information. However, such an assumption can be sub-optimal, as real-world knowledge is distributed over heterogeneous forms. In this paper, we simulate a more challenging setting where we do not know in advance whether the answer to a given open question appears in a passage or a table. Furthermore, a question might require aggregating a table and multiple passages to answer. Thus, the model needs to scan the web to retrieve the most relevant evidence in both forms for question answering.

Explore


We have designed an interface for viewing the data. Please click here to explore the dataset and have fun!

Example


In this task, you are given a question and the whole Wikipedia corpus of tables and text. The goal is to identify the provenance within this large corpus and predict the final answer span:
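To make the task format concrete, here is a minimal sketch of what a single instance might look like. All field names, identifiers, and the question/answer content below are illustrative assumptions, not the dataset's actual schema; consult the GitHub repository for the real data format.

```python
# A hypothetical OTT-QA-style instance: a question, its answer span, and the
# provenance (one table plus hyperlinked passages) a model must retrieve from
# the full Wikipedia corpus. All names/values here are made up for illustration.
instance = {
    "question": "Which stadium hosts the team described in the linked passage?",
    "answer": "Camp Nou",  # the final answer span to predict
    "provenance": {
        "table_id": "table_0001",              # hypothetical table identifier
        "passage_ids": ["passage_a", "passage_b"],  # hypothetical passage identifiers
    },
}

def gold_answer(inst: dict) -> str:
    """Return the gold answer span for an instance."""
    return inst["answer"]

print(gold_answer(instance))
```

The key point the sketch illustrates is that, unlike single-source QA, the supporting evidence spans both a table and multiple passages, and none of it is given up front.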

Download (Train/Test Data, Code)


All the code and data are provided on GitHub. The leaderboard is hosted on CodaLab.


Contact



Have any questions or suggestions? Feel free to contact us!