| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 626 | - | 347 | Receipt | English | | Entity Extraction | Entity F1-score |
SROIE is a dataset for the 2019 ICDAR Robust Reading Challenge on Scanned Receipts OCR and Information Extraction competition. It contains 973 samples, 626 for training and 347 for testing. Each receipt contains four kinds of key entities: `Company`, `Address`, `Date`, and `Total`.
Line-level OCR results and the texts of the key entities are available for each sample. However, it is important to note that the two annotations are not aligned. To perform Entity Extraction with token-tagging approaches like LayoutLM, a tag is required for each word; these tags can be obtained either through rule-based methods (a sketch follows below) or by manually re-labeling the data.
Indeed, the quality of data annotation plays a crucial role in Entity Extraction performance. We conduct experiments with ViBERTgrid: when the model is trained with high-quality annotations (re-labelled manually), the entity F1 can reach 97+, while training with poor-quality annotations (rule-based matching) results in an entity F1 of about 60.
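As a rough illustration of the rule-based route, the sketch below assigns BIO tags by exact string matching between the annotated entity values and the OCR words. It is a hypothetical helper, not part of the official SROIE toolkit, and its brittleness against OCR errors is exactly why rule-based tags end up noisier than manual re-labeling.

```python
def rule_based_bio_tags(words, entities):
    """Assign BIO tags to OCR words by exact string matching.

    words    -- list of OCR words in reading order
    entities -- dict mapping an entity type (e.g. "company") to its
                annotated string

    Hypothetical sketch: real rule-based labeling needs fuzzy matching
    to cope with OCR errors, which is why its tags are noisy.
    """
    tags = ["O"] * len(words)
    for etype, value in entities.items():
        target = value.split()
        n = len(target)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                tags[i] = f"B-{etype.upper()}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{etype.upper()}"
                break  # tag only the first occurrence
    return tags

# Toy receipt covering two of the four SROIE entity types.
words = ["STARBUCKS", "COFFEE", "TOTAL", "12.50"]
entities = {"company": "STARBUCKS COFFEE", "total": "12.50"}
print(rule_based_bio_tags(words, entities))
# ['B-COMPANY', 'I-COMPANY', 'O', 'B-TOTAL']
```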
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 800 | 100 | 100 | Receipt | English | | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
| | | | | | | Document Structure Parsing | Structured Field F1-score, TED Acc |
CORD is an English receipt dataset proposed by Clova-AI. 1,000 samples are currently publicly available: 800 for training, 100 for validation, and 100 for testing. The receipt images were captured by cameras in real scenes, so they may contain interference such as paper bending and background noise. However, the dataset includes high-quality annotations, with key labels for each word and linkings between entities. It encompasses four main categories of key information, which can be further divided into 30 sub-key fields. Notably, the entities in CORD are hierarchically related, making the task of extracting all the structured fields particularly challenging for models.
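For intuition, a parse of one CORD receipt can be viewed as a nested structure in which repeated line items group several sub-key fields. The snippet below is illustrative only: the field names follow CORD's published schema (`menu.nm`, `menu.cnt`, `menu.price`, ...), while the values are invented.

```python
# Hypothetical parse of one CORD-style receipt; values are invented.
parse = {
    "menu": [  # repeated group: one entry per line item
        {"nm": "ICED LATTE", "cnt": "2", "price": "9.00"},
        {"nm": "BAGEL", "cnt": "1", "price": "3.50"},
    ],
    "sub_total": {"subtotal_price": "12.50", "tax_price": "1.25"},
    "total": {"total_price": "13.75"},
}
```

Roughly speaking, the Structured Field F1-score compares the leaves of such trees field by field, while TED Acc scores the tree edit distance between the predicted and reference structures.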
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 149 | - | 50 | Forms | English | | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
FUNSD is a dataset for Text Detection, Optical Character Recognition, Spatial Layout Analysis, and Form Understanding. It consists of 199 fully annotated forms, containing a total of 31,485 words, 9,707 semantic entities, and 5,304 relations. For each text segment and word, the dataset provides the corresponding OCR result. The annotations also include the category of each paragraph and the linkings between entities.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 149*7 | - | 50*7 | Forms | Chinese, Japanese, Spanish, French, Italian, German, Portuguese | | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
XFUND is a multilingual form understanding benchmark that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). It is an extension of the FUNSD dataset; the annotation scheme and evaluation metrics are the same as FUNSD's.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1183 | - | 311 | Paper Head | Chinese | | Entity Extraction | Entity F1-score |
The EPHOIE dataset comprises 1,494 images collected and scanned from real examination papers of various schools in China. The authors cropped the paper-head regions, which contain all the key information. The texts consist of both handwritten and printed Chinese characters, arranged horizontally or in arbitrary quadrilateral shapes. The dataset also features complex layouts and noisy backgrounds, which contribute to the generalization capability of models trained on it. In total, it covers 11 key categories, such as name, class, and student ID. Every character is annotated, so token-classification models can be applied directly with the original labels.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2989 | - | 1200 | Receipt | Chinese, English | | Structure Parsing | Entity Matching Score |
The CER-VIR dataset contains receipts in both Chinese and English. Each sample contains key information including company, date, total, tax and items. The item field within each sample can be further divided into three subkeys: item name, item count, and item unit price. The task associated with this dataset involves extracting all the key fields from a given sample, including all the subkeys within the item field.
To ensure consistency, the extracted results must be properly formatted; for instance, date entities must be given in the YYYY-MM-DD format. The dataset also includes OCR results for reference. Note that the annotations of the key entities are provided as formatted strings, which may differ from the actual content displayed in the image. This makes the task significantly more challenging than other existing benchmarks in the field of Visual Information Extraction.
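As an example of the required normalization, the minimal sketch below converts a few common date spellings to YYYY-MM-DD; it is a hypothetical helper, and the official formatting rules may cover many more cases.

```python
from datetime import datetime

def normalize_date(raw: str):
    """Normalize a raw date string to YYYY-MM-DD.

    Minimal sketch: only a handful of formats are tried; a real
    submission would need to cover many more receipt-specific variants.
    """
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y年%m月%d日"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unrecognized format

print(normalize_date("July 1, 2022"))  # 2022-07-01
print(normalize_date("2023年1月5日"))   # 2023-01-05
```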
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 600 | - | 400 | Receipt, Bills | Chinese, English | | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Linking F1-score |
There are 1,000 images in SIBR: 600 Chinese invoices, 300 English bills of entry, and 100 bilingual receipts. SIBR is well annotated, with 71,227 entity-level boxes and 39,004 links. In comparison to other real-scene datasets like SROIE and EPHOIE, SIBR offers a wider range of appearances and more diverse structures.
The document images in SIBR pose additional challenges, as they are sourced from real-world applications: severe noise, uneven illumination, image deformation, printing shift, and complicated links. Similar to FUNSD, SIBR annotates 3 kinds of key information: `question`, `answer`, and `header`. It is worth noting that an entity spanning multiple lines is represented by its text segments and the intra-links between them; models are required to extract the full entity given only the text-segment annotations.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 271440 | 30160 | 400 | Train Ticket | Chinese | | Entity Extraction | Mean Entity Accuracy |
| 88200 | 9800 | 2000 | Passport | | | | Mean Entity Accuracy |
| 178200 | 19800 | 2000 | Business Card | | | | Entity F1-score |
The EATEN dataset covers three scenarios: Train Ticket, Passport, and Business Card.
The train-ticket subset includes a total of 2k real images and 300k synthetic images. The real images were shot in a finance department under inconsistent lighting conditions, orientations, background noise, and imaging distortions. Train tickets contain 8 key categories.
The passport subset includes a total of 100k synthetic images with 7 key categories.
The business-card subset contains 200k synthetic images with 10 key categories. The positions of the key entities are not fixed, and some entities may be absent, which makes applying VIE challenging.
The Mean Entity Accuracy (mEA) is calculated as

$$
mEA = \frac{1}{I}\sum_{i=0}^{I-1}\mathbb{I}\left(y^{i} = g^{i}\right)
$$

where $y^{i}$ denotes the predicted content of the $i$-th entity, $g^{i}$ the corresponding ground truth, $\mathbb{I}(\cdot)$ the indicator function, and $I$ the number of entities in the sample.
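In code, the metric reduces to exact-match accuracy over a sample's entities; a minimal sketch:

```python
def mean_entity_accuracy(preds, golds):
    """Mean Entity Accuracy: the fraction of entities whose predicted
    string exactly matches the ground truth.

    preds, golds -- parallel lists of entity strings for one sample.
    """
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# A ticket with one of its two entities predicted correctly scores 0.5.
print(mean_entity_accuracy(["G512", "2019-05-01"], ["G512", "2019-05-02"]))
```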
The WildReceipt dataset is introduced by the mmocr repository, which follows the Apache License 2.0.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1740 | - | 472 | Receipt | English | | Entity Extraction | Entity F1-score |
| | | | | | | Entity Linking | Node F1-score & Edge F1-score |
The WildReceipt dataset has two versions: the CloseSet and the OpenSet.
The CloseSet divides text boxes into 26 categories: 12 key-value pairs of fine-grained key-information categories, such as (`Prod_item_value`, `Prod_item_key`), (`Prod_price_value`, `Prod_price_key`), and (`Tax_value`, `Tax_key`), plus two "do not care" categories, `Ignore` and `Others`. The objective of the CloseSet is Entity Extraction.
The OpenSet has only 4 possible categories: `background`, `key`, `value`, and `others`. The connectivity between nodes is annotated as edge labels: if a key node and a value node share the same edge label, they are connected by a valid edge. The objective of the OpenSet is to extract key-value pairs from the given sample.
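Under the OpenSet scheme, pair extraction therefore reduces to grouping key and value nodes that share an edge label. The decoding sketch below mirrors that rule; the tuple layout is invented for illustration.

```python
def decode_pairs(nodes):
    """Group OpenSet nodes into key-value pairs via shared edge labels.

    nodes -- list of (text, category, edge_label) tuples, where category
             is one of "background", "key", "value", "others".
    """
    keys, values = {}, {}
    for text, category, edge in nodes:
        if category == "key":
            keys.setdefault(edge, []).append(text)
        elif category == "value":
            values.setdefault(edge, []).append(text)
    # A key node and a value node with the same edge label form a pair.
    return [(k, v) for edge in keys.keys() & values.keys()
            for k in keys[edge] for v in values[edge]]

nodes = [("Total:", "key", 3), ("$13.75", "value", 3), ("Thanks!", "others", -1)]
print(decode_pairs(nodes))  # [('Total:', '$13.75')]
```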
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 254 | 83 | 203 | Contracts | English | | Entity Extraction | Entity F1-score |
| 1729 | 440 | 609 | Financial Reports | | | | |
The Kleister dataset contains two subsets: NDA and Charity.
The goal of the NDA task is to extract key information about the involved `parties`, `jurisdiction`, `contract term`, and `effective date` from NDAs (Non-Disclosure Agreements). The subset contains 540 documents with 3,229 pages.
The goal of the Charity task is to retrieve 8 kinds of key information from PDF reports published by British charities, including the charity's address (but not other addresses), charity number, charity name, and its annual income and spending in GBP (British Pounds). The subset contains 2,788 financial reports with 61,643 pages in total.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10/50/100/200 | - | 300 | Registration Forms | English | | Entity Extraction | Type-Aware Matching F1-score |
| 10/50/100/200 | - | 300 | Political Advertisements | | | Entity Extraction & Document Parsing | |
This benchmark includes two datasets: Ad-buy Forms and Registration Forms. These documents consist of structured data with a comprehensive schema, including nested repeated fields. They have complex layouts that clearly distinguish them from long text documents and incorporate a variety of templates. Additionally, the OCR results are of high quality. The authors have provided token-level annotations for the ground truth, ensuring there is no ambiguity when mapping the annotations to the input text.
The Registration Forms subset contains 6 types of key fields: `file_date`, `foreign_principal_name`, `registrant_name`, `registration_ID`, `signer_name`, and `signer_title`. The Ad-buy Forms subset contains 9 key fields: `advertiser`, `agency`, `contract_ID`, `flight_start_date`, `flight_end_date`, `gross_amount`, `product`, `TV_address`, and `property`. Furthermore, nested fields comprising `line_item` (`description`, `start_date`, `end_date`, `sub_price`) are also annotated in the Ad-buy Forms subset.
It is common practice to compare the extracted entity with the ground truth using strict string matching. However, such a simple approach may produce unreasonable results in many scenarios: when extracting the total price from a receipt, "$ 40,000" does not match "40,000" because of the missing dollar sign, and "July 1, 2022" does not match "07/01/2022". Dates may also appear in different formats in different parts of the document, and a model should not be arbitrarily penalized for picking the wrong instance. VRDU therefore implements a different matching function for each entity name, based on the type associated with that entity: the evaluation scripts convert all price values into a numeric type before comparison, while date strings are parsed and compared with a standard date-equality function.
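The sketch below illustrates such type-aware matching for prices and dates, in the spirit of the description above; the helper names are invented, and the actual VRDU scripts may differ in detail.

```python
import re
from datetime import datetime

def price_equal(a: str, b: str) -> bool:
    """Compare prices numerically, ignoring currency signs and commas."""
    to_num = lambda s: float(re.sub(r"[^\d.]", "", s))
    return to_num(a) == to_num(b)

def date_equal(a: str, b: str) -> bool:
    """Compare dates after parsing a few common formats."""
    def parse(s):
        for fmt in ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(s.strip(), fmt).date()
            except ValueError:
                continue
        return None
    return parse(a) is not None and parse(a) == parse(b)

print(price_equal("$ 40,000", "40,000"))         # True
print(date_equal("July 1, 2022", "07/01/2022"))  # True
```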
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2250 | - | 750 | Product Nutrition Tables | English | | Entity Extraction | Entity F1-score |
The images in POIE contain Nutrition Facts labels from various real-world commodities, with larger variances in layout, severe distortion, noisier backgrounds, and more entity types than existing datasets. POIE includes images with variable appearances and styles (structured, semi-structured, and unstructured), complex layouts, and backgrounds distorted by folds, bends, deformations, and perspective. POIE covers 21 entity types, and a few entities appear in different forms, which is common in the wild and quite challenging for VIE. Besides, each entity often contains multiple words and appears at most once in each image. These properties can help enhance the robustness and generalization of VIE models, preparing them for more challenging applications.
| # Train | # Validate | # Test | Type | Language | Access Link | Task | Evaluation Metric |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 48280 | - | 5255 | Mixed Form-like Documents | English | | Entity Extraction | ED-IOU-based Entity F1-score |
| | | | | | | Pair Extraction | IOU-based Pair F1-score, ED-based Pair F1-score, ED-IOU-based Pair F1-score |
KVP10k is the largest dataset available for key-value pair (KVP) extraction. It features a broad array of keys and precise annotations, with text labeled as keys or values, providing a solid basis for training and evaluation. The authors also propose a novel metric for evaluating pair-extraction performance.
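The metric details are defined in the KVP10k paper; as a rough, hypothetical illustration of how spatial (IoU) and textual (edit-distance) criteria can be combined when matching a predicted key-value pair against the ground truth, consider the sketch below. The thresholds and the data layout are placeholders, not the official values.

```python
from difflib import SequenceMatcher

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def text_sim(a, b):
    """Text similarity; difflib's ratio stands in for 1 - normalized ED."""
    return SequenceMatcher(None, a, b).ratio()

def pair_match(pred, gold, iou_thr=0.5, sim_thr=0.9):
    """A predicted pair matches a gold pair when both its key and its
    value agree spatially (IoU) and textually (edit-distance ratio)."""
    return all(
        iou(p_box, g_box) >= iou_thr and text_sim(p_txt, g_txt) >= sim_thr
        for (p_txt, p_box), (g_txt, g_box) in zip(pred, gold)
    )

pred = (("Total:", (10, 10, 60, 20)), ("13.75", (70, 10, 110, 20)))
gold = (("Total:", (11, 10, 59, 21)), ("13.75", (70, 10, 110, 20)))
print(pair_match(pred, gold))  # True
```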