Abstract: |
In Key Information Extraction (KIE) from semi-structured documents, business invoices are an
exemplary case: complex enough to merit scientific research and of clear practical benefit.
Machine learning methods form the basis of the modern approach to the problem, displacing
traditional template-based and rule-based approaches.
One of the main characteristics of the modern approach is its dependence on large annotated data
sets. Several are known: the SROIE dataset has 1000 annotated images of English-language sales
receipts (which are similar, but not identical, to invoices), without extracted text (so an OCR step
is needed to get the text); the RVL-CDIP document dataset has a subset of 25 000 English invoices,
but with neither annotations nor extracted text; the ZUGFeRD dataset has approximately 100 invoices
in English, German and French, fully annotated and with text extracted.
Each of the aforementioned data sets has one or another flaw that makes it unfit for training
machine learning models. Moreover, our research focuses on KIE from Lithuanian-language invoices,
and so confronts the problem of limited Natural Language Processing (NLP) resources, with no public
data sets at all. The crux of the problem is the sensitivity of the domain.
The main obstacles to sharing such a data set are: privacy - invoices may contain personal data
protected by privacy regulations, such as the EU General Data Protection Regulation (GDPR), the
California Consumer Privacy Act (CCPA), China's Personal Information Protection Law (PIPL) and
similar laws; trade secrets - for many companies, the counterparties and contents of an invoice
constitute trade secrets and should not be published; variety - as the number of companies willing
to make their invoices public is minimal, the variety of such a data set would be limited as well,
hurting the ability of machine learning models to generalize. An additional hurdle, common to many
data sets used for supervised learning, is the significant amount of work needed to annotate the
data, which grows with the size of the data set.
Apart from trying to collect and annotate thousands of sufficiently varied invoices issued in the
target language (possibly other than English), another approach is to generate such a data set,
including the annotations. There have been several attempts to implement such generators. However,
their main shortcoming is that they can generate invoices in English or French only. The newest, a
Spanish project, generates Spanish-language invoices specific to the electricity market (where an
invoice must include a large amount of data on electricity consumption required by regulation).
In this work, we have extended the open-source generator software by Belhadj et al. to generate
Lithuanian-language invoices and to write the synthetic annotation texts (ground truth) and their
positions in the document to a separate XML file. The positions of the blocks containing text data
are randomized with respect to the invoice layout. We have implemented several major modifications
to accommodate the specifics of the Lithuanian language and the use of different date formats. We
have also taken into account the need to generate specifically Lithuanian company names and
addresses. This dataset is being prepared for training and benchmarking machine learning models for
KIE from invoices in the Lithuanian language. |
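
The generation step described above (randomized block positions, Lithuanian date formats, ground truth written to XML) can be illustrated with a minimal Python sketch. It is not the code of the generator by Belhadj et al.: the function names, the layout regions, the `invoice`/`field` element names and the example company name are all hypothetical, chosen only for illustration. The two date styles shown (ISO and the long form with a genitive month name) are assumed to be among the formats of interest.

```python
import random
import xml.etree.ElementTree as ET
from datetime import date

# Genitive month names used in the long Lithuanian date form.
LT_MONTHS = ["sausio", "vasario", "kovo", "balandžio", "gegužės", "birželio",
             "liepos", "rugpjūčio", "rugsėjo", "spalio", "lapkričio", "gruodžio"]


def format_lt_date(d, style):
    """Render a date in one of two Lithuanian styles (hypothetical helper)."""
    if style == "iso":
        return d.isoformat()                      # e.g. 2023-01-05
    if style == "long":
        return f"{d.year} m. {LT_MONTHS[d.month - 1]} {d.day} d."
    raise ValueError(f"unknown date style: {style}")


def place_block(region, width, height, rng):
    """Pick a random top-left corner so the block stays inside its layout region.

    region is (x0, y0, x1, y1); if the block is larger than the region,
    it is pinned to the region's top-left corner.
    """
    x0, y0, x1, y1 = region
    x = rng.randint(x0, max(x0, x1 - width))
    y = rng.randint(y0, max(y0, y1 - height))
    return x, y


def build_annotation(fields, rng):
    """Build an XML tree holding each field's ground-truth text and position."""
    root = ET.Element("invoice")                  # hypothetical schema
    for name, text, region, (width, height) in fields:
        x, y = place_block(region, width, height, rng)
        el = ET.SubElement(root, "field", name=name, x=str(x), y=str(y),
                           width=str(width), height=str(height))
        el.text = text
    return root


# Example: two text blocks, each confined to its own region of the page.
rng = random.Random(0)                            # seeded for reproducibility
fields = [
    ("date", format_lt_date(date(2023, 1, 5), "long"), (400, 40, 580, 100), (150, 20)),
    ("seller", 'UAB "Pavyzdys"', (20, 40, 300, 120), (200, 20)),
]
xml_text = ET.tostring(build_annotation(fields, rng), encoding="unicode")
```

Keeping the annotation in a separate XML file, rather than embedding it in the rendered invoice, lets the same ground truth be paired with any rendering of the document (image or PDF).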