Abstracts Track 2023

Area 1 - Applications

Nr: 4

Generation of Synthetic Invoices for the Training of Machine Learning Models


Rolandas Gricius and Igoris Belovas

Abstract: In the state of the art of Key Information Extraction (KIE) from semi-structured documents, business invoices play an exemplary role: they are complex enough to merit scientific research, and progress on them brings practical benefits. Machine Learning methods form the basis of the modern approach to the problem, edging out traditional template-based and rule-based approaches. One of the main characteristics of the modern approach is its dependence on big annotated data sets. We know of several examples: the SROIE dataset has 1000 annotated images of English-language sales receipts (which are similar, but not identical, to invoices), without extracted text (so an OCR step is needed to obtain the text); the RVL-CDIP document dataset has a subset of 25 000 English invoices, but without annotations or extracted text; the ZUGFeRD dataset has approximately 100 invoices in English, German and French, fully annotated and with text extracted. Each of these data sets has one flaw or another that makes it unfit for training machine learning models. Moreover, our research focuses on KIE from Lithuanian-language invoices, and thus confronts the problem of limited Natural Language Processing (NLP) resources, with no public data sets at all. The crux of the problem is the sensitivity of the domain. The main obstacles to sharing such a data set are: privacy - invoices can potentially contain personal data protected by privacy regulations, such as the EU General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), China's Personal Information Protection Law (PIPL) and similar laws; trade secrets - for many companies, the counterparties and contents of an invoice constitute trade secrets and should not be published; variety - as the number of companies willing to make their invoices public is minimal, the variety of such a data set would be limited as well, hurting the ability of machine learning models to generalize.
An additional hurdle, common to many data sets used for supervised learning, is the significant amount of work needed to annotate the data, which grows with the size of the data set. Apart from trying to collect and annotate thousands of sufficiently varied invoices issued in the language sought (possibly other than English), another approach is to generate such a data set, including the annotations. There have been several attempts to implement such generators. However, their main shortcoming is that they can generate invoices in English or French only. The newest project, from Spain, generates Spanish-language invoices specific to the electricity market (regulation requires a large amount of electricity consumption data to be provided in the invoice). In this work, we have extended the open-source generator software by Belhadj et al. to generate Lithuanian-language invoices and to write the synthetic annotation texts (ground truth) and their positions in the document to a separate XML file. The positions of the blocks containing text data are randomized with respect to the invoice layout. We have implemented several major modifications to accommodate the specifics of the Lithuanian language and the use of different date formats. We have also taken into account the need to generate specifically Lithuanian company names and addresses. This dataset is being prepared for training and benchmarking machine learning models for KIE from invoices in the Lithuanian language.
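
As an illustration only, the generation step described above (randomized block positions within a layout, varied date formats, and ground truth written as XML) might be sketched in Python as follows. This is not the code of the generator by Belhadj et al. or of the extension described here. The layout regions, field names, XML element and attribute names, block size, and the sample company name and amount are all hypothetical, and the three date styles are assumed examples of formats that a Lithuanian invoice could use.

```python
import random
import xml.etree.ElementTree as ET
from datetime import date

# Genitive month names, as used in the long written Lithuanian date form.
LT_MONTHS_GENITIVE = [
    "sausio", "vasario", "kovo", "balandžio", "gegužės", "birželio",
    "liepos", "rugpjūčio", "rugsėjo", "spalio", "lapkričio", "gruodžio",
]

def format_lt_date(d, style):
    """Render a date in one of three assumed styles."""
    if style == "iso":
        return d.strftime("%Y-%m-%d")
    if style == "dotted":
        return d.strftime("%Y.%m.%d")
    if style == "long":
        return f"{d.year} m. {LT_MONTHS_GENITIVE[d.month - 1]} {d.day} d."
    raise ValueError(f"unknown date style: {style}")

# Hypothetical layout: each field may be placed anywhere inside its region,
# given as (x_min, y_min, x_max, y_max) in page pixels.
LAYOUT = {
    "seller_name": (50, 50, 400, 150),
    "invoice_date": (450, 50, 750, 150),
    "total": (450, 900, 750, 1000),
}

def place_block(region, width, height, rng):
    """Pick a random top-left corner so the block stays inside its region."""
    x_min, y_min, x_max, y_max = region
    x = rng.randint(x_min, x_max - width)
    y = rng.randint(y_min, y_max - height)
    return x, y, width, height

def build_annotations(fields, rng, block_size=(200, 30)):
    """Build an XML tree holding each field's text and its position."""
    root = ET.Element("invoice")
    for name, text in fields.items():
        x, y, w, h = place_block(LAYOUT[name], block_size[0], block_size[1], rng)
        element = ET.SubElement(
            root, "field",
            name=name, x=str(x), y=str(y), width=str(w), height=str(h),
        )
        element.text = text
    return root

rng = random.Random(0)  # fixed seed so the sketch is reproducible
fields = {
    "seller_name": 'UAB "Pavyzdys"',  # made-up company name
    "invoice_date": format_lt_date(
        date(2023, 5, 17), rng.choice(["iso", "dotted", "long"])
    ),
    "total": "121,00 EUR",  # made-up amount
}
root = build_annotations(fields, rng)
# ET.ElementTree(root).write(path, encoding="utf-8") would save this
# as a separate annotation file next to the rendered invoice.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keeping the annotation tree separate from the rendering code means the same field texts and coordinates can be used both to draw the invoice image and to serve as ground truth for training.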