Extracting structure from text documents based on machine learning
This study is devoted to a method that facilitates the task of extracting structure from the text documents using an artificial neural network. The method consists of data preparation, building and training the model and results evaluation. Data preparation includes collecting corpora of documents...
Date: 2023
Authors: Kudim, K.A.; Proskudina, G.Yu.
Format: Article
Language: English
Published: Institute of Software Systems NAS of Ukraine, 2023
Online access: https://pp.isofts.kiev.ua/index.php/ojs1/article/view/517
Journal title: Problems in programming
Description:
This study is devoted to a method that facilitates the task of extracting structure from text documents using an artificial neural network. The method consists of data preparation, building and training the model, and results evaluation. Data preparation includes collecting a corpus of documents, converting a variety of file formats into plain text, and manually labeling the structure of each document. The documents are then split into tokens and into paragraphs. The text paragraphs are represented as feature vectors that provide input to the neural network. The model is trained and validated on the selected data subsets. An evaluation of the trained model's results is presented. The final performance is calculated per label using precision, recall, and F1 measures, together with the overall average. The trained model can be used to extract sections of documents with a similar structure.
Problems in programming 2022; 3-4: 154-160
Models and tools of database and knowledge base systems
UDC 004.82 https://doi.org/10.15407/pp2022.03-04.154
EXTRACTING STRUCTURE FROM TEXT DOCUMENTS
BASED ON MACHINE LEARNING
Kuzma Kudim, Galyna Proskudina
This study presents a method that facilitates extracting structure from text documents using an artificial neural network.
The method requires a set of manually labeled documents to train the network. The trained model can be used to extract
sections of documents with a similar structure.
Keywords: natural language processing, information extraction, machine learning, neural network.
The study is devoted to a method that solves the problem of automatically extracting structure from weakly structured text documents using an artificial neural network. The method requires a manually labeled set of documents to train the network. The trained model can be used to extract sections of documents with a similar structure.
Keywords: natural language processing, information extraction, machine learning, neural networks.
Introduction
Many text documents have rich visual formatting that is easy for a human to read and understand but is not intended for automatic processing. Examples are scientific papers, legal documents, and books. All of them have an implicit logical structure, such as a title page with title and author, a publisher's imprint, chapters, and references. If this logical structure is made explicit, it can be processed automatically and then used either as metadata describing the document or as input for further fine-grained information extraction.
Here we describe a method that facilitates extracting structure from text documents using an artificial neural network. The method requires a set of manually labeled documents to train the network. The trained model can be used to extract sections of documents with a similar structure.
We previously described two other methods of data extraction from semi-structured text documents: one based on detecting patterns using regular expressions and another based on linguistic rules [1, 2]. Both methods require special skills to set them up for a particular type of document and to update the system when the structure changes. The machine learning method described here has the benefit of not requiring programming skills to use. The initial setup requires only an accurately labeled set of documents, and the labeling can be done by anyone with a basic understanding of the target document structure.
Overview
The paper consists of three main sections, as follows.
First, data must be prepared to train, validate, and evaluate the model. Data preparation includes collecting a corpus of documents, converting a variety of file formats into plain text, and manually labeling the structure of each document. Finally, the dataset is split into three subsets for model training, validation, and testing in a 70/15/15 ratio.
Building and training the model is the central part of the work. Each document is split into tokens and then into paragraphs. The text paragraphs are represented as feature vectors that provide input to a neural network consisting of three fully connected layers. The model is trained and validated on the selected data subsets.
Once the trained model shows a good F1 score on the validation dataset for the selected features, the results are evaluated on previously unseen data, i.e. the test dataset. The final performance is calculated per label using precision, recall, and F1 measures, together with the overall average.
Data preparation
Corpora. A subset of the thesis corpus of the V.I. Vernadsky National Library of Ukraine is used as the dataset. The whole corpus consists of nearly 65000 documents. A subset of 100 theses is selected and split into 70 documents for the training set, 15 for the validation set, and 15 for the test set used in the final evaluation.
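The 70/15/15 split can be sketched as follows. The paper does not say how documents were assigned to subsets, so the seeded random shuffle and the function name `split_dataset` are assumptions made only for illustration.

```python
import random


def split_dataset(doc_ids, n_train=70, n_val=15, n_test=15, seed=0):
    """Split document ids into train/validation/test subsets.

    Assumption: documents are assigned at random; the seed only makes
    the illustration reproducible.
    """
    doc_ids = list(doc_ids)
    if n_train + n_val + n_test != len(doc_ids):
        raise ValueError("subset sizes must add up to the number of documents")
    random.Random(seed).shuffle(doc_ids)
    train = doc_ids[:n_train]
    val = doc_ids[n_train:n_train + n_val]
    test = doc_ids[n_train + n_val:]
    return train, val, test
```

Splitting by whole documents, rather than by paragraphs, keeps every paragraph of a test document unseen during training.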
Conversion to plain text. The selected documents are in doc and rtf formats. As a preliminary step, they are converted to plain text using LibreOffice (https://www.libreoffice.org) from the command line as follows:
soffice --headless --convert-to txt --outdir out_dir in_file
The output text files are in UTF-8 encoding with a BOM signature at the start of the file, so the first three bytes of each file are additionally removed.
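A minimal sketch of the BOM removal step in Python follows. The helper names are hypothetical; unlike an unconditional removal of three bytes, this version checks that the file really starts with the UTF-8 BOM before dropping it.

```python
import codecs
from pathlib import Path


def strip_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: Path) -> None:
    """Rewrite a converted text file in place without its BOM."""
    path.write_bytes(strip_bom(path.read_bytes()))
```

Checking for the signature makes the step safe to run twice on the same output directory.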
© K.A. Kudim, G.Yu. Proskudina, 2022
ISSN 1727-4907. Problems in programming. 2022. No. 3-4. Special issue
Labeling. Our goal is to select top-level sections of the document that are potentially useful for further information extraction. That means, on the one hand, we are not interested in the main thematic content of the thesis and, on the other hand, we do not consider the fine-grained data contained deeper in each section at this stage. Factual data extraction can be the next step after the larger document sections are successfully extracted.
The 19 labels shown in Table 1 are selected to reflect the desired top-level logical structure of the thesis document. Each label covers a whole section of the document, although sections can differ considerably in size and inner complexity. For example, a section labeled SPEC covers the speciality's numeric code and name, or possibly a list of such records. Another section, labeled PUBLICATIONS, includes all listed publications as one section of the document. Fine-grained information extraction is out of the scope of the current work.
The special label O is used internally to represent the absence of any specific label.
Table 1. Structural labels for thesis document
Label         Document section
MAIN_ORG      Organization this document is related to in general, at the top of the title page
AUTHOR        Thesis author
UDK           UDC classifier
TITLE         Thesis title
SPEC          Thesis speciality code and name
DEGREE        Target scientific degree of the thesis
CITY_YEAR     City and year in the footer of the title page
WORK_ORG      Author's work organization
SUPERVISOR    Scientific supervisor
OPPONENTS     Scientific opponents
LEAD_ORG      Leading organization for the thesis
DEFENSE       Information about thesis defense event
LIBRARY       Where the thesis manuscript is stored
SENT          When participants were notified by mail
SECRETARY     Scientific secretary
PUBLICATIONS  Author's publications for the thesis
ABSTRACT_UK   Abstract in Ukrainian
ABSTRACT_EN   Abstract in English
ABSTRACT_RU   Abstract in Russian
O             Used internally to represent empty label
All 100 documents from the corpus are manually labeled using Label Studio (https://labelstud.io/), an open source data labeling tool, as shown in Figure 1, and exported in JSON format.
Fig. 1. Manual labeling process using Label Studio
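The exported JSON can then be read back as labeled character spans. The sketch below assumes the usual Label Studio export layout for span labeling (a list of tasks, each with a `data` object and `annotations` whose `result` items carry `value.start`, `value.end`, and `value.labels`). The data key `text` depends on the labeling configuration and is an assumption here, as is the function name `read_spans`.

```python
import json


def read_spans(export_json: str, text_key: str = "text"):
    """Return (document_text, [(start, end, label), ...]) for each task.

    Assumption: the export follows the common Label Studio JSON layout
    for span labeling; `text_key` must match the labeling configuration.
    """
    docs = []
    for task in json.loads(export_json):
        spans = []
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                value = item.get("value", {})
                for label in value.get("labels", []):
                    spans.append((value["start"], value["end"], label))
        docs.append((task["data"][text_key], sorted(spans)))
    return docs
```

Character spans obtained this way can be mapped to paragraphs by comparing span offsets with paragraph start positions.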
Model
Feature vector representation of paragraph. A document is represented as a sequence of paragraphs, and each paragraph is converted to a feature vector of Np dimensions. Paragraph features are listed in Table 2. The first Ns = 12 features are quite simple, each reflecting one statistic of a paragraph [3]. The non-trivial features are explained below.
Among other features, a vector of dictionary word counts is used. A short dictionary of Nd = 105 words is built from the most frequent words found in labeled sections: the 10 most frequent words for each label over all documents. The dictionary word vector is concatenated to the main feature vector, adding Nd dimensions.
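The dictionary construction can be sketched as follows. Lower-casing tokens and removing words shared between labels are assumptions (they would explain why the dictionary has fewer entries than 10 words per label), and the function names are hypothetical.

```python
from collections import Counter


def build_dictionary(labeled_sections, top_n=10):
    """Collect the top_n most frequent words of every label.

    labeled_sections: iterable of (label, tokens) pairs.
    Assumption: tokens are lower-cased and a word shared by several
    labels is stored only once.
    """
    per_label = {}
    for label, tokens in labeled_sections:
        per_label.setdefault(label, Counter()).update(t.lower() for t in tokens)
    dictionary = []
    for counter in per_label.values():
        for word, _ in counter.most_common(top_n):
            if word not in dictionary:
                dictionary.append(word)
    return dictionary


def dictionary_vector(tokens, dictionary):
    """Count how often each dictionary word occurs in one paragraph."""
    counts = Counter(t.lower() for t in tokens)
    return [float(counts[word]) for word in dictionary]
```

The resulting vector has one element per dictionary word, so its length is fixed regardless of paragraph size.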
The same goes for character frequencies in a paragraph. The dictionary of characters from the training set contains Nc = 293 characters. Here the paragraph is treated as a bag of characters, and the frequency of each character is calculated. This vector is also concatenated to the main feature vector, adding Nc dimensions.
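A bag-of-characters vector could be computed as in this sketch. It uses raw counts, in line with the "Character counts" row of Table 2; whether the authors normalize the counts is not stated, and the function names are hypothetical.

```python
from collections import Counter


def build_char_dictionary(training_paragraphs):
    """Sorted list of all distinct characters seen in the training set."""
    return sorted({ch for paragraph in training_paragraphs for ch in paragraph})


def char_vector(paragraph, char_dictionary):
    """Count each dictionary character in a paragraph.

    Characters that are not in the dictionary are ignored.
    """
    counts = Counter(paragraph)
    return [float(counts[ch]) for ch in char_dictionary]
```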
Another special feature represents the non-empty label preceding the current paragraph in the document. This feature captures the global order of labels. It has Nl = 20 dimensions, which equals the number of labels in Table 1 (19 structural labels plus O). When a window of nearby paragraphs is used for model training, this global feature is concatenated to the input vector only once. How the window is used is described in the next section.
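One way to encode the previous label is a one-hot vector over the 20 labels of Table 1. The sketch assumes that "previous label" means the most recent non-O label among the preceding paragraphs and that O is used before any label has been seen; both are interpretations, not statements from the paper.

```python
LABELS = [
    "O", "MAIN_ORG", "AUTHOR", "UDK", "TITLE", "SPEC", "DEGREE", "CITY_YEAR",
    "WORK_ORG", "SUPERVISOR", "OPPONENTS", "LEAD_ORG", "DEFENSE", "LIBRARY",
    "SENT", "SECRETARY", "PUBLICATIONS", "ABSTRACT_UK", "ABSTRACT_EN",
    "ABSTRACT_RU",
]


def one_hot(label):
    """Vector of zeros with a single 1.0 at the position of the label."""
    vector = [0.0] * len(LABELS)
    vector[LABELS.index(label)] = 1.0
    return vector


def previous_label_vectors(paragraph_labels):
    """For each paragraph, encode the last non-O label seen before it."""
    vectors = []
    previous = "O"  # assumption: nothing has been labeled yet
    for label in paragraph_labels:
        vectors.append(one_hot(previous))
        if label != "O":
            previous = label
    return vectors
```

During training the gold labels are available; at prediction time the same feature would have to come from the model's own earlier outputs.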
From the above, the feature vector representing a paragraph has Np = Ns + Nd + Nc + Nl = 12 + 105 + 293 + 20 = 430 dimensions. The specific number of simple features, the word and character dictionary sizes, and the label count can vary depending both on the task in question and on tuning to optimize the trained model's scores.
Table 2. Paragraph features
Feature                       Comment
Paragraph start position      Paragraph position measured in characters
Paragraph size                Paragraph size measured in tokens
Words count                   Count of tokens consisting of Cyrillic and Latin letters only
Numbers count                 Count of tokens consisting of digits
Lower-cased word count        Count of words with all characters in lower case
Capitalized word count        Count of tokens with the first character in upper case
Uppercased word count         Count of words with all characters in upper case
Dots count                    Count of dot characters in a paragraph
Commas count                  Count of comma characters in a paragraph
Starts with upper-cased word  The first word of a paragraph is in upper case
Starts with capitalized word  The first word of a paragraph has the first character in upper case
Starts with number            The first token of a paragraph is a number
Dictionary word counts        Vector with each element equal to dictionary word frequency in a paragraph
Character counts              Vector with each element equal to character frequency in a paragraph
Previous label                Vector representing the label of the previous section in the document
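The 12 simple features of Table 2 can be computed along these lines. The tokenizer (runs of word characters or single punctuation marks) and the exact letter set are assumptions, since the paper does not specify them.

```python
import re

# Assumption: a token is a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Latin letters plus Russian and Ukrainian Cyrillic letters.
WORD_RE = re.compile(r"[A-Za-zА-Яа-яЁёІіЇїЄєҐґ]+")


def simple_features(paragraph: str, start: int) -> list[float]:
    """Return the 12 simple features in the order of Table 2."""
    tokens = TOKEN_RE.findall(paragraph)
    words = [t for t in tokens if WORD_RE.fullmatch(t)]
    numbers = [t for t in tokens if t.isdigit()]
    first_token = tokens[0] if tokens else ""
    first_word = words[0] if words else ""
    return [
        float(start),                                # paragraph start position
        float(len(tokens)),                          # paragraph size
        float(len(words)),                           # words count
        float(len(numbers)),                         # numbers count
        float(sum(w.islower() for w in words)),      # lower-cased words
        float(sum(t[0].isupper() for t in tokens)),  # capitalized tokens
        float(sum(w.isupper() for w in words)),      # upper-cased words
        float(paragraph.count(".")),                 # dots count
        float(paragraph.count(",")),                 # commas count
        float(first_word.isupper()),                 # starts with upper-cased word
        float(first_word[:1].isupper()),             # starts with capitalized word
        float(first_token.isdigit()),                # starts with number
    ]
```

For example, `simple_features("УДК 004.82, Київ.", 0)` yields 7 tokens, 2 words, and 2 numbers under this tokenizer.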
The dictionary of the most frequent words in all labeled regions of the training corpus is shown in Table 3.
Table 3. Dictionary of the most frequent words in labeled regions
.
університет
україни
інститут
академія
державний
національний
і
імені
наук
аль
-
анатолій
а
миколайович
михайлівна
‘
володимирівна
сергійович
микола
удк
:
0
)
(
1
2
3
та
на
в
,
у
з
–
01
05
спеціальність
00
02
4
здобуття
дисертації
наукового
ступеня
кандидата
автореферат
технічних
5
київ
харків
одеса
донецьк
львів
дніпропетровськ
виконана
робота
університеті
освіти
науки
державному
науковий
керівник
професор
доктор
опоненти
кафедри
офіційні
провідна
установа
м
кафедра
захист
ради
вченої
відбудеться
о
засіданні
спеціалізованої
можна
дисертацією
бібліотеці
ознайомитись
університету
6
розісланий
р
“
”
«
року
_
секретар
вчений
с
//
и
что
of
the
Neural network training
A window of w = 3 consecutive paragraphs is used as input to the neural network [3]. The previous label feature is added only for the current paragraph. That gives an input layer size of w*(Np-Nl) + Nl = 3*410 + 20 = 1250. The hidden layer size was chosen empirically to be 40 nodes. The output layer size is equal to Nl = 20, as defined by the number of chosen labels.
To train the neural network, the 70 documents of the training corpus are converted into vectors. Because of the chosen window width of 3, each document is augmented with one padding paragraph at the beginning and one at the end.
For each paragraph in the document, the window consists of the paragraph before, the current paragraph, and the paragraph after, forming one input sample for training. These three paragraphs are converted to the input vector by concatenating their feature vectors, and the vector of the previous label feature is then concatenated to these three.
A vector representing the label of the current paragraph is used as the training sample output. The vector consists of 0 in each position except for 1 in the position representing the label of the current paragraph (Fig. 2).
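The construction of training samples described above can be sketched as follows. Filling the padding paragraphs with zeros is an assumption (the paper only says that padding paragraphs are added), and the function names are hypothetical. Each paragraph vector here holds only the 12 + 105 + 293 = 410 local features; the 20-dimensional previous label vector is passed separately.

```python
def window_inputs(paragraph_vectors, prev_label_vectors, pad_value=0.0):
    """Build one input vector per paragraph from a window of width 3.

    paragraph_vectors: local feature vectors, one per paragraph.
    prev_label_vectors: previous label vectors, one per paragraph.
    Assumption: padding paragraphs are vectors filled with pad_value.
    """
    padding = [pad_value] * len(paragraph_vectors[0])
    padded = [padding] + list(paragraph_vectors) + [padding]
    samples = []
    for i in range(len(paragraph_vectors)):
        before, current, after = padded[i], padded[i + 1], padded[i + 2]
        samples.append(before + current + after + prev_label_vectors[i])
    return samples


def one_hot_target(label, labels):
    """Expected network output: 1.0 at the label position, 0.0 elsewhere."""
    return [1.0 if candidate == label else 0.0 for candidate in labels]
```

With 410 local features and 20 label dimensions, every sample has 3*410 + 20 = 1250 elements, which matches the input layer size.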
Fig. 2. Neural network for our example
In this way, the 70 documents of the training dataset provide 18423 training samples. The neural network is trained with the RPROP method implemented in the FANN (Fast Artificial Neural Network, http://leenissen.dk/fann/wp/) library [4, 5]. RPROP is an adaptive backpropagation method that does not require the learning rate to be set explicitly. The mean square error is calculated once per epoch for the whole training set. It takes fewer than 50 epochs to achieve a mean square error below 0.001.
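To illustrate why RPROP needs no learning rate, here is a simplified sign-based update (the iRprop- variant) applied to a toy function. It is not the FANN implementation and trains no network; the step constants are commonly used defaults and are assumptions here.

```python
def rprop_minimize(grad, weights, epochs=200, eta_plus=1.2, eta_minus=0.5,
                   step_init=0.1, step_min=1e-6, step_max=50.0):
    """Minimize a function given its gradient, using only gradient signs.

    Each weight keeps its own step size: it grows while the gradient
    sign is stable and shrinks when the sign flips.
    """
    weights = list(weights)
    steps = [step_init] * len(weights)
    previous = [0.0] * len(weights)
    for _ in range(epochs):
        gradient = list(grad(weights))
        for i in range(len(weights)):
            change = gradient[i] * previous[i]
            if change > 0:
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif change < 0:
                steps[i] = max(steps[i] * eta_minus, step_min)
                gradient[i] = 0.0  # skip the update right after a sign flip
            if gradient[i] > 0:
                weights[i] -= steps[i]
            elif gradient[i] < 0:
                weights[i] += steps[i]
            previous[i] = gradient[i]
    return weights


def toy_gradient(w):
    """Gradient of (w0 - 3)^2 + (w1 + 1)^2, minimal at (3, -1)."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
```

Calling `rprop_minimize(toy_gradient, [0.0, 0.0])` moves the weights close to (3, -1) without any learning rate being chosen by hand.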
We used the validation set of 15 manually labeled documents to run the trained model, compare labeling results, and empirically select the features to use (see Fig. 3a, 3b).
Fig. 3a. Output in HTML format of test corpus documents for visual comparison: marked up manually and using the model
Fig. 3b. Output in HTML format of test corpus documents for visual comparison: marked up manually and using the
model
Results evaluation
The test dataset of another 15 manually labeled documents is used for the final evaluation of the model (Fig. 4). The evaluation is run independently, after the model parameters have been adjusted to improve results on the validation dataset. Standard precision, recall, and F1 measures are used. A strict check is applied: the whole section of a document must be labeled correctly, i.e. a partial overlap, where only some paragraphs in the section are labeled correctly, is considered wrong. Scores are calculated over all documents in the dataset.
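The strict section-level scoring can be sketched as follows. A section is taken to be a maximal run of consecutive paragraphs with the same non-O label, and a predicted section counts as correct only if its label and both boundaries match a manually labeled section. This definition and the function names are assumptions made for illustration.

```python
from collections import Counter


def sections(labels):
    """Return (label, first, last) for each maximal run of a non-O label."""
    result = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "O":
                result.append((labels[start], start, i - 1))
            start = i
    return result


def strict_scores(gold_docs, pred_docs):
    """Per-label (precision, recall, F1) with exact section matching."""
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold_sections = set(sections(gold))
        pred_sections = set(sections(pred))
        for label, _, _ in gold_sections:
            n_gold[label] += 1
        for label, _, _ in pred_sections:
            n_pred[label] += 1
        for label, _, _ in gold_sections & pred_sections:
            correct[label] += 1
    scores = {}
    for label in set(n_gold) | set(n_pred):
        p = correct[label] / n_pred[label] if n_pred[label] else 0.0
        r = correct[label] / n_gold[label] if n_gold[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```

Under this check, a section whose first paragraph is predicted as O is counted as wrong even if all its other paragraphs are labeled correctly.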
Fig. 4. The train and validation datasets are used to build the model, while the test dataset is used to evaluate it
The overall F1 score averaged over all labels is 84. Detailed results can be found in Table 4. All values are multiplied by 100 for convenience.
Table 4. Trained model results evaluation: precision, recall, and F1 score per label. All numbers are multiplied by 100 for convenience
Label         Precision  Recall  F1
MAIN_ORG      81         87      84
AUTHOR        79         73      76
UDK           100        100     100
TITLE         79         73      76
SPEC          87         93      90
DEGREE        93         100     97
CITY_YEAR     100        100     100
WORK_ORG      87         87      87
SUPERVISOR    38         92      53
OPPONENTS     80         80      80
LEAD_ORG      93         93      93
DEFENSE       100        100     100
LIBRARY       80         92      86
SENT          100        100     100
SECRETARY     87         87      87
PUBLICATIONS  31         33      32
ABSTRACT_UK   80         80      80
ABSTRACT_EN   100        100     100
ABSTRACT_RU   80         80      80
Average       83         87      84
Interpretation
The trained model shows the best results on short document sections with consistently strong statistical text features. Long sections that include heterogeneous paragraphs are predicted worst. For example, the publications section consists of a section title followed by list items; while the list items are detected quite well at the paragraph level, the section title is often mispredicted as having no label, and thus the whole section is counted as incorrect. In general, the scores are high enough for practical applications.
Conclusions
A method of extracting high-level sections from weakly structured text documents has been built. The method is based on an artificial neural network and thus requires a training dataset. The dataset is manually labeled to build, validate, and evaluate the model. The model performs well and shows that machine learning can be successfully applied to the problem of extracting logical structure from text documents. It is also simpler to use than rule-based methods, which require special skills to set up the algorithm.
The goal of future research is to improve the scores, especially for long document sections, by modifying the neural network architecture.
References
1. KUDIM K.A., PROSKUDINA G.YU. (2019). Methods and tools for extracting personal data from theses abstracts. Problems in programming. [online – pp.isofts.kiev.ua] (2). P. 38–46. (in Russian). Available from: http://pp.isofts.kiev.ua/ojs1/article/view/359 [Accessed 04/08/2022].
2. KUDIM K.A., PROSKUDINA G.YU. (2020). A method for extracting data from semistructured documents. Problems in programming. [online – pp.isofts.kiev.ua] (1). P. 25–32. (in Russian). Available from: http://pp.isofts.kiev.ua/ojs1/article/view/388 [Accessed 04/08/2022].
3. YI HE. (2017). Extracting Document Structure of a Text with Visual and Textual Cues. University of Twente. Elsevier. 78 p. (in English). Available from: https://essay.utwente.nl/72979/1/Yi He - master thesis - final version.pdf [Accessed 05/08/2022].
4. STEFFEN NISSEN. (2005). Neural Networks Made Simple. Software 2.0. [online – software20.org] (2). P. 14–19. Available from: http://fann.sourceforge.net/fann_en.pdf [Accessed 05/08/2022].
5. MARTIN RIEDMILLER, HEINRICH BRAUN. (1993). A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm – Neural Networks. IEEE International Conference. P. 586–591. Available from: https://paginas.fe.up.pt/~ee02162/dissertacao/RPROP paper.pdf
Received 11.08.2022
About authors:
Kudim Kuzma Alekseevich,
junior researcher of Institute of Software Systems NAS of Ukraine.
Publications in Ukrainian journals – 19.
Publications in foreign journals – 2. 1
http://orcid.org/0000-0001-9483-5495,
Proskudina Galyna Yurievna
researcher of Institute of Software Systems NAS of Ukraine.
Publications in Ukrainian journals – 32.
Publications in foreign journals – 15.
http://orcid.org/0000-0001-9094-1565.
Place of work:
Institute of Software Systems NAS of Ukraine,
03187, Kyiv-187,
Academician Glushkov Avenue, 40, build 5.
Phone: +38(050) 368 49 27.
E-mail: kuzmaka@gmail.com,
guproskudina@gmail.com
Authors' surnames and initials and the title of the report in English:
Kudim K.A., Proskudina G.Yu.
Extracting structure from text documents based on machine learning
Authors' surnames and initials and the title of the report in Ukrainian:
Кудім К.О., Проскудіна Г.Ю.
Витяг структури з текстових документів на основі машинного навчання