Extracting structure from text documents based on machine learning

This study is devoted to a method that facilitates the task of extracting structure from the text documents using an artificial neural network. The method consists of data preparation, building and training the model and results evaluation. Data preparation includes collecting corpora of documents...

Bibliographic details
Date:	2023
Authors:	Kudim, K.A., Proskudina, G.Yu.
Format:	Article
Language:	English
Published:	Інститут програмних систем НАН України 2023
Topics:	natural language processing, information extraction, machine learning, neural network, UDC 004.82
Online access:	https://pp.isofts.kiev.ua/index.php/ojs1/article/view/517
Journal title:	Problems in programming

Repositories

Problems in programming

id	pp_isofts_kiev_ua-article-517
record_format	ojs
resource_txt_mv	ppisoftskievua/4f/c6cd3602963057279bf066bfc819ab4f.pdf
spelling	pp_isofts_kiev_ua-article-517 2023-06-25T05:20:21Z Extracting structure from text documents based on machine learning Витяг структури з текстових документів на основі машинного навчання Kudim, K.A. Proskudina, G.Yu. natural language processing; information extraction; machine learning; neural network UDC 004.82 natural language processing; information extraction; machine learning; neural networks UDC 004.82 This study is devoted to a method that facilitates the task of extracting structure from text documents using an artificial neural network. The method consists of data preparation, building and training the model, and results evaluation. Data preparation includes collecting corpora of documents, converting a variety of file formats into plain text, and manually labeling each document's structure. The documents are then split into tokens and into paragraphs. The text paragraphs are represented as feature vectors to provide input to the neural network. The model is trained and validated on the selected data subsets. An evaluation of the trained model's results is presented. The final performance is calculated per label using precision, recall, and F1 measures, together with the overall average. The trained model can be used to extract sections of documents bearing similar structure. Problems in programming 2022; 3-4: 154-160 The study is devoted to a method that solves the task of automatically extracting structure from weakly structured text documents using an artificial neural network. The method consists of data preparation, building and training the model, and evaluating the results. Data preparation includes collecting corpora of documents, converting various file formats into plain text, and manually labeling the structure of each document. The documents are then split into words and paragraphs. Text paragraphs are represented as feature vectors that provide the input to the neural network. The model is trained and validated on selected subsets of the data. An evaluation of the trained model's results is presented. The final performance is calculated for each label using the F1 score, precision, and recall, as well as the overall average. The trained model can be used to extract sections of documents that have a similar structure. Problems in programming 2022; 3-4: 154-160 Інститут програмних систем НАН України 2023-01-23 Article Article application/pdf https://pp.isofts.kiev.ua/index.php/ojs1/article/view/517 10.15407/pp2022.03-04.154 PROBLEMS IN PROGRAMMING; No 3-4 (2022); 154-160 ПРОБЛЕМЫ ПРОГРАММИРОВАНИЯ; No 3-4 (2022); 154-160 ПРОБЛЕМИ ПРОГРАМУВАННЯ; No 3-4 (2022); 154-160 1727-4907 10.15407/pp2022.03-04 en https://pp.isofts.kiev.ua/index.php/ojs1/article/view/517/570 Copyright (c) 2023 PROBLEMS IN PROGRAMMING
institution	Problems in programming
baseUrl_str	https://pp.isofts.kiev.ua/index.php/ojs1/oai
datestamp_date	2023-06-25T05:20:21Z
collection	OJS
language	English
topic	natural language processing information extraction machine learning neural network UDC 004.82
spellingShingle	natural language processing information extraction machine learning neural network UDC 004.82 Kudim, K.A. Proskudina, G.Yu. Extracting structure from text documents based on machine learning
topic_facet	natural language processing information extraction machine learning neural network UDC 004.82 обробка природної мови видобуток інформації машинне навчання нейронні мережі УДК 004.82
format	Article
author	Kudim, K.A. Proskudina, G.Yu.
author_facet	Kudim, K.A. Proskudina, G.Yu.
author_sort	Kudim, K.A.
title	Extracting structure from text documents based on machine learning
title_short	Extracting structure from text documents based on machine learning
title_full	Extracting structure from text documents based on machine learning
title_fullStr	Extracting structure from text documents based on machine learning
title_full_unstemmed	Extracting structure from text documents based on machine learning
title_sort	extracting structure from text documents based on machine learning
title_alt	Витяг структури з текстових документів на основі машинного навчання
description	This study is devoted to a method that facilitates the task of extracting structure from text documents using an artificial neural network. The method consists of data preparation, building and training the model, and results evaluation. Data preparation includes collecting corpora of documents, converting a variety of file formats into plain text, and manually labeling each document's structure. The documents are then split into tokens and into paragraphs. The text paragraphs are represented as feature vectors to provide input to the neural network. The model is trained and validated on the selected data subsets. An evaluation of the trained model's results is presented. The final performance is calculated per label using precision, recall, and F1 measures, together with the overall average. The trained model can be used to extract sections of documents bearing similar structure. Problems in programming 2022; 3-4: 154-160
publisher	Інститут програмних систем НАН України
publishDate	2023
url	https://pp.isofts.kiev.ua/index.php/ojs1/article/view/517
work_keys_str_mv	AT kudimka extractingstructurefromtextdocumentsbasedonmachinelearning AT proskudinagyu extractingstructurefromtextdocumentsbasedonmachinelearning AT kudimka vitâgstrukturiztekstovihdokumentívnaosnovímašinnogonavčannâ AT proskudinagyu vitâgstrukturiztekstovihdokumentívnaosnovímašinnogonavčannâ
first_indexed	2025-07-17T09:35:54Z
last_indexed	2025-07-17T09:35:54Z
_version_	1837886294461513728
fulltext	Models and tools of database and knowledge base systems

UDC 004.82
https://doi.org/10.15407/pp2022.03-04.154

EXTRACTING STRUCTURE FROM TEXT DOCUMENTS BASED ON MACHINE LEARNING

Kuzma Kudim, Galyna Proskudina

This study is devoted to a method that facilitates the task of extracting structure from text documents using an artificial neural network. The method requires a set of manually labeled documents to train the network. The trained model can be used to extract sections of documents that have a similar structure.

Keywords: natural language processing, information extraction, machine learning, neural network.

Abstract in Ukrainian (translated): The study is devoted to a method that solves the task of automatically extracting structure from weakly structured text documents using an artificial neural network. For the method to work, a manually labeled set of documents is needed to train the network. The trained model can be used to extract sections of documents that have a similar structure.

Keywords: natural language processing, information extraction, machine learning, neural networks.

Introduction

Many text documents have rich presentational formatting that is easy for a human to read and understand but is not intended for automatic processing. Examples are scientific papers, legal documents, and books. All of them have an implicit logical structure, such as a title page with title and author, a publisher's imprint, chapters, and references. If this logical structure is made explicit, it can be processed automatically and then used either as metadata describing the document or as input for further fine-grained information extraction.

Here we describe a method that facilitates the task of extracting structure from text documents using an artificial neural network. The method requires a set of manually labeled documents to train the network. The trained model can be used to extract sections of documents that have a similar structure.

We have previously described two other methods of data extraction from semi-structured text documents: one based on detecting patterns using regular expressions and another based on linguistic rules [1, 2]. Both of these methods require special skills to set them up for a particular type of document and to update the system when the structure changes. The method based on machine learning described here has the benefit of not requiring programming skills to use. The initial setup requires only an accurately labeled set of documents, and this labeling can be done by any person with a basic understanding of the target structure of the document in the usual sense.

Overview

The paper consists of three main sections.

First, data must be prepared to train, validate, and evaluate the model. Data preparation includes collecting corpora of documents, converting a variety of file formats into plain text, and manually labeling the structure of each document. Finally, the dataset is split into three subsets for model training, validation, and testing in a 70/15/15 ratio, respectively.

Building and training the model is the central part of the work. Each document is split into tokens and then into paragraphs. The text paragraphs are represented as feature vectors that provide the input to a neural network consisting of three fully connected layers. The model is trained and validated on the selected data subsets.

After the trained model shows a good F1 score on the validation dataset for the selected features, the results are evaluated on entirely new data, i.e. the test dataset. The final performance is calculated per label using precision, recall, and F1 measures, together with the overall average.

Data preparation

Corpora. A selected subset of the thesis corpus of the V.I. Vernadsky National Library of Ukraine is used as a dataset. The whole corpus consists of nearly 65000 documents.
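The 70/15/15 split mentioned in the overview can be sketched as follows. This is only an illustration, not the authors' code: the function name split_corpus, the fixed random seed, and the placeholder file names are assumptions, since the source states the ratio but not how documents were assigned to the subsets.

```python
import random


def split_corpus(doc_ids, ratios=(70, 15, 15), seed=0):
    """Shuffle document ids and split them into train/validation/test parts."""
    ids = list(doc_ids)
    # A fixed seed is an assumption, used here only to make the split repeatable.
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder goes to the test subset
    return train, val, test


# Hypothetical file names standing in for the selected documents.
docs = [f"thesis_{i:03d}.txt" for i in range(100)]
train, val, test = split_corpus(docs)
print(len(train), len(val), len(test))
```

Keeping the three subsets disjoint is the point of the exercise: the test documents must not influence feature selection or training.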
A subset of 100 theses is selected and split into 70 documents for the training set, 15 for the validation set, and 15 for the test set used in the final evaluation.

Conversion to plain text. The selected documents are in doc and rtf formats. As a preliminary step, they are converted to plain text using LibreOffice (https://www.libreoffice.org) from the command line as follows:

soffice --headless --convert-to txt --outdir out_dir in_file

The output text files are in UTF-8 encoding with a BOM signature at the start of the file, so the first three bytes of each file are additionally removed.

© K.O. Kudim, G.Yu. Proskudina, 2022. ISSN 1727-4907. Problems in programming. 2022. No. 3-4. Special issue

Labeling. Our goal is to select top-level sections of the document that are potentially useful for further information extraction. On the one hand, we are not interested in the main thematic content of the thesis; on the other hand, at this stage we do not care about the fine-grained data contained deeper in each section. Factual data extraction can be the next step after the larger document sections are successfully extracted.

The 19 labels shown in Table 1 are selected to reflect the desired top-level logical structure of the thesis document. Each label covers a whole section of the document, although sections can differ greatly in size and inner complexity. For example, a section labeled SPEC covers the speciality digital code and name, or possibly a list of such records. Another section, labeled PUBLICATIONS, includes all listed publications as a whole section of the document. Fine-grained information extraction is out of the scope of the current work. The special label O is used internally to represent the absence of any specific label.

Table 1.
Structural labels for thesis document

Label	Document section
MAIN_ORG	Organization this document is related to in general, at the top of the title page
AUTHOR	Thesis author
UDK	UDC classifier
TITLE	Thesis title
SPEC	Thesis speciality code and name
DEGREE	Target scientific degree of the thesis
CITY_YEAR	City and year in the footer of the title page
WORK_ORG	Author's work organization
SUPERVISOR	Scientific supervisor
OPPONENTS	Scientific opponents
LEAD_ORG	Leading organization for the thesis
DEFENSE	Information about thesis defense event
LIBRARY	Where the thesis manuscript is stored
SENT	When participants were notified by mail
SECRETARY	Scientific secretary
PUBLICATIONS	Author's publications for the thesis
ABSTRACT_UK	Abstract in Ukrainian
ABSTRACT_EN	Abstract in English
ABSTRACT_RU	Abstract in Russian
O	Used internally to represent an empty label

All 100 documents from the corpus are manually labeled using Label Studio (https://labelstud.io/), an open source data labeling tool, as shown in Figure 1, and exported in JSON format.

Fig. 1. Manual labeling process using Label Studio

Model

Feature vector representation of a paragraph. A document is represented as a sequence of paragraphs, and each paragraph is converted to a feature vector of N dimensions. Paragraph features are listed in Table 2. The first Ns = 12 features are quite simple, each reflecting one statistical value of a paragraph [3]. The non-trivial features are explained below.

Among other features, a vector of dictionary word counts is used. A short dictionary of Nd = 105 words is built from the most frequent words found in labeled sections: the 10 most frequent words for each label over all documents. The dictionary word vector is concatenated to the main feature vector, adding Nd dimensions.

The same goes for character frequencies in a paragraph. The dictionary of characters from the training set contains Nc = 293 characters.
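The dictionary word-count feature described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names build_dictionary and count_vector, the pre-tokenized input, the rule that a word shared by several labels is stored once, and the toy data are all hypothetical. The same counting applies to characters if a paragraph is passed as a list of characters.

```python
from collections import Counter


def build_dictionary(labeled_sections, top_n=10):
    """Collect the top_n most frequent tokens per label into one ordered list.

    labeled_sections: iterable of (label, tokens) pairs from the training documents.
    """
    per_label = {}
    for label, tokens in labeled_sections:
        per_label.setdefault(label, Counter()).update(tokens)
    dictionary = []
    for label in sorted(per_label):
        for token, _ in per_label[label].most_common(top_n):
            # Assumption: a token frequent under several labels is kept only once.
            if token not in dictionary:
                dictionary.append(token)
    return dictionary


def count_vector(tokens, dictionary):
    """Return one count per dictionary entry for the tokens of a paragraph."""
    counts = Counter(tokens)
    return [counts[entry] for entry in dictionary]


# Toy labeled sections (hypothetical data, already tokenized).
sections = [
    ("AUTHOR", ["ivanenko", "ivan", "ivanovych"]),
    ("TITLE", ["methods", "of", "text", "analysis"]),
    ("TITLE", ["methods", "for", "text", "processing"]),
]
dictionary = build_dictionary(sections, top_n=2)
vector = count_vector(["methods", "and", "tools", "of", "text", "analysis"], dictionary)
print(dictionary)
print(vector)
```

The resulting vector has one position per dictionary entry, so its length is fixed by the dictionary and does not depend on the paragraph.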
Here the paragraph is considered as a bag of characters, and the frequency of each character is calculated. This vector is also concatenated to the main feature vector, adding Nc dimensions.

Another special feature represents the most recent non-empty label preceding the current paragraph in the document. This feature captures the global order of labels. It has Nl = 20 dimensions, equal to the number of labels in Table 1 (19 structural labels plus the special label O). When a window of nearby paragraphs is used for model training, this global feature is concatenated to the input vector only once. How the window is used is described in the next section.

From the above, the feature vector representing a paragraph has Np = Ns + Nd + Nc + Nl = 12 + 105 + 293 + 20 = 430 dimensions. The specific numbers of simple features, the word and character dictionary sizes, and the label count can vary depending on the task in question and can also be tuned when optimizing the scores of the trained model.

Table 2. Paragraph features

Feature	Comment
Paragraph start position	Paragraph position measured in characters
Paragraph size	Paragraph size measured in tokens
Words count	Count of tokens consisting of Cyrillic and Latin letters only
Numbers count	Count of tokens consisting of digits
Lower-cased word count	Count of words with all characters in lower case
Capitalized word count	Count of tokens with the first character in upper case
Uppercased word count	Count of words with all characters in upper case
Dots count	Count of dot characters in a paragraph
Commas count	Count of comma characters in a paragraph
Starts with upper-cased word	The first word of a paragraph is in upper case
Starts with capitalized word	The first word of a paragraph has its first character in upper case
Starts with number	The first token of a paragraph is a number
Dictionary word counts	Vector with each element equal to a dictionary word frequency in a paragraph
Character counts	Vector with each element equal to a character frequency in a paragraph
Previous label	Vector representing the label of the previous section in the document

The dictionary of the most frequent words in all labeled regions of the training corpus is shown in Table 3.

Table 3. Dictionary of the most frequent words in labeled regions
університет україни інститут академія державний національний і імені наук аль - анатолій а миколайович михайлівна ‘ володимирівна сергійович микола удк : 0 ) ( 1 2 3 та на в , у з – 01 05 спеціальність 00 02 4 здобуття дисертації наукового ступеня кандидата автореферат технічних 5 київ харків одеса донецьк львів дніпропетровськ виконана робота університеті освіти науки державному науковий керівник професор доктор опоненти кафедри офіційні провідна установа м кафедра захист ради вченої відбудеться о засіданні спеціалізованої можна дисертацією бібліотеці ознайомитись університету 6 розісланий р “ ” « року _ секретар вчений с // и что of the

Neural network training

A window of w = 3 consecutive paragraphs is used as input to the neural network [3]. The previous label feature is added only for the current paragraph. This gives an input layer size of w*(Np-Nl) + Nl = 3*(430-20) + 20 = 1250. The hidden layer size was chosen empirically to be 40 nodes. The output layer size is equal to Nl = 20; it is defined by the number of chosen labels.

To train the neural network, the 70 documents of the training corpus are converted into vectors. Because the chosen window has width 3, each document is augmented with one padding paragraph at the beginning and one at the end. For each paragraph in the document, the window consists of the paragraph before, the current paragraph, and the paragraph after, which together form one input sample for training. These three paragraphs are converted to the input vector by concatenating their feature vectors, and the vector of the previous label feature is then concatenated to these three.

A vector representing the label of the current paragraph is used as the output of the training sample. The vector consists of 0 in each position except for 1 in the position representing the label of the current paragraph (Fig. 2).

Fig. 2. Neural network for our example

In this way, the 70 documents of the training dataset provide 18423 training samples. The neural network is trained with the RPROP method implemented in the FANN (Fast Artificial Neural Network, http://leenissen.dk/fann/wp/) library [4, 5]. RPROP is an adaptive backpropagation method that does not require the learning rate to be set explicitly. The mean square error is calculated once per epoch for the whole training set.
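The construction of one training sample described in this section can be sketched as follows. This is a minimal sketch rather than the authors' code: the function names, the all-zero vectors used for the padding paragraphs, the convention that the empty label O has index 0, and the toy sizes in the usage example are assumptions. Only the window width w = 3 and the layout of the input vector come from the text.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def make_samples(par_features, label_ids, n_labels, window=3):
    """Build (input, target) pairs for one document.

    par_features: per-paragraph feature vectors WITHOUT the previous-label part.
    label_ids: integer label of each paragraph; index 0 is assumed to be label O.
    """
    half = window // 2
    dim = len(par_features[0])
    pad = [0.0] * dim  # assumption: padding paragraphs are all-zero vectors
    padded = [pad] * half + list(par_features) + [pad] * half
    samples = []
    prev_label = 0  # most recent non-empty label seen before the current paragraph
    for i, label in enumerate(label_ids):
        x = []
        for vec in padded[i:i + window]:  # paragraph before, current, after
            x.extend(vec)
        x.extend(one_hot(prev_label, n_labels))  # previous-label feature, added once
        samples.append((x, one_hot(label, n_labels)))
        if label != 0:
            prev_label = label
    return samples


# Toy usage: 4 paragraphs with 5 features each and 3 labels.
features = [[float(i)] * 5 for i in range(1, 5)]
samples = make_samples(features, [1, 0, 2, 2], n_labels=3)
print(len(samples), len(samples[0][0]))

# Input size for the dimensions given in the text: w*(Np-Nl) + Nl.
input_size = 3 * (430 - 20) + 20
print(input_size)
```

Each paragraph yields exactly one sample, which is consistent with the text's statement that the number of training samples is determined by the number of paragraphs in the training documents.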
It takes fewer than 50 epochs to achieve a mean square error below 0.001.

We used the validation set of 15 manually labeled documents to run the trained model, compare the labeling results, and empirically select the features to use (see Fig. 3a, 3b).

Fig. 3a. Output in HTML format of test corpus documents for visual comparison: marked up manually and using the model

Fig. 3b. Output in HTML format of test corpus documents for visual comparison: marked up manually and using the model

Results evaluation

The test dataset of another 15 manually labeled documents is used for the final evaluation of the model (Fig. 4). The evaluation is executed independently, after the model parameters have been adjusted using the validation dataset. Standard precision, recall, and F1 measures are used. A strict check is applied: the whole section of a document must be labeled correctly, i.e. a partial overlap in which only some paragraphs of the section are labeled correctly is considered wrong. Scores are calculated over all documents in the dataset.

Fig. 4. The train and validation datasets are used to build the model, while the test dataset is used to evaluate it

The overall F1 score averaged over all labels is 84. Detailed results can be found in Table 4.

Table 4. Trained model results evaluation: precision, recall, and F1 score per label. All numbers are multiplied by 100 for convenience

Label	Precision	Recall	F1
MAIN_ORG	81	87	84
AUTHOR	79	73	76
UDK	100	100	100
TITLE	79	73	76
SPEC	87	93	90
DEGREE	93	100	97
CITY_YEAR	100	100	100
WORK_ORG	87	87	87
SUPERVISOR	38	92	53
OPPONENTS	80	80	80
LEAD_ORG	93	93	93
DEFENSE	100	100	100
LIBRARY	80	92	86
SENT	100	100	100
SECRETARY	87	87	87
PUBLICATIONS	31	33	32
ABSTRACT_UK	80	80	80
ABSTRACT_EN	100	100	100
ABSTRACT_RU	80	80	80
Average	83	87	84

Interpretation

The trained model shows the best results on short document sections with consistently strong statistical text features. Long sections that include heterogeneous paragraphs are predicted worst. For example, the publications section consists of a section title followed by list items; while the latter are detected quite well at the paragraph level, the section title is often mispredicted as having no label, and thus the whole section is considered incorrect. In general, the scores are high enough for practical applications.

Conclusions

A method of extracting high-level sections from weakly structured text documents is built. The method is based on an artificial neural network and thus requires a training dataset. The dataset is manually labeled to build, validate and evaluate the model.
The model performs well and shows that machine learning can be successfully applied to the problem of extracting logical structure from text documents. It is also simpler to use than rule-based methods, which require special skills to set up the algorithm. The future research goal is to improve the scores, especially for long document sections, by modifying the neural network architecture.

References

1. KUDIM K.A., PROSKUDINA G.YU. (2019). Methods and tools for extracting personal data from theses abstracts. Problems in programming. [online – pp.isofts.kiev.ua] (2). P. 38–46. (in Russian). Available from: http://pp.isofts.kiev.ua/ojs1/article/view/359 [Accessed 04/08/2022].
2. KUDIM K.A., PROSKUDINA G.YU. (2020). A method for extracting data from semistructured documents. Problems in programming. [online – pp.isofts.kiev.ua] (1). P. 25–32. (in Russian). Available from: http://pp.isofts.kiev.ua/ojs1/article/view/388 [Accessed 04/08/2022].
3. YI HE. (2017). Extracting Document Structure of a Text with Visual and Textual Cues. University of Twente. Elsevier. 78 p. (in English). Available from: https://essay.utwente.nl/72979/1/Yi He - master thesis - final version.pdf [Accessed 05/08/2022].
4. STEFFEN NISSEN. (2005). Neural Networks Made Simple. Software 2.0. [online – software20.org] (2). P. 14–19. Available from: http://fann.sourceforge.net/fann_en.pdf [Accessed 05/08/2022].
5. MARTIN RIEDMILLER, HEINRICH BRAUN. (1993). A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. Neural Networks. IEEE International Conference. P. 586–591. Available from: https://paginas.fe.up.pt/~ee02162/dissertacao/RPROP paper.pdf

Received 11.08.2022

About authors:

Kudim Kuzma Alekseevich, junior researcher of Institute of Software Systems NAS of Ukraine. Publications in Ukrainian journals – 19. Publications in foreign journals – 2. http://orcid.org/0000-0001-9483-5495

Proskudina Galyna Yurievna, researcher of Institute of Software Systems NAS of Ukraine. Publications in Ukrainian journals – 32. Publications in foreign journals – 15. http://orcid.org/0000-0001-9094-1565

Place of work: Institute of Software Systems NAS of Ukraine, 03187, Kyiv-187, Academician Glushkov Avenue, 40, build 5. Phone: +38(050) 368 49 27. E-mail: kuzmaka@gmail.com, guproskudina@gmail.com

Authors' surnames and initials and the title of the paper in English: Kudim K.A., Proskudina G.Yu. Extracting structure from text documents based on machine learning

Authors' surnames and initials and the title of the paper in Ukrainian: Кудім К.О., Проскудіна Г.Ю. Витяг структури з текстових документів на основі машинного навчання
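The strict section-level scoring described under Results evaluation can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the representation of a section as a (document, label, first paragraph, last paragraph) tuple, the function names, and the toy data are assumptions. A predicted section counts as correct only when it matches a manually labeled section exactly, so a partial overlap is scored as wrong.

```python
def section_scores(gold, predicted):
    """Per-label precision, recall and F1 with exact section matching.

    gold, predicted: sets of (doc_id, label, first_paragraph, last_paragraph).
    """
    labels = sorted({s[1] for s in gold | predicted})
    scores = {}
    for label in labels:
        g = {s for s in gold if s[1] == label}
        p = {s for s in predicted if s[1] == label}
        tp = len(g & p)  # sections predicted with exactly the right boundaries
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over all labels."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy data: the TITLE section of doc2 is predicted one paragraph too long.
gold = {("doc1", "TITLE", 3, 4), ("doc1", "AUTHOR", 2, 2), ("doc2", "TITLE", 5, 5)}
predicted = {("doc1", "TITLE", 3, 4), ("doc1", "AUTHOR", 2, 2), ("doc2", "TITLE", 5, 6)}
scores = section_scores(gold, predicted)
average = macro_average(scores)
print(scores)
print(average)
```

The Average row of Table 4 is consistent with such an unweighted mean: the column means over the 19 labels round to 83, 87, and 84.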
