A method for extracting data from semi-structured documents

Bibliographic details
Date: 2020
Authors: Kudim, K.A., Proskudina, G.Yu.
Format: Article
Language: Russian
Published: Institute of Software Systems of the National Academy of Sciences of Ukraine, 2020
Online access: https://pp.isofts.kiev.ua/index.php/ojs1/article/view/388
Journal title: Problems in programming

Repositories

Problems in programming
Description
Abstract: A linguistic method for extracting data from weakly structured documents is developed, tested, and described in detail in the paper. Sample data were taken from the thesis catalogue of the Vernadsky National Library of Ukraine. All stages are described in sequence: choosing the document collection; preparing the documents; writing grammar rules for extracting data from text; writing rules for morphology verification; creating interpretation (data-binding) rules; and analyzing the parsing results. The linguistic method showed many advantages over the previously described method of data extraction with regular expressions.
Problems in programming 2020; 1: 25-32