Automatic generation of semantic knowledge networks from an unstructured text
A method and an algorithm for the automated construction of semantic knowledge networks from the most informative concepts in electronic texts are proposed. An analysis and comparison of existing methods, and their software implementations, for information research in electronic texts are presented. The results of analyzing a BBC news article with the proposed method are given.
Saved in:
Date: 2018
Main authors: Savchenko, M.N.; Kriachok, A.S.
Format: Article
Language: English
Published: Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України, 2018
Series: Управляющие системы и машины
Online access: http://dspace.nbuv.gov.ua/handle/123456789/142078
Journal: Digital Library of Periodicals of National Academy of Sciences of Ukraine
Citation: Automatic generation of semantic knowledge networks from an unstructured text / M.N. Savchenko, A.S. Kriachok // Управляющие системы и машины. — 2018. — № 1. — С. 71–78. — Бібліогр.: 10 назв. — англ.
ISSN 0130-5395, УСиМ, 2018, № 1 71
DOI: https://doi.org/10.15407/usim.2018.01.071
UDK 004.89
M. SAVCHENKO, Student (Master's), zitros.lab@gmail.com
O. KRIACHOK, Candidate of Engineering Sciences, Associate Professor, alexandrkriachok@gmail.com
National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute",
Kyiv, 37 Peremohy Ave., Building 5, 03056.
AUTOMATIC GENERATION OF SEMANTIC KNOWLEDGE
NETWORKS FROM AN UNSTRUCTURED TEXT
A method and an algorithm for the automated construction of semantic knowledge networks from the most informative concepts in electronic texts are proposed. An analysis and comparison of existing methods, and their software implementations, for information research in electronic texts are presented. The results of analyzing a BBC news article with the proposed method are given.
Keywords: building semantic networks, knowledge extraction, knowledge models, natural language processing
Introduction
In today's information environment, given the huge amount of unstructured textual information, there is a need to search for, extract, formalize, and process the most essential knowledge laid down by the authors of texts. Such knowledge may be hidden in the concepts presented in a document and in the characteristic relationships between those concepts.
Considering a large number of texts, such as news articles or scientific publications, one can notice that each text carries a certain meaning unique to that text. Only after reading the text can one briefly describe it, aiming to highlight the most important concepts in it and to connect them logically with other known concepts. In the rest of this document we will call every single text a document, a single word in it a term, and a word or group of words that represents a specific entity a concept.
When it comes to a huge data set, Big Data, or a massive text corpus, a single person cannot read through it quickly and take in all the important information. There is a need to structure the knowledge in texts and present them in a structured form that can be analyzed quickly. There are various methods for searching for and evaluating the relevant information in a document, on the basis of which one can make decisions using in-depth analysis of certain parts. The statistical methods for estimating (weighting) and identifying the most relevant terms in a particular document include the following [1]:
Ranking the terms t_i in a document D_j by the number of occurrences of the term t_i in D_j — TF, which stands for Term Frequency. It has been established statistically that if a document describes a specific area of knowledge, the concepts most typical of that document are repeated relatively many times.
Ranking the terms t_i in a document D_j by the number of occurrences of t_i in D_j, weighted inversely by the number of other documents D_j' that also contain t_i — TF-IDF, which stands for Term Frequency — Inverse Document Frequency. In contrast to TF, TF-IDF allows us to dramatically reduce the rank of terms found in almost all documents, such as articles, prepositions, and other insignificant commonly used words.
Other methods, such as Okapi BM25, the sigma method, etc.
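As an illustration, the TF-IDF weighting described above can be sketched as follows. This is a minimal Python sketch over a toy three-document corpus; the actual implementation used by the authors [10] may normalize or smooth the values differently.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for every term in every document.

    docs: list of documents, each a list of lowercase terms.
    Returns: one dict per document, mapping term -> TF-IDF weight.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["the", "voter", "uses", "the", "ink"],
    ["the", "economy", "grows"],
    ["the", "ink", "marks", "the", "voter"],
]
w = tf_idf(docs)
# "the" occurs in every document, so its IDF (and hence weight) is 0;
# terms unique to one document, like "economy", rank highest there.
```

This behavior is exactly the property used above: ubiquitous function words are pushed toward zero, while document-specific concepts rise to the top of the ranking.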
The above methods allow us to highlight the concepts most relevant to a specific text. Using this information, when relations between concepts are established, relations involving low-weight concepts can be dropped, giving priority to the concepts with the highest scores.
To examine the documents and obtain a formalized representation of the unstructured information that they contain, it is proposed to construct a semantic network of concepts and the relationships between them. There are many approaches to the construction of semantic networks, and in this article only the most common ones are described.
Matrix method. To identify the most strongly expressed relationships between pairs of terms {t_i, t_j}, i ≠ j, we can use the matrix method. In this method, all terms are first weighted by one of the weighting methods (for example, TF-IDF), and each term t_i is assigned its unique corresponding weight p_i. After selecting the n most highly expressed terms, the adjacency matrix A is constructed and filled with zeros. Next, the document is divided into groups g_k. Each group may be represented, for example, by one individual sentence.
The adjacency matrix for the matrix method is completed as follows. For each group g_k, every pair of terms {t_i, t_j}, i ≠ j, t_i ∈ g_k, t_j ∈ g_k is considered, and the number of groups into which such a pair falls is added to the adjacency matrix A. We can also take the weights of such a pair of terms {t_i, t_j} into account and use them as a coefficient for the resulting value in the adjacency matrix A.
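The matrix construction just described can be sketched as follows, assuming the sentence groups have already been tokenized into term sets and the top-weighted terms selected; the per-pair weight coefficient mentioned above is omitted for brevity.

```python
from itertools import combinations

def cooccurrence_matrix(groups, top_terms):
    """Matrix-method adjacency matrix: A[i][j] counts the groups
    (e.g. sentences) in which terms i and j co-occur.

    groups: list of groups, each a set of terms.
    top_terms: the n most highly weighted terms, in a fixed order.
    """
    index = {t: i for i, t in enumerate(top_terms)}
    n = len(top_terms)
    A = [[0] * n for _ in range(n)]  # start filled with zeros
    for g in groups:
        present = sorted(index[t] for t in g if t in index)
        for i, j in combinations(present, 2):
            A[i][j] += 1
            A[j][i] += 1  # co-occurrence is symmetric
    return A

groups = [
    {"voter", "ink", "station"},
    {"ink", "station"},
    {"voter", "economy"},
]
A = cooccurrence_matrix(groups, ["voter", "ink", "station"])
# "ink" and "station" co-occur in two groups, so A[1][2] accumulates 2.
```

Terms outside the selected top-n list (here, "economy") are simply ignored, which is exactly the disadvantage discussed later: low-weight entities never reach the resulting graph.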
To demonstrate the results, this article will use a text corpus of approximately 2500 BBC News articles in English, each presented as a separate text document containing the title and full text of the article. The articles are divided into 5 main categories: business, politics, sport, entertainment and technology. Around 500 documents belong to each category. For in-depth analysis we will use the article «Ink helps drive democracy in Asia», published February 19, 2005 on the BBC News website [2].
An algorithm for constructing a semantic network using the matrix method [6]:
1. Evaluate (weight) each term in the given text in relation to all other terms in the text body using one of the above methods (we will use TF-IDF).
2. Create a list of unique terms, sorted in descending order of their weight in the given text. The 30 most highly weighted terms are selected; in automatic analysis one can select, for example, the top 3% of the most highly ranked terms.
3. Construct the adjacency matrix by the algorithm described above.
The result is written to a *.csv file and visualized with the Gephi visualization tool. The resulting graph is shown in Figure 1.
The above graph visualizes terms which have a connection with other terms in the text. The more direct links the text contains between two terms, the thicker the connection between those terms in the graph. The size of a concept in the graph is also proportional to its weight under the weighting method.
It may be noted that the concepts are mainly organized into separate clusters. For example, among other clusters, it can be noted that the most strongly expressed terms are «voter enter uv station polling». Undoubtedly, this cluster expresses the most relevant part of the knowledge in the text, as compared to all other texts. From the other most expressed clusters, such as «upcoming presidential parliamentary elections» or «Kyrgyz elections use ultraviolet ink», one can get a clear idea of the nature and content of the text.

Fig. 1. Analysis of entities in the article using the matrix method
The matrix method includes the most well-interconnected terms, but one of its major disadvantages is the relative difficulty of isolating the most strongly expressed clusters in the graph. This is not hard to do visually when the number of terms is small, but as the number of connections grows, it becomes very difficult to analyze and find such clusters, even with software. In addition, the resulting graph can almost never include terms that have relatively low weight according to the entity evaluation algorithm, which plays a key role for a data crawler.
Horizontal visibility graph application. Another interesting method for constructing a semantic network of terms is the method proposed by D.V. Lande, which combines features of the horizontal visibility graph with term evaluation methods within a single text [3]. This method can be applied not only to build a network of the basic terminological concepts of a text, but also to build a semantic network as a whole.
An algorithm for constructing a semantic network with this method is as follows:
1. Text entities, as in the previous method, are evaluated (weighted) in relation to all other terms in the text body with the TF-IDF method.
2. The algorithm for constructing the horizontal visibility graph is applied to the term weights: the horizontal axis holds the term's position in the text, and the vertical axis its weight. Before constructing the actual result, procedures such as stemming and rejection of the terms listed in a stop-word list are performed.
Figure 2 shows the principle of horizontal visibility graph construction over the normalized TF-IDF evaluation values. After the term weights are laid out along the horizontal axis, for each term t_i in the document D a «horizontal search» is applied using a horizontal visibility algorithm [4]. Thus, two terms t_i and t_j are compared and a bond is formed between them. The weight of this connection in the graph may be proportional to a predetermined «farsightedness» (horizon). In other words, terms that are adjacent in the document form a strong bond, and thereby a strong relation between concepts.
One possible semantic network construction for the given text using the horizontal visibility graph is shown in Figure 3. This graph has been built with a threshold value of 0.25, which means that only those terms whose relative weight is greater than 25% make it into the final result. The visibility horizon was set to 20 words, and the weight of a particular connection between two terms was measured by the total linear distance between them.
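The horizontal visibility criterion itself can be sketched as below. This is a minimal illustration over a short weight sequence: the stemming, stop-word, and threshold steps are omitted, and the optional `horizon` parameter mimics the «farsightedness» limit described above.

```python
def horizontal_visibility_edges(weights, horizon=None):
    """Horizontal visibility graph over a sequence of term weights.

    Positions i < j are connected when every weight strictly between
    them is lower than both weights[i] and weights[j]; an optional
    horizon limits how far apart connected positions may be.
    """
    edges = []
    n = len(weights)
    for i in range(n):
        for j in range(i + 1, n):
            if horizon is not None and j - i > horizon:
                break
            if all(weights[k] < min(weights[i], weights[j])
                   for k in range(i + 1, j)):
                edges.append((i, j))
    return edges

# Normalized TF-IDF weights laid out in text order:
w = [0.9, 0.2, 0.5, 0.1, 0.7]
edges = horizontal_visibility_edges(w)
# Adjacent positions are always mutually visible; distant ones connect
# only when no taller bar stands between them.
```

Note how highly weighted terms (positions 0 and 4 here) "see over" the low-weighted terms between them, which is precisely what lets this method form stable links between the strongest concepts of a text.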
Building a semantic network based on horizontal visibility graph construction has an advantage in the formation of stable relations between concepts, as it allows us to explicitly highlight concepts and the links that exist between them through other concepts. But this method, without modifications, has the same disadvantages as the previous, matrix method: it is difficult to determine the logical order of relationships between entities and, in addition, as in the matrix method, some medium- or low-weighted words can be missed.
Fig. 2. The principle of the horizontal visibility graph construction
Fig. 3. The article analysis using the horizontal visibility graph algorithm

InterSystems iKnow technology. InterSystems Corporation develops its own proprietary algorithms for in-depth analysis of texts that can also be used for the construction of semantic networks of words [7]. The set of tools for applied text analysis included in the InterSystems iKnow corporate product allows one to identify concepts in a text, along with similarity relations and connecting relations between those concepts. Solutions for «structuring» unstructured information can be built on top of these tools, such as, for example, a solution for the semantic analysis of sentences, or for revealing current trends through analysis of the news, and so on.
The algorithm by which InterSystems iKnow finds the relationships between concepts in a text, as well as similar concepts, is mainly based on the use of stable structures and words of the natural language [8]. For example, some of the concepts in a text are volatile: they change rapidly, and there are too many of them to memorize in order to establish the role of each concept in the text. But there are constructs that can be considered relatively stable across different ages, for example: «no», «replace», «performing», «it is», «use», «used by», «stored in the» and so on. InterSystems iKnow recognizes such patterns for the specified language and distinguishes separate concepts and the relations between them from other, minor parts of speech. Additional metrics are also computed for concepts, such as the total number of concepts, their spread (the average distance between occurrences of the same concept in the text), relevance, and a total score, which is combined from the previous metrics by a formula.
To demonstrate the algorithm in action, let us take as an example the sentence «clever cat eats cheese and breathes on a mouse burrows». The iKnow technology will first consider the text as a set of stable language constructs, i.e. «__ __ eats and breathes on a __». After that, the blanks «__» are treated as the concepts, whose presence there is almost guaranteed. The concepts are also compared in terms of similarity, such that «smart cat» is similar to the concept «cat», and «mouse holes» is linked to it through «holes».
In this way the article is analyzed, and such concepts as, for example, «readers» and the similar «ultraviolet readers», «effort» and «general effort» are identified. Figure 4 depicts a graph — a word semantic network constructed from the article under analysis by the InterSystems iKnow technology and visualized using the iKnow Entity Browser visualization tool. Note that the main concept of the graph is «ink», from which arrows show which concepts it is related to. The size of a concept's circle corresponds to the score that iKnow assigns to it. The arrows coming out of these concepts point to «similar» concepts.
Thus, the iKnow toolkit can identify not only the concepts most relevant to a given text, but also the relationships between them and the nature of those relationships (e.g., negation or similarity).
Importantly, InterSystems iKnow is built mainly for working with entities, which is essentially what the modern market requires. After gathering the necessary data about a particular concept using the iKnow tools, an expert must still independently find all the relevant entities in the text and analyze individual sentences to update the basic knowledge that has been assigned to the concept. In addition, the technology is closed-source and is supplied with InterSystems' main product — the Caché DBMS (since 2018, the IRIS platform).
Fig. 4. The article analysis using iKnow technology with the iKnow Entity Browser

The complex method of constructing a semantic network. Taking into account all the advantages of the above-described methods for building a semantic network (knowledge network) for a single text, a complex approach has been developed and investigated that includes all the benefits of the methods described above and adds its own.
This method combines word-evaluation algorithms with Part-of-Speech Tagging algorithms (POS Tagging, the identification of parts of speech), which makes it possible to find concepts with their characteristics in the text and to build relationships between them based solely on information obtained from the text and a small number of basic rules for a given language.
Also, this method builds a full semantic network (knowledge network) that contains logical connections between the concepts, as opposed to simple binary «yes-no» relationships.
A simplified algorithm for constructing a semantic network using the complex method looks as follows:
1. Without changing the original text, part-of-speech identification (POS Tagging) is applied. It is important not to perform stemming or term normalization at this stage, since for part-of-speech identification the semantic meaning of the original text matters, as do character case, punctuation marks, etc. As a result, each term is recorded as a pair with its identified part of speech.
2. Each term in the text is assigned its weight and reduced to normal form (lowercased; stemmed for complex languages).
3. On the basis of the information received, concepts are determined in the text and, based on language rules, relationships between the concepts are added.
4. A semantic network is built.
Steps 1–2 are presented in Figure 5. The terms are shown in normalized form. The blue columns in the picture identify the concepts, namely nouns and their corresponding adjectives (ink, Kyrgyz republic). Red bars represent verbs (helps, is using); for convenience, let us call them functional terms. Green represents the auxiliary parts of speech (the, of, in, to).
Figure 5 shows that functional terms are almost always located between two non-functional terms, i.e. concepts. Thanks to the identification of the parts of speech, it is possible to build the relationships between concepts from the given information. For example, in Figure 5 two such relationships are clearly observed: «ink helps drive democracy» and «former soviet republic is using invisible ink». Thus, the algorithm picks out three concepts — «(invisible) ink», «drive democracy», «(former soviet) republic» — as well as the relationships expressed by the two functional terms «helps» and «is using».
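The grouping of tagged terms into concepts and functional terms can be sketched as follows. This is a deliberately simplified illustration: a tiny hard-coded POS lexicon stands in for a real tagger (the article's method uses a proper POS Tagging algorithm), and the pairing rule assumes concepts and functional terms alternate, which is only one of the authors' language rules.

```python
# A toy POS lexicon stands in for a real tagger (e.g. a neural model).
POS = {
    "ink": "NOUN", "democracy": "NOUN", "republic": "NOUN",
    "invisible": "ADJ", "former": "ADJ", "soviet": "ADJ",
    "helps": "VERB", "drive": "VERB", "is": "VERB", "using": "VERB",
    "the": "DET", "a": "DET",
}

def extract_relations(tokens):
    """Group adjective+noun runs into concepts and verb runs into
    functional terms; each functional term then links the concepts
    on either side of it."""
    concepts, verbs = [], []
    current, kind = [], None
    for tok in tokens + ["."]:        # sentinel flushes the last run
        tag = POS.get(tok, "OTHER")
        k = ("C" if tag in ("ADJ", "NOUN")
             else "V" if tag == "VERB" else None)
        if k != kind and current:     # a run of one kind just ended
            (concepts if kind == "C" else verbs).append(" ".join(current))
            current = []
        if k:
            current.append(tok)
        kind = k
    # Assuming alternation C, V, C, ...: verb i relates concepts i, i+1.
    return [(concepts[i], v, concepts[i + 1])
            for i, v in enumerate(verbs) if i + 1 < len(concepts)]

tokens = "former soviet republic is using invisible ink".split()
triples = extract_relations(tokens)
# -> [("former soviet republic", "is using", "invisible ink")]
```

Even this toy version recovers the relationship discussed above without any pre-defined dictionary: the part-of-speech tags alone separate concepts from the functional terms that connect them.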
This idea of selecting functional terms is somewhat similar to the idea used in the InterSystems iKnow technology, but it is not identical. In contrast with InterSystems iKnow, the complex method uses POS Tagging to identify such terms, rather than pre-defined dictionaries. In turn, the POS Tagging implementation may vary: it can be specially modified for a particular task, or be one in which a neural network is applied, which can greatly improve the accuracy of the output, even when new words are added to the language or the text contains mistakes.
Algorithms for identifying parts of speech can now recognize parts of speech with more than 97% accuracy [5], which is an acceptable value for constructing semantic networks on top of them.
To review and search for «neighboring» concepts and relationships, as in the horizontal visibility graph method, only those terms are taken whose relative score is not less than a certain threshold. To construct the graph shown in Figure 6, a value of 0.25, or 25%, was used. Lowering this value adds less relevant concepts to the graph but, in turn, increases the number of explicit relations in it and builds logical chains, which may lead to the discovery of new logical sequences in the text.
Fig. 5. Evaluated words and identified parts of speech

Unlike the horizontal search used with the horizontal visibility graph, the complex method provides a comprehensive breadth-first search (an expanding-scope search), starting from a term and moving in both directions. An important advantage of this approach is that the graph also includes those entities whose weight is relatively low, but only those that have a certain relationship with a highly weighted concept through a functional word. For example, in the bar chart shown in Figure 5, the concept «drive democracy» has a rating of less than 25% but, as is evident from Figure 6, it is present in the graph, because it is associated with the more highly rated concept «ink».
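This expansion can be sketched as a breadth-first pass that seeds the network with the highly weighted concepts and then pulls in any low-weight concept reachable from them through a functional-term relation. The relation triples and weights below are illustrative, echoing the «ink» example above.

```python
from collections import deque

def expand_concepts(relations, weights, threshold=0.25):
    """Keep concepts whose relative weight meets the threshold, then
    add every concept reachable from a kept one through a relation,
    even if its own weight falls below the threshold."""
    adjacency = {}
    for subj, _verb, obj in relations:
        adjacency.setdefault(subj, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subj)
    selected = {c for c, w in weights.items() if w >= threshold}
    queue = deque(selected)
    while queue:
        for neighbor in adjacency.get(queue.popleft(), ()):
            if neighbor not in selected:
                selected.add(neighbor)
                queue.append(neighbor)
    return selected

relations = [("ink", "helps", "drive democracy"),
             ("republic", "is using", "ink")]
weights = {"ink": 0.9, "republic": 0.4, "drive democracy": 0.1}
kept = expand_concepts(relations, weights)
# "drive democracy" scores below 0.25 but is kept via its link to "ink".
```

This is the behavior that distinguishes the complex method from plain threshold filtering: relevance propagates through relations rather than being decided term by term.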
As a result of this search over concepts, starting from functional terms, natural relationships between the concepts are formed in the semantic network, which do not distort the meaning of the original sentence. Moreover, identical concepts become combined with many other concepts through a variety of relationships, which creates the possibility of treating the knowledge that has been put into the text in a new way, without further distortion.
Figure 6 shows the results of the complex text analysis method applied to the BBC article «Ink helps drive democracy in Asia» [2]. Note that the obtained knowledge graph was not manually modified (except for the visual arrangement of the elements on the graph, which was performed using the Gephi visualization tool); that is, it was built in a fully automatic way.
To provide a broader example of the developed complex knowledge-graph building algorithm, we built the knowledge graph of another article, «Turkey turns on the economic charm», published by BBC at approximately the same time as «Ink helps drive democracy in Asia». The second knowledge graph is presented in Figure 7.
Moreover, to demonstrate that the retrieved data is accumulated and properly related when the same algorithm is run on multiple articles, we built a single knowledge graph from the two BBC articles mentioned above. The aggregated result is presented in Figure 8.
The joining of multiple graphs into a single graph, which is presented in Figure 8, can be improved in the future by filtering out even less relevant entities. For example, a set of texts may have many words in common; these common words can be taken into account when weighting concepts, by extending the TF-IDF metric with new parameters and hence cleaning up the knowledge graph.
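The merging of per-article networks into a single one, as done for Figure 8, can be sketched as summing the weights of identical relation triples. This is a simplification under the assumption that each network is stored as a mapping from (subject, verb, object) triples to weights; the published implementation [10] may aggregate differently.

```python
def merge_networks(*networks):
    """Merge several semantic networks, summing the weights of
    relations that name the same (subject, verb, object) triple."""
    merged = {}
    for net in networks:
        for triple, weight in net.items():
            merged[triple] = merged.get(triple, 0) + weight
    return merged

# Two toy per-article networks sharing one relation:
g1 = {("ink", "helps", "democracy"): 2, ("voter", "uses", "ink"): 1}
g2 = {("ink", "helps", "democracy"): 1, ("turkey", "turns on", "charm"): 3}
combined = merge_networks(g1, g2)
# The shared triple accumulates weight 3; the rest carry over as-is.
```

Relations confirmed by several texts thus gain weight in the combined graph, which is what would make the cross-document filtering proposed above possible.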
The algorithm, its implementation and the source code are published online [10], where one can find detailed instructions on how to reproduce the presented results and use this algorithm to generate semantic knowledge networks for new texts.
Fig. 6. A semantic network constructed using the complex method
Fig. 7. A semantic network built from another BBC article
Fig. 8. A combined semantic network built from both articles

Conclusion. The complex method of semantic network construction combines all the advantages of the methods described in this article and does not inherit their drawbacks. With the help of the complex method, one can build semantic networks (knowledge graphs) from any number of texts in a fully automatic mode, without the need for system experts. As a result of data extraction from the source text, the most relevant information, presented as a knowledge graph, can be used in the future, e.g., for the development of automatic intelligent analysis of arbitrary text data.
The designed information extraction algorithm, which considers only the most relevant information in the given texts, is flexible and easily extendable. With just a few basic rules for the language of the text, the complex algorithm covers, on average, more than 50% of all entities in an English text. With each added rule this percentage increases and could theoretically reach 100%, which is also reflected in the accuracy of part-of-speech recognition algorithms. By developing a set of rules for a particular language, the method can be widely applied to any texts: it is not limited to technical literature, and extends even to individual texts written in a particular style of information representation.
REFERENCES
1. Savchenko M.M., Kriachok O.S. Using models of knowledge for the analysis of unstructured text. Modern problems of scientific support for Energy: Proc. XV Int. scientific-practical conf. of graduate students, undergraduates and students, Kiev, Apr. 25–28, 2017. Igor Sikorsky KPI, Vol. 2, P. 121.
2. Mikosz D. Ink helps drive democracy in Asia. BBC News, 2005. Available at: http://news.bbc.co.uk/2/hi/technology/4276125.stm.
3. Lande D.V. Building of Networks of Natural Hierarchies of Terms Based on Analysis of Texts Corpora. arXiv preprint arXiv:1405.6068.
4. Lande D.V., Snarskii A.A., Yagunova E.V., Pronoza E.V. The Use of Horizontal Visibility Graphs to Identify the Words that Define the Informational Structure of a Text. 12th Mexican International Conference on Artificial Intelligence, 2013, P. 209–215.
5. Toutanova K., Klein D., Manning C.D., Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency network. Proc. NAACL 2003, P. 252–258.
6. Lande D.V., Snarskii A.A., Bezsudnov I.V. Internetika. Navigation in complex networks. Moscow: "LIBROKOM" Book House, 2009. 264 p.
7. Van Hyfte D., Bouzinier M., Tsatsulin M., Richards S., Lee K., Almond C. Mining medical texts for cancer intelligence using iKnow. Cancer Outcomes Conference, 2013.
8. Bronselaer A., De Tré G. Concept-Relational Text Clustering. International Journal of Intelligent Systems, 2012, vol. 27, P. 970–993.
9. De Boe B., Bouzinier M., Van Hyfte D. Extending the PMML Text Model for Text Categorization. PMML Workshop @ KDD 2013, August 2013, Chicago, Illinois, USA.
10. Savchenko M. Auto Semantic Knowledge Network Builder. 2017, https://github.com/ZitRos/edu-semantic-knowledge-network-auto-builder.
Received 21.02.2018
M.M. Savchenko, student (master's degree), zitros.lab@gmail.com
O.S. Kriachok, PhD (Eng.), Associate Professor, alexandrkriachok@gmail.com
National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute",
37 Peremohy Ave., build. 5, Kyiv, 03056, Ukraine
AUTOMATIC CONSTRUCTION OF SEMANTIC KNOWLEDGE NETWORKS FROM UNSTRUCTURED TEXTS
Introduction. Because of the enormous amount of unstructured textual information in the modern information space, there is a need to search for, extract, formalize, and process the most essential knowledge that authors put into texts. Such knowledge may comprise the concepts presented in documents and the characteristic relations between them. Each text in any corpus carries a certain unique meaning specific to that text alone. A topical task is to develop an algorithmic and software framework that processes only the most meaningful part of texts and extracts from it knowledge relevant to the given context.
Purpose. To create an algorithmic and software framework for constructing semantic knowledge networks from the information most relevant to the context of documents.
M. Savchenko, O. Kriachok
78 ISSN 0130-5395, Control systems and computers, 2018, № 1
Methods. A comprehensive methodology, an algorithm, and its implementation are proposed for constructing a semantic knowledge network from the most significant information in given texts. The proposed comprehensive algorithm combines several algorithms based on neural networks and statistical analysis. Their combination makes it possible to recognize concepts in a text, find relations between them, and determine which concepts should be included in the resulting semantic network by estimating their weights.
Result. A large text corpus totalling about one million words was analyzed. Based on the collected information, the developed algorithm and a recursive natural language grammar were used to build a semantic knowledge network for several texts, as well as a separate merged semantic knowledge network. The drawbacks and advantages of the developed algorithm were compared with those of several existing approaches to knowledge extraction from texts. The obtained results are demonstrated.
Conclusion. The comprehensive method for constructing semantic networks combines all the advantages of the methods described in the article without inheriting their main drawbacks. It can build semantic networks (knowledge graphs) from texts in a fully automatic mode, without expert intervention. The core, most relevant information extracted from texts and represented as a knowledge graph can be used further, for example, to develop systems for the automatic intelligent analysis of arbitrary text data.
Keywords: construction of semantic networks, knowledge extraction, knowledge models, natural language processing