Metric and algorithm for similarity between two temporal event sequences calculation

In several data analysis applications on temporal events flows, the problem of measuring "similarity" of these sequences arises. There are many different definitions of event sequences but in this paper under the term "sequence of events" the ordered array of the event occurrence...

Повний опис

Збережено в:

Бібліографічні деталі
Дата:	2017
Автор:	Nikolaiev, S.
Формат:	Стаття
Мова:	English
Опубліковано:	Навчально-науковий комплекс "Інститут прикладного системного аналізу" НТУУ "КПІ" МОН та НАН України 2017
Назва видання:	Системні дослідження та інформаційні технології
Теми:	Математичні методи, моделі, проблеми і технології дослідження складних систем
Онлайн доступ:	http://dspace.nbuv.gov.ua/handle/123456789/151183
Теги:	Додати тег Немає тегів, Будьте першим, хто поставить тег для цього запису!
Назва журналу:	Digital Library of Periodicals of National Academy of Sciences of Ukraine
Цитувати:	Metric and algorithm for similarity between two temporal event sequences calculation / S. Nikolaiev // Системні дослідження та інформаційні технології. — 2017. — № 3. — С. 127-135. — Бібліогр.: 3 назв. — англ.

Репозитарії

Digital Library of Periodicals of National Academy of Sciences of Ukraine

id	irk-123456789-151183
record_format	dspace
spelling	irk-123456789-1511832019-04-26T01:25:29Z Metric and algorithm for similarity between two temporal event sequences calculation Nikolaiev, S. Математичні методи, моделі, проблеми і технології дослідження складних систем In several data analysis applications on temporal events flows, the problem of measuring "similarity" of these sequences arises. There are many different definitions of event sequences but in this paper under the term "sequence of events" the ordered array of the event occurrence times will be understood. In this paper the metric and procedure for similarity calculation between two ordered event sequences is presented. The procedure as the output returns measure of two event flows similarity and set of corresponding indices pairs which represent the mapping of the events between the input sequences. У деяких програмах з аналізу часових потоків подій виникає проблема вимірювання "подібності" послідовностей цих подій. У роботі під терміном "послідовність подій" розуміється упорядкований скінченний масив моментів виникнення події в часі. Подано метрику й алгоритм обчислення подібності між двома упорядкованими послідовностями подій. Описано процедуру, за якою розраховуються міра подібності двох потоків подій і набір пар індексів, що показують відповідні моменти виникнення події в кожній з цих послідовностей. В некоторых программах по анализу временных потоков событий возникает проблема измерения "сходства" последовательностей этих событий. В работе под термином "последовательность событий" понимается упорядоченный конечномерный массив моментов возникновения события во времени. Представлены метрика и алгоритм вычисления подобия между двумя упорядоченными последовательностями событий. Описана процедура, по которой рассчитываются мера сходства двух потоков событий и набор пар индексов, показывающих соответствующие моменты возникновения события в каждой из этих последовательностей. 2017 Article Metric and algorithm for similarity between two temporal event sequences calculation / S. Nikolaiev // Системні дослідження та інформаційні технології. — 2017. — № 3. — С. 127-135. — Бібліогр.: 3 назв. — англ. 1681–6048 DOI: 10.20535/SRIT.2308-8893.2017.3.12 http://dspace.nbuv.gov.ua/handle/123456789/151183 004.021, 004.052 en Системні дослідження та інформаційні технології Навчально-науковий комплекс "Інститут прикладного системного аналізу" НТУУ "КПІ" МОН та НАН України
institution	Digital Library of Periodicals of National Academy of Sciences of Ukraine
collection	DSpace DC
language	English
topic	Математичні методи, моделі, проблеми і технології дослідження складних систем Математичні методи, моделі, проблеми і технології дослідження складних систем
spellingShingle	Математичні методи, моделі, проблеми і технології дослідження складних систем Математичні методи, моделі, проблеми і технології дослідження складних систем Nikolaiev, S. Metric and algorithm for similarity between two temporal event sequences calculation Системні дослідження та інформаційні технології
description	In several data analysis applications on temporal events flows, the problem of measuring "similarity" of these sequences arises. There are many different definitions of event sequences but in this paper under the term "sequence of events" the ordered array of the event occurrence times will be understood. In this paper the metric and procedure for similarity calculation between two ordered event sequences is presented. The procedure as the output returns measure of two event flows similarity and set of corresponding indices pairs which represent the mapping of the events between the input sequences.
format	Article
author	Nikolaiev, S.
author_facet	Nikolaiev, S.
author_sort	Nikolaiev, S.
title	Metric and algorithm for similarity between two temporal event sequences calculation
title_short	Metric and algorithm for similarity between two temporal event sequences calculation
title_full	Metric and algorithm for similarity between two temporal event sequences calculation
title_fullStr	Metric and algorithm for similarity between two temporal event sequences calculation
title_full_unstemmed	Metric and algorithm for similarity between two temporal event sequences calculation
title_sort	metric and algorithm for similarity between two temporal event sequences calculation
publisher	Навчально-науковий комплекс "Інститут прикладного системного аналізу" НТУУ "КПІ" МОН та НАН України
publishDate	2017
topic_facet	Математичні методи, моделі, проблеми і технології дослідження складних систем
url	http://dspace.nbuv.gov.ua/handle/123456789/151183
citation_txt	Metric and algorithm for similarity between two temporal event sequences calculation / S. Nikolaiev // Системні дослідження та інформаційні технології. — 2017. — № 3. — С. 127-135. — Бібліогр.: 3 назв. — англ.
series	Системні дослідження та інформаційні технології
work_keys_str_mv	AT nikolaievs metricandalgorithmforsimilaritybetweentwotemporaleventsequencescalculation
first_indexed	2025-07-13T01:11:01Z
last_indexed	2025-07-13T01:11:01Z
_version_	1837492144931078144
fulltext	 S. Nikolaiev, 2017 Системні дослідження та інформаційні технології, 2017, № 3 127 УДК 004.021, 004.052 DOI: 10.20535/SRIT.2308-8893.2017.3.12 METRIC AND ALGORITHM FOR SIMILARITY BETWEEN TWO TEMPORAL EVENT SEQUENCES CALCULATION S. NIKOLAIEV Abstract. In several data analysis applications on temporal events flows, the prob- lem of measuring “similarity” of these sequences arises. There are many different definitions of event sequences but in this paper under the term “sequence of events” the ordered array of the event occurrence times will be understood. In this paper the metric and procedure for similarity calculation between two ordered event sequences is presented. The procedure as the output returns measure of two event flows simi- larity and set of corresponding indices pairs which represent the mapping of the events between the input sequences. Keywords: time stamped quasi periodic events; temporal event sequences similarity metric; computing distance between two real arrays of different length; accuracy, re- call, precision for temporal event sequences; algorithm for distance of two event flows estimation. INTRODUCTION Temporal event data is at the forefront of today’s Big Data boom. Across the world, sensors from IoT, telecommunication networks, medical equipment, gadg- ets, video cameras are capturing and storing every aspect of our existence as se- quences of time stamped events. And all these amounts of information with tem- poral events series are analyzed and mined for different kind of knowledge by a broad variety of specialists. Nowadays temporal sequence analysis is being used across multiple domains starting with sociology [1] where individual human lives are considered as sequences of events to be compared to pool of all known life- trajectories of society, and finishing with healthcare applications where events are extracted from quasi periodic bio-signals. In all these domains the task of measuring similarity between two temporal event sequences of different length arises. This problem often is named finding similar situations in events flows. More formally, the problem of finding similar situations can be defined as follows. Given a time t and a window width w, find another time s such that the windows of the sequence occurring in intervals ),( twt  and ),( sws  are simi- lar. The similarity between two slices can be defined using an edit distance notion [2], i.e., the distance is defined as the cost of the cheapest possible sequence of operations that transforms one slice to another. Usually the operations are inser- tion and deletion of an event and moving an event in time; each operation has an associated cost. The edit distance can be computed using a dynamic programming algorithm, but the computation is slow. Furthermore, assigning costs to the edit operations is quite problematic [3]. S. Nikolaiev ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 128 This works well for applications with multiple types of events where subse- quences can be distinguished by type of the events. But if we have single-event- type domain then approach of finding longest common subsequence becomes not viable because event flows are represented as real-valued times series. So in gen- eral case we find zero or a few “common” elements between two sequences using direct comparison despite there are a lot of intuitively “close” elements. Single- type quasi periodic event temporal sequences comparison also begets problem of quasi periodic error function which complicates finding global error minimum and as the result computing the time shift between the sequences. In domains like healthcare the original sequence of bio-signals events some- times cannot be observed or measured directly because of different factors so of- ten approximation models are used instead. Proposed similarity metric is needed, e.g., for building and tuning parametric models that produce as output an ap- proximation of some temporal process in the form of event flow. It is supposed that the etalon sequence of events, generated by the process, can be measured only during experimental phase and therefore the problem of the model tuning can be formulated in terms of supervised learning. The specific of this point of view is that the target sequence and the output of the model are two finite real series of not equal length. To run any optimization procedures on two arrays of different length, metric is needed to calculate similarity between these arrays. Also taking into the account the fact that approximation procedures may work for millions of iterations on some domains and model simulation can take a lot of time, proposed similarity metric should be calculated in linear time from the length of the input arrays. Further in this paper the metric and the algorithm for computing similarity of two single-type quasi periodic event temporal sequences is proposed. This algo- rithm and the similarity metric were developed for optimization and supervised learning tasks where target variable is a temporal array of not fixed length. PROBLEM FORMULATION Suppose we have a process P that generates one type of event with non-constant interval. All event occurrences in some predefined time diapason can be put into ordered sequence of time stamps. Two sequences of events A and B are given: ,}{ ..1 , KiiatA  MiibtB ..1 , }{  , where iat , is time of thi event occurrence in A and , b it is time of thi event occur- rence in B. K and M correspond to lengths of A and B . It is known that the finite length sequence A is the target sequence of events obtained by the direct event detection from the process during experiment and times iat , in the sequence have zero lag. A method exists that gets as input some derivative signal generated by the process P and makes approximations of iat , The sequence B is obtained as an output of this parametric method. The lengths of A and B usually do not match Metric and algorithm for similarity between two temporal event sequences calculation Системні дослідження та інформаційні технології, 2017, № 3 129 and for the same event its occurrence times , iat and , ibt in A and B are shifted for varying value. The goal is to create metric and algorithm for A and B similarity evalua- tion. METRIC REQUIREMENTS The metric equals to one for the zero-error method. In this case B sequence should coincide with the sequence A : 1) length of A equals to length of B ; 2) iat , ibt , for all i in I . If the metric is less than one but greater than zero then either lengths of the sequences do not match or some of times ibt , differ from iat , . The lower the similarity between A and B , the smaller metric value. The metric value cannot be lower than zero or be greater than one. DEFINITIONS Let's call the event Bz “normal” if it “matches” to the event Az . We denote the set of normal events through Z. The event Az is called “missed” in B , if this event is not present in B . The event Bz is called “odd” (erroneous or false alarm), if this event is not present in A . In general MK  . This means that there are either odd events or missed ones. It is also known that the times of the same event occurrence iat , and ibt , in the sequences A and B may be shifted by a random value i elative to each other (Fig. 1 below) ibiai tt , ,  ; iibia tt  ,, . Example of the sequences is shown in the fig. 1 below: Fig 1. Example of event sequences A and B Odd event Missed event S. Nikolaiev ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 130 In the sequence B the second and penultimate events are odd ones. 4z event from A is missing in B . DEFINITION OF "MATCHING EVENT" IN A AND B It is necessary to introduce the concept of matching events ztt jbia , , ~ in two given sequences of real numbers A and B where A is the target sequence of events. The event jbt , is called matching (~) to the event zt ia , if: a) jbt , has a minimum distance i to the event , a it by comparison with the distances to , 1a it  and to , 1a it  , and b) there are no other events kbt , for which the distance to , iat would be less than i . if    ,minarg , ,Card..1є jbiaAi tti  and    kbiaBkjbiai tttt , ,Card..1є , , min  . then , , ~ iiajb ztt  If there are two events , jbt and 1, jbt with equal distances to the event iat , , then the event jbt , will be considered as matching to event iat , . The fig. 2 shows how target events from sequence A match to approximated events in sequence B . White and shaded areas indicate membership time intervals that correspond to target events iz from A . All events from B that are present in the time interval corresponding to event iz are considered to be candidates for normal event. Odd event Missed event Fig 2. Correspondence of events in A to events in sequence B Metric and algorithm for similarity between two temporal event sequences calculation Системні дослідження та інформаційні технології, 2017, № 3 131 DERIVATION OF SIMILARITY METRIC FOR EVENT SEQUENCES Let MK  and assume that B has no pairs of false events identifications of dif- ferent types (for example, odd and missed event at the same time) in the entire interval under consideration so that A and B contain only normal events. In this case, A and B can be treated as two finite real vectors with the same dimensions, where the thi element in each of the vectors corresponds to the thi event. For two real valued vectors with equal lengths a lot of distance and (dis)similarity metrics are available. As an example generalized distance metric can be treated as error or dissimilarity metric: m k ibia M i tt M d , , 1 1    . Depending on m and k we obtain the formulas for calculating arithmetic mean, average and standard deviation. For normal events from B we define the root mean square error in the form of: , )(Card 2 є Z RMSE zZz Z    where )(Card Z is the cardinality of the set Z and it shows the number of normal events in Z . The smaller the RMSE metric value, the more accurate the method. If MK  then at least one missed or one odd event is present in B . In this case false positives and event omissions should be separated, and then metrics from the theory of classification can be applied. The difference between known metrics and proposed metric is that the theory of classification usually re- quires the presence of at least two classes of objects and the task is to determine to which of these classes object belongs. But in our case, there is only one class of the event and the method decides in which moments the event occurs so there may be missed events or multiple detections of false alarms. This mainly happens because the method uses indirect evidences to determine sequence B without knowing anything about sequence A . Suppose we know the true separation of events for B (i.e., we know which of the events are normal, false positives and missing). Then the following metrics can be applied to the sequences of events B and A : K — total number of events in A ; TP (true positives) )..0[ K — the number of normal (coincident) events in B with respect to A . ( )(Card ZTP  ) FP (false positives) )..0[  — the number of odd events in B (false pos- itives detected by the method). Each event from A can have a lot of false posi- tives in B . (Type I error); FN (false negatives) )..0[ K — the number of missed events. (type II er- ror); Let Acc be the accuracy of finding the set of normal events: ]1,0 [      FNK TP FNFPTP TP Acc . S. Nikolaiev ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 132 The closer Acc to one, the less erroneous detections of events are present in B . Let R (Recall) be the percentage of normal events among all target events. ]1,0[    K TP FPTP TP R . The closer this metric to one, the more precise method. RAcc  . Summarizing the cases described above, we obtain the metric for evaluating the quality of the method on the given two sequences A and B : max]1,0[ 1 Q    ZRMSE Acc . where ) . . 0 [  is weight, which determines the prevalence of normal events local shifts error ZRMSE over the accuracy of finding the set of normal events Z . The higher  the more important becomes precise localization of nor- mal events from B in time. The concrete value of α is calculated for a particular pair of sequences A and B , depending on objectives of the study. The higher Q the better sequence B approximates the sequence A. This crite- rion reaches its maximum value of one for two identical sequences of A and B . ALGORITHM FOR THE Q METRIC CALCULATION In previous chapter the formula for similarity metric Q between two sequences of events A and B was introduces. The cornerstone of this metric is the algorithm for defining the sets of normal, missed and odd events. The basic idea of the algorithm is to find the right ( ) right,ib and left ( ) left, ib borders or margins for each element , iat from A and in the resulting interval be- tween these boundaries determine the nearest event jbt , from B. Each of the bor- ders is calculated as the middle of the intervals between adjacent with , iat events 1, iat and 1, iat 10 , 2 1 , , left,     Ki tt b iaia i . 10 , 2 1 , , right,     Ki tt b iaia i In case 0i  instead of using not existent 1, at , left, 0b is passed to algorithm as input parameter. The same applies to right,Kb . This algorithm after performing calculation on given input sequences A and B , returns three arrays: two arrays with indices of missed and odd events in B and one array with pairs of indices of corresponding normal events. From the third array i are easily calculated. Further the metric Q is calculated according to the formula. Metric and algorithm for similarity between two temporal event sequences calculation Системні дослідження та інформаційні технології, 2017, № 3 133 Below the algorithm in Python is presented: def detect_tp(a, b, left_margin=1, right_margin=1): miss = [] # Declaration of an empty array odd = [] ijs = [] j = 0 for i in range(len(a)): b_candidates = [] l_marg, r_marg = get_margins(a, i, left_margin, right_margin) # collect odd js while len(b) > j and b[j] < l_marg: odd.append(j) j + = 1 # collect candidate js while len(b) > j and l_marg <= b[j] < r_marg: b_candidates.append(j) j += 1 # process collected js if len(b_candidates) == 0: miss.append(i) elif len(b_candidates) == 1 ijs.append((i, b_candidates[0])) else: jj = find_best_j_match(b_candidates, a[i], b) for k in b_candidates: if k != jj: odd.append(k) else: ijs.append((i, jj)) # collect rest bs that are further than the right margin of the last a[-1] while len(b) > j: odd.append(j) j += 1 return ijs, miss, odd Function detect_tp takes two arrays A and B , as well as the left and right default borders as input parameters left_margin and right_margin for the first and last events in the sequences. After completion, this function returns three se- quences: ijs, miss and odd, where ijs is a sequence of pairs ),( ji of indices events , At ia  and Bt jb ,  where iiajb ztt  ,, ~ . This function also calls function get_margins that calculates margins for thi event , At ia  def get_margins(a, i, left_margin, right_margin): # calculate left margin if i == 0 or len(a) <= 1: if len(a) == 0: S. Nikolaiev ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 134 l_marg = 0 else: l_marg = a[i] - left_margin else: # i>0 l_marg = (a[i] + a[i - 1]) / 2.0 # calculate right margin if i == len(a) - 1 or len(a) <= 1: if len(a) == 0: r_marg = 0 else: r_marg = a[i] + right_margin else: r_marg = (a[i] + a[i + 1]) / 2.0 return l_marg, r_marg The function named find_best_j_match is designed for finding the best candidate in array of proposed candidates b_candidates from B for given event iat , . It returns the index for the closest event among b_candidates to a given event av from A . def find_best_j_match(b_candidates, av, b): minj = b_candidates[0] for k in b_candidates: if abs(av - b[minj]) > abs(av - b[k]): minj = k return minj CONCLUSION In this paper the metric and the algorithm for computing similarity of two single- type quasi periodic event temporal sequences was described. The procedure as output returns sequence similarity metric and set of corresponding indices pairs of given occurrences of the event showing its most probable location in each of input arrays. The algorithm complexity is )( KMO  where M and K are lengths of input sequences. Successful modification of existing approaches on dissimilarity metrics over vectors with equal dimensions and their combination with classification metrics for events can be concluded. This process as the result led to creation of single metric for similarity calculation of different length temporal sequences. The experiments show that the method is performing quite good at separat- ing noisy event occurrences from normal event flow. The described metric Q for calculating similarity between two event series of different length returns intui- tively plausible results. The algorithm for events classification into normal, odd and false positives types works best when the time shift between input sequences is in range of one period of event occurrence. The experiments conducted show that the shift can be automatically minimized by iteratively applying the method of gradient descent using developed metric. The developed metric can be used in single-event Metric and algorithm for similarity between two temporal event sequences calculation Системні дослідження та інформаційні технології, 2017, № 3 135 multiple-sequence analysis (MSA) applications including but not limited to opti- mization and supervised-learning problems. Further the author plans to research application of proposed metric and algo- rithm for extracting heart-beats intervals from video. REFERENCES 1. Pollock Gary. Holistic trajectories: a study of combined employment, housing and family careers by using multiple-sequence analysis. Journal of the Royal Statisti- cal Society, Series A (Statistics in Society), 170(1): P. 167–183, 2007. 2. Mannila H., Moen P. Similarity between event types in sequences, Proc. First Intl. Conf. on Data Warehousing and Knowledge Discovery (DaWaK’99), Florence, Italy, 1999, P. 271–280. 3. Moen P., Attribute, Event Sequence, and Event Type Similarity Notions for Data Mining, Ph.D. thesis, Department of Computer Science, University of Helsinki, Finland, 2000. Received 29.11.2016

Metric and algorithm for similarity between two temporal event sequences calculation

Репозитарії

Схожі ресурси