Metric and algorithm for similarity between two temporal event sequences calculation
In several data analysis applications on temporal events flows, the problem of measuring "similarity" of these sequences arises. There are many different definitions of event sequences but in this paper under the term "sequence of events" the ordered array of the event occurrence...
Збережено в:
Дата: | 2017 |
---|---|
Автор: | |
Формат: | Стаття |
Мова: | English |
Опубліковано: |
Навчально-науковий комплекс "Інститут прикладного системного аналізу" НТУУ "КПІ" МОН та НАН України
2017
|
Назва видання: | Системні дослідження та інформаційні технології |
Теми: | |
Онлайн доступ: | http://dspace.nbuv.gov.ua/handle/123456789/151183 |
Теги: |
Додати тег
Немає тегів, Будьте першим, хто поставить тег для цього запису!
|
Назва журналу: | Digital Library of Periodicals of National Academy of Sciences of Ukraine |
Цитувати: | Metric and algorithm for similarity between two temporal event sequences calculation / S. Nikolaiev // Системні дослідження та інформаційні технології. — 2017. — № 3. — С. 127-135. — Бібліогр.: 3 назв. — англ. |
Репозитарії
Digital Library of Periodicals of National Academy of Sciences of Ukraineid |
irk-123456789-151183 |
---|---|
record_format |
dspace |
spelling |
irk-123456789-1511832019-04-26T01:25:29Z Metric and algorithm for similarity between two temporal event sequences calculation Nikolaiev, S. Математичні методи, моделі, проблеми і технології дослідження складних систем In several data analysis applications on temporal events flows, the problem of measuring "similarity" of these sequences arises. There are many different definitions of event sequences but in this paper under the term "sequence of events" the ordered array of the event occurrence times will be understood. In this paper the metric and procedure for similarity calculation between two ordered event sequences is presented. The procedure as the output returns measure of two event flows similarity and set of corresponding indices pairs which represent the mapping of the events between the input sequences. У деяких програмах з аналізу часових потоків подій виникає проблема вимірювання "подібності" послідовностей цих подій. У роботі під терміном "послідовність подій" розуміється упорядкований скінченний масив моментів виникнення події в часі. Подано метрику й алгоритм обчислення подібності між двома упорядкованими послідовностями подій. Описано процедуру, за якою розраховуються міра подібності двох потоків подій і набір пар індексів, що показують відповідні моменти виникнення події в кожній з цих послідовностей. В некоторых программах по анализу временных потоков событий возникает проблема измерения "сходства" последовательностей этих событий. В работе под термином "последовательность событий" понимается упорядоченный конечномерный массив моментов возникновения события во времени. Представлены метрика и алгоритм вычисления подобия между двумя упорядоченными последовательностями событий. Описана процедура, по которой рассчитываются мера сходства двух потоков событий и набор пар индексов, показывающих соответствующие моменты возникновения события в каждой из этих последовательностей. 2017 Article Metric and algorithm for similarity between two temporal event sequences calculation / S. Nikolaiev // Системні дослідження та інформаційні технології. — 2017. — № 3. — С. 127-135. — Бібліогр.: 3 назв. — англ. 1681–6048 DOI: 10.20535/SRIT.2308-8893.2017.3.12 http://dspace.nbuv.gov.ua/handle/123456789/151183 004.021, 004.052 en Системні дослідження та інформаційні технології Навчально-науковий комплекс "Інститут прикладного системного аналізу" НТУУ "КПІ" МОН та НАН України |
institution |
Digital Library of Periodicals of National Academy of Sciences of Ukraine |
collection |
DSpace DC |
language |
English |
topic |
Математичні методи, моделі, проблеми і технології дослідження складних систем Математичні методи, моделі, проблеми і технології дослідження складних систем |
spellingShingle |
Математичні методи, моделі, проблеми і технології дослідження складних систем Математичні методи, моделі, проблеми і технології дослідження складних систем Nikolaiev, S. Metric and algorithm for similarity between two temporal event sequences calculation Системні дослідження та інформаційні технології |
description |
In several data analysis applications on temporal events flows, the problem of measuring "similarity" of these sequences arises. There are many different definitions of event sequences but in this paper under the term "sequence of events" the ordered array of the event occurrence times will be understood. In this paper the metric and procedure for similarity calculation between two ordered event sequences is presented. The procedure as the output returns measure of two event flows similarity and set of corresponding indices pairs which represent the mapping of the events between the input sequences. |
format |
Article |
author |
Nikolaiev, S. |
author_facet |
Nikolaiev, S. |
author_sort |
Nikolaiev, S. |
title |
Metric and algorithm for similarity between two temporal event sequences calculation |
title_short |
Metric and algorithm for similarity between two temporal event sequences calculation |
title_full |
Metric and algorithm for similarity between two temporal event sequences calculation |
title_fullStr |
Metric and algorithm for similarity between two temporal event sequences calculation |
title_full_unstemmed |
Metric and algorithm for similarity between two temporal event sequences calculation |
title_sort |
metric and algorithm for similarity between two temporal event sequences calculation |
publisher |
Навчально-науковий комплекс "Інститут прикладного системного аналізу" НТУУ "КПІ" МОН та НАН України |
publishDate |
2017 |
topic_facet |
Математичні методи, моделі, проблеми і технології дослідження складних систем |
url |
http://dspace.nbuv.gov.ua/handle/123456789/151183 |
citation_txt |
Metric and algorithm for similarity between two temporal event sequences calculation / S. Nikolaiev // Системні дослідження та інформаційні технології. — 2017. — № 3. — С. 127-135. — Бібліогр.: 3 назв. — англ. |
series |
Системні дослідження та інформаційні технології |
work_keys_str_mv |
AT nikolaievs metricandalgorithmforsimilaritybetweentwotemporaleventsequencescalculation |
first_indexed |
2025-07-13T01:11:01Z |
last_indexed |
2025-07-13T01:11:01Z |
_version_ |
1837492144931078144 |
fulltext |
S. Nikolaiev, 2017
Системні дослідження та інформаційні технології, 2017, № 3 127
УДК 004.021, 004.052
DOI: 10.20535/SRIT.2308-8893.2017.3.12
METRIC AND ALGORITHM FOR SIMILARITY BETWEEN
TWO TEMPORAL EVENT SEQUENCES CALCULATION
S. NIKOLAIEV
Abstract. In several data analysis applications on temporal events flows, the prob-
lem of measuring “similarity” of these sequences arises. There are many different
definitions of event sequences but in this paper under the term “sequence of events”
the ordered array of the event occurrence times will be understood. In this paper the
metric and procedure for similarity calculation between two ordered event sequences
is presented. The procedure as the output returns measure of two event flows simi-
larity and set of corresponding indices pairs which represent the mapping of the
events between the input sequences.
Keywords: time stamped quasi periodic events; temporal event sequences similarity
metric; computing distance between two real arrays of different length; accuracy, re-
call, precision for temporal event sequences; algorithm for distance of two event
flows estimation.
INTRODUCTION
Temporal event data is at the forefront of today’s Big Data boom. Across the
world, sensors from IoT, telecommunication networks, medical equipment, gadg-
ets, video cameras are capturing and storing every aspect of our existence as se-
quences of time stamped events. And all these amounts of information with tem-
poral events series are analyzed and mined for different kind of knowledge by a
broad variety of specialists. Nowadays temporal sequence analysis is being used
across multiple domains starting with sociology [1] where individual human lives
are considered as sequences of events to be compared to pool of all known life-
trajectories of society, and finishing with healthcare applications where events are
extracted from quasi periodic bio-signals.
In all these domains the task of measuring similarity between two temporal
event sequences of different length arises. This problem often is named finding
similar situations in events flows.
More formally, the problem of finding similar situations can be defined as
follows. Given a time t and a window width w, find another time s such that the
windows of the sequence occurring in intervals ),( twt and ),( sws are simi-
lar. The similarity between two slices can be defined using an edit distance notion
[2], i.e., the distance is defined as the cost of the cheapest possible sequence of
operations that transforms one slice to another. Usually the operations are inser-
tion and deletion of an event and moving an event in time; each operation has an
associated cost. The edit distance can be computed using a dynamic programming
algorithm, but the computation is slow. Furthermore, assigning costs to the edit
operations is quite problematic [3].
S. Nikolaiev
ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 128
This works well for applications with multiple types of events where subse-
quences can be distinguished by type of the events. But if we have single-event-
type domain then approach of finding longest common subsequence becomes not
viable because event flows are represented as real-valued times series. So in gen-
eral case we find zero or a few “common” elements between two sequences using
direct comparison despite there are a lot of intuitively “close” elements. Single-
type quasi periodic event temporal sequences comparison also begets problem of
quasi periodic error function which complicates finding global error minimum
and as the result computing the time shift between the sequences.
In domains like healthcare the original sequence of bio-signals events some-
times cannot be observed or measured directly because of different factors so of-
ten approximation models are used instead. Proposed similarity metric is needed,
e.g., for building and tuning parametric models that produce as output an ap-
proximation of some temporal process in the form of event flow. It is supposed
that the etalon sequence of events, generated by the process, can be measured
only during experimental phase and therefore the problem of the model tuning
can be formulated in terms of supervised learning. The specific of this point of
view is that the target sequence and the output of the model are two finite real
series of not equal length. To run any optimization procedures on two arrays of
different length, metric is needed to calculate similarity between these arrays.
Also taking into the account the fact that approximation procedures may work for
millions of iterations on some domains and model simulation can take a lot of
time, proposed similarity metric should be calculated in linear time from the
length of the input arrays.
Further in this paper the metric and the algorithm for computing similarity of
two single-type quasi periodic event temporal sequences is proposed. This algo-
rithm and the similarity metric were developed for optimization and supervised
learning tasks where target variable is a temporal array of not fixed length.
PROBLEM FORMULATION
Suppose we have a process P that generates one type of event with non-constant
interval. All event occurrences in some predefined time diapason can be put into
ordered sequence of time stamps.
Two sequences of events A and B are given:
,}{ ..1 , KiiatA MiibtB ..1 , }{ ,
where iat , is time of thi event occurrence in A and , b it is time of thi event occur-
rence in B. K and M correspond to lengths of A and B .
It is known that the finite length sequence A is the target sequence of events
obtained by the direct event detection from the process during experiment and
times iat , in the sequence have zero lag.
A method exists that gets as input some derivative signal generated by the
process P and makes approximations of iat , The sequence B is obtained as an
output of this parametric method. The lengths of A and B usually do not match
Metric and algorithm for similarity between two temporal event sequences calculation
Системні дослідження та інформаційні технології, 2017, № 3 129
and for the same event its occurrence times , iat and , ibt in A and B are shifted
for varying value.
The goal is to create metric and algorithm for A and B similarity evalua-
tion.
METRIC REQUIREMENTS
The metric equals to one for the zero-error method. In this case B sequence
should coincide with the sequence A :
1) length of A equals to length of B ;
2) iat , ibt , for all i in I .
If the metric is less than one but greater than zero then either lengths of the
sequences do not match or some of times ibt , differ from iat , . The lower the
similarity between A and B , the smaller metric value.
The metric value cannot be lower than zero or be greater than one.
DEFINITIONS
Let's call the event Bz “normal” if it “matches” to the event Az . We denote
the set of normal events through Z.
The event Az is called “missed” in B , if this event is not present in B .
The event Bz is called “odd” (erroneous or false alarm), if this event is
not present in A .
In general MK . This means that there are either odd events or missed
ones. It is also known that the times of the same event occurrence iat , and ibt , in
the sequences A and B may be shifted by a random value i elative to each
other (Fig. 1 below)
ibiai tt , , ; iibia tt ,, .
Example of the sequences is shown in the fig. 1 below:
Fig 1. Example of event sequences A and B
Odd event Missed event
S. Nikolaiev
ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 130
In the sequence B the second and penultimate events are odd ones. 4z
event from A is missing in B .
DEFINITION OF "MATCHING EVENT" IN A AND B
It is necessary to introduce the concept of matching events ztt jbia , , ~ in two
given sequences of real numbers A and B where A is the target sequence of
events.
The event jbt , is called matching (~) to the event zt ia , if:
a) jbt , has a minimum distance i to the event , a it by comparison with the
distances to , 1a it and to , 1a it , and
b) there are no other events kbt , for which the distance to , iat would be less
than i .
if ,minarg , ,Card..1є jbiaAi tti and
kbiaBkjbiai tttt , ,Card..1є , , min .
then , , ~ iiajb ztt
If there are two events , jbt and 1, jbt with equal distances to the event iat , ,
then the event jbt , will be considered as matching to event iat , .
The fig. 2 shows how target events from sequence A match to approximated
events in sequence B . White and shaded areas indicate membership time
intervals that correspond to target events iz from A . All events from B that are
present in the time interval corresponding to event iz are considered to be
candidates for normal event.
Odd event Missed event
Fig 2. Correspondence of events in A to events in sequence B
Metric and algorithm for similarity between two temporal event sequences calculation
Системні дослідження та інформаційні технології, 2017, № 3 131
DERIVATION OF SIMILARITY METRIC FOR EVENT SEQUENCES
Let MK and assume that B has no pairs of false events identifications of dif-
ferent types (for example, odd and missed event at the same time) in the entire
interval under consideration so that A and B contain only normal events.
In this case, A and B can be treated as two finite real vectors with the same
dimensions, where the thi element in each of the vectors corresponds to the thi
event. For two real valued vectors with equal lengths a lot of distance and
(dis)similarity metrics are available. As an example generalized distance metric
can be treated as error or dissimilarity metric:
m
k
ibia
M
i
tt
M
d , ,
1
1
.
Depending on m and k we obtain the formulas for calculating arithmetic
mean, average and standard deviation.
For normal events from B we define the root mean square error in the form of:
,
)(Card
2
є
Z
RMSE zZz
Z
where )(Card Z is the cardinality of the set Z and it shows the number of normal
events in Z . The smaller the RMSE metric value, the more accurate the method.
If MK then at least one missed or one odd event is present in B .
In this case false positives and event omissions should be separated, and then
metrics from the theory of classification can be applied. The difference between
known metrics and proposed metric is that the theory of classification usually re-
quires the presence of at least two classes of objects and the task is to determine
to which of these classes object belongs. But in our case, there is only one class of
the event and the method decides in which moments the event occurs so there
may be missed events or multiple detections of false alarms. This mainly happens
because the method uses indirect evidences to determine sequence B without
knowing anything about sequence A .
Suppose we know the true separation of events for B (i.e., we know which
of the events are normal, false positives and missing). Then the following metrics
can be applied to the sequences of events B and A :
K — total number of events in A ;
TP (true positives) )..0[ K — the number of normal (coincident) events in
B with respect to A . ( )(Card ZTP )
FP (false positives) )..0[ — the number of odd events in B (false pos-
itives detected by the method). Each event from A can have a lot of false posi-
tives in B . (Type I error);
FN (false negatives) )..0[ K — the number of missed events. (type II er-
ror);
Let Acc be the accuracy of finding the set of normal events:
]1,0 [
FNK
TP
FNFPTP
TP
Acc .
S. Nikolaiev
ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 132
The closer Acc to one, the less erroneous detections of events are present
in B .
Let R (Recall) be the percentage of normal events among all target events.
]1,0[
K
TP
FPTP
TP
R .
The closer this metric to one, the more precise method.
RAcc .
Summarizing the cases described above, we obtain the metric for evaluating
the quality of the method on the given two sequences A and B :
max]1,0[
1
Q
ZRMSE
Acc
.
where ) . . 0 [ is weight, which determines the prevalence of normal events
local shifts error ZRMSE over the accuracy of finding the set of normal
events Z . The higher the more important becomes precise localization of nor-
mal events from B in time. The concrete value of α is calculated for a particular
pair of sequences A and B , depending on objectives of the study.
The higher Q the better sequence B approximates the sequence A. This crite-
rion reaches its maximum value of one for two identical sequences of A and B .
ALGORITHM FOR THE Q METRIC CALCULATION
In previous chapter the formula for similarity metric Q between two sequences of
events A and B was introduces. The cornerstone of this metric is the algorithm
for defining the sets of normal, missed and odd events.
The basic idea of the algorithm is to find the right ( ) right,ib and left ( ) left, ib
borders or margins for each element , iat from A and in the resulting interval be-
tween these boundaries determine the nearest event jbt , from B. Each of the bor-
ders is calculated as the middle of the intervals between adjacent with , iat events
1, iat and 1, iat
10 ,
2
1 , ,
left,
Ki
tt
b iaia
i .
10 ,
2
1 , ,
right,
Ki
tt
b iaia
i
In case 0i instead of using not existent 1, at , left, 0b is passed to algorithm
as input parameter. The same applies to right,Kb .
This algorithm after performing calculation on given input sequences A and
B , returns three arrays: two arrays with indices of missed and odd events in B
and one array with pairs of indices of corresponding normal events. From the
third array i are easily calculated. Further the metric Q is calculated according
to the formula.
Metric and algorithm for similarity between two temporal event sequences calculation
Системні дослідження та інформаційні технології, 2017, № 3 133
Below the algorithm in Python is presented:
def detect_tp(a, b, left_margin=1, right_margin=1):
miss = [] # Declaration of an empty array
odd = []
ijs = []
j = 0
for i in range(len(a)):
b_candidates = []
l_marg, r_marg = get_margins(a, i, left_margin,
right_margin)
# collect odd js
while len(b) > j and b[j] < l_marg:
odd.append(j)
j + = 1
# collect candidate js
while len(b) > j and l_marg <= b[j] < r_marg:
b_candidates.append(j)
j += 1
# process collected js
if len(b_candidates) == 0:
miss.append(i)
elif len(b_candidates) == 1
ijs.append((i, b_candidates[0]))
else:
jj = find_best_j_match(b_candidates, a[i], b)
for k in b_candidates:
if k != jj:
odd.append(k)
else:
ijs.append((i, jj))
# collect rest bs that are further than the right
margin of the last a[-1]
while len(b) > j:
odd.append(j)
j += 1
return ijs, miss, odd
Function detect_tp takes two arrays A and B , as well as the left and right
default borders as input parameters left_margin and right_margin for the first
and last events in the sequences. After completion, this function returns three se-
quences: ijs, miss and odd, where ijs is a sequence of pairs ),( ji of indices events
, At ia and Bt jb , where iiajb ztt ,, ~ .
This function also calls function get_margins that calculates margins for thi
event , At ia
def get_margins(a, i, left_margin, right_margin):
# calculate left margin
if i == 0 or len(a) <= 1:
if len(a) == 0:
S. Nikolaiev
ISSN 1681–6048 System Research & Information Technologies, 2017, № 3 134
l_marg = 0
else:
l_marg = a[i] - left_margin
else: # i>0
l_marg = (a[i] + a[i - 1]) / 2.0
# calculate right margin
if i == len(a) - 1 or len(a) <= 1:
if len(a) == 0:
r_marg = 0
else:
r_marg = a[i] + right_margin
else:
r_marg = (a[i] + a[i + 1]) / 2.0
return l_marg, r_marg
The function named find_best_j_match is designed for finding the best
candidate in array of proposed candidates b_candidates from B for given event
iat , . It returns the index for the closest event among b_candidates to a given
event av from A .
def find_best_j_match(b_candidates, av, b):
minj = b_candidates[0]
for k in b_candidates:
if abs(av - b[minj]) > abs(av - b[k]):
minj = k
return minj
CONCLUSION
In this paper the metric and the algorithm for computing similarity of two single-
type quasi periodic event temporal sequences was described. The procedure as
output returns sequence similarity metric and set of corresponding indices pairs of
given occurrences of the event showing its most probable location in each of input
arrays. The algorithm complexity is )( KMO where M and K are lengths of
input sequences.
Successful modification of existing approaches on dissimilarity metrics over
vectors with equal dimensions and their combination with classification metrics
for events can be concluded. This process as the result led to creation of single
metric for similarity calculation of different length temporal sequences.
The experiments show that the method is performing quite good at separat-
ing noisy event occurrences from normal event flow. The described metric Q for
calculating similarity between two event series of different length returns intui-
tively plausible results.
The algorithm for events classification into normal, odd and false positives
types works best when the time shift between input sequences is in range of one
period of event occurrence. The experiments conducted show that the shift can be
automatically minimized by iteratively applying the method of gradient descent
using developed metric. The developed metric can be used in single-event
Metric and algorithm for similarity between two temporal event sequences calculation
Системні дослідження та інформаційні технології, 2017, № 3 135
multiple-sequence analysis (MSA) applications including but not limited to opti-
mization and supervised-learning problems.
Further the author plans to research application of proposed metric and algo-
rithm for extracting heart-beats intervals from video.
REFERENCES
1. Pollock Gary. Holistic trajectories: a study of combined employment, housing and
family careers by using multiple-sequence analysis. Journal of the Royal Statisti-
cal Society, Series A (Statistics in Society), 170(1): P. 167–183, 2007.
2. Mannila H., Moen P. Similarity between event types in sequences, Proc. First Intl.
Conf. on Data Warehousing and Knowledge Discovery (DaWaK’99), Florence,
Italy, 1999, P. 271–280.
3. Moen P., Attribute, Event Sequence, and Event Type Similarity Notions for Data
Mining, Ph.D. thesis, Department of Computer Science, University of Helsinki,
Finland, 2000.
Received 29.11.2016
|