Discriminant Functions Quality Estimation on the Basis of Training and Testing Samples
Saved in:
Date: 2012
Authors: A. Sarychev, L. Sarycheva
Format: Article
Language: English
Published by: International Research and Training Centre for Information Technologies and Systems of the NAS and MES of Ukraine (Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України), 2012
Series: Індуктивне моделювання складних систем
Online access: http://dspace.nbuv.gov.ua/handle/123456789/45954
Repository: Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite as: Discriminant Functions Quality Estimation on the Basis of Training and Testing Samples / A. Sarychev, L. Sarycheva // Індуктивне моделювання складних систем: Зб. наук. пр. — К.: МННЦ ІТС НАН та МОН України, 2012. — Вип. 4. — С. 21-27. — Бібліогр.: 7 назв. — англ.
Sarychev A.P., Sarycheva L.V.
Індуктивне моделювання складних систем, випуск 4, 2012 21
УДК 519.25
DISCRIMINANT FUNCTIONS QUALITY ESTIMATION
ON THE BASIS OF TRAINING AND TESTING SAMPLES
Alexander Sarychev 1 and Lyudmyla Sarycheva 2
1 Institute of Technical Mechanics of the National Academy of Sciences of Ukraine,
15 Leshko-Popel St., Dnipropetrovs’k, 49005, Ukraine
2 National Mining University of Ukraine, 19 K. Marks Ave., Dnipropetrovs’k, 49027, Ukraine
Sarychev@prognoz.dp.ua, Sarycheval@nmu.org.ua
A method for comparing discriminant functions based on dividing the observation samples into training and testing subsamples is substantiated. Conditions for the existence of an optimal set of features, which depend on the parameters of the general populations and on the sample volumes, are obtained. Regularities of the simplification of the optimal discriminant function as the sample volumes decrease and as the variances of the features increase are revealed.
Keywords: Group Method of Data Handling, uncertainty in the composition of features, quality criterion of a linear discriminant function.
Introduction
Solving the discriminant analysis task under structural uncertainty with respect to the composition of features requires adopting some way of comparing discriminant functions constructed on different sets of features. Two ways of comparison are popular in practice. The first is based on dividing the observations into training and testing subsamples: the training subsamples are used to estimate the coefficients of the discriminant functions, and the testing subsamples are used to estimate their classification quality. The second is sliding examination, in which observations serially excluded from the training subsamples serve as testing observations. In the literature these ways have traditionally been treated as heuristic methods, although the existence of an optimal set of features under them has repeatedly been confirmed by the method of statistical trials. In the Group Method of Data Handling (GMDH), an analytical study of these two ways has been carried out [1-4]. To solve the discriminant analysis task under structural uncertainty, besides a way of comparing discriminant functions, one must specify an algorithm for generating the various combinations of features included in the discriminant functions. It is supposed here that the chosen method is the complete sorting-out of all possible combinations of features.
1. A way of comparing discriminant functions on the basis of training and testing subsamples
Suppose that at step s (s = 1, 2, ..., m) of the algorithm of complete sorting-out of all possible sets of features, exactly s components from the set X may be included in the discriminant function; these features form the current set V. In what follows we suppose that V_I and V_II are (s × n_I) and (s × n_II) matrices of observations from the general populations P_I and P_II, ν_I and ν_II are the s-dimensional column vectors of the mathematical expectations in P_I and P_II, and Σ_V is the (s × s) covariance matrix of P_I and P_II.
Let us consider the estimate of the Mahalanobis distance that is constructed with the division of the observations into training and testing subsamples. We calculate the estimates of the coefficients of the discriminant function for the component set V on the training subsamples A and use them to estimate the Mahalanobis distance as the ratio of the between-group variation to the within-group variation on the testing subsamples B:
$$D_{AB}^{2}(V) = \frac{\hat{d}_{A}^{T}\,(\tilde{v}_{IB} - \tilde{v}_{IIB})(\tilde{v}_{IB} - \tilde{v}_{IIB})^{T}\,\hat{d}_{A}}{\hat{d}_{A}^{T}\,S_{B}\,\hat{d}_{A}}. \qquad (1)$$
In formula (1), the vector \hat{d}_{A} is the estimate of the coefficients of the Fisher discriminant function calculated on the training subsamples A:

$$\hat{d}_{A} = S_{A}^{-1}(\tilde{v}_{IA} - \tilde{v}_{IIA}), \qquad (2)$$
where \tilde{v}_{IA} and \tilde{v}_{IIA} are the estimates of the mathematical expectations ν_I and ν_II:

$$\tilde{v}_{kA} = n_{kA}^{-1}\sum_{i=1}^{n_{kA}} (V_{kA})_{i}, \qquad k = I, II; \qquad (3)$$
and the matrix S_A is the unbiased estimate of the covariance matrix Σ_V:

$$S_{A} = (n_{IA} + n_{IIA} - 2)^{-1}\,[\,\bar{v}_{IA}\bar{v}_{IA}^{T} + \bar{v}_{IIA}\bar{v}_{IIA}^{T}\,]. \qquad (4)$$
In formula (4), \bar{v}_{kA} (k = I, II) are the matrices of deviations of the observations V_{kA} from the estimates \tilde{v}_{kA}:

$$\bar{v}_{kA} = [\,V_{kA1} - \tilde{v}_{kA},\; V_{kA2} - \tilde{v}_{kA},\; \ldots,\; V_{kAn_{kA}} - \tilde{v}_{kA}\,]. \qquad (5)$$
The vectors \tilde{v}_{IB} and \tilde{v}_{IIB} are calculated analogously to (3), and the matrix S_B analogously to (4)-(5); n_{IA} and n_{IIA}, n_{IB} and n_{IIB} are the volumes of the training and testing subsamples respectively, and n_{IA} + n_{IB} = n_{I}, n_{IIA} + n_{IIB} = n_{II}.
Using (2), we obtain for D_{AB}^{2}(V):

$$D_{AB}^{2}(V) = \frac{(\tilde{v}_{IA} - \tilde{v}_{IIA})^{T} S_{A}^{-1} (\tilde{v}_{IB} - \tilde{v}_{IIB})\,(\tilde{v}_{IB} - \tilde{v}_{IIB})^{T} S_{A}^{-1} (\tilde{v}_{IA} - \tilde{v}_{IIA})}{(\tilde{v}_{IA} - \tilde{v}_{IIA})^{T} S_{A}^{-1} S_{B} S_{A}^{-1} (\tilde{v}_{IA} - \tilde{v}_{IIA})}. \qquad (6)$$
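As a numerical illustration, criterion (1) with the estimates (2)-(5) can be computed directly. Below is a minimal NumPy sketch (the function name `d2_ab` and the row-wise data layout are our own conventions, not the paper's):

```python
import numpy as np

def d2_ab(V_IA, V_IIA, V_IB, V_IIB):
    """Criterion (1): Fisher coefficients are estimated on the training
    subsamples A and scored on the testing subsamples B. Each argument
    is an (n x s) matrix of observations, one observation per row."""
    # (3): mean vectors on the training subsamples
    m_IA, m_IIA = V_IA.mean(axis=0), V_IIA.mean(axis=0)
    # (4)-(5): pooled unbiased covariance estimate on A
    r = V_IA.shape[0] + V_IIA.shape[0] - 2
    S_A = ((V_IA - m_IA).T @ (V_IA - m_IA)
           + (V_IIA - m_IIA).T @ (V_IIA - m_IIA)) / r
    # (2): Fisher discriminant coefficients
    d_A = np.linalg.solve(S_A, m_IA - m_IIA)
    # analogues of (3)-(5) on the testing subsamples B
    m_IB, m_IIB = V_IB.mean(axis=0), V_IIB.mean(axis=0)
    r_B = V_IB.shape[0] + V_IIB.shape[0] - 2
    S_B = ((V_IB - m_IB).T @ (V_IB - m_IB)
           + (V_IIB - m_IIB).T @ (V_IIB - m_IIB)) / r_B
    # (1): between-group over within-group variation along d_A
    return (d_A @ (m_IB - m_IIB)) ** 2 / (d_A @ S_B @ d_A)
```

For well-separated populations and moderate sample volumes the returned value concentrates near the population Mahalanobis distance τ²_V.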
Let τ²_V = (ν_I − ν_II)^T Σ_V^{-1} (ν_I − ν_II) be the Mahalanobis distance for the set V, r = n_{IA} + n_{IIA} − 2, c_{A}^{-1} = n_{IA}^{-1} + n_{IIA}^{-1}, and c_{B}^{-1} = n_{IB}^{-1} + n_{IIB}^{-1}.
Theorem. The mathematical expectation of the random variable D²_AB(V) admits a closed-form expression

E{D²_AB(V)} = f(τ²_V, s, r, c_A, c_B),   (7)

which depends only on the Mahalanobis distance τ²_V, the number of features s, the number of degrees of freedom r, and the quantities c_A and c_B.
The validity of the theorem follows from the following facts: 1) the estimates obtained on the subsamples A and B are independent; 2) the estimate (3) and the estimate (4) are independent; 3) the matrix S_A is a random (s × s) matrix that has the Wishart distribution with r degrees of freedom.
Definition 1. The optimal set of components (set of features) is defined as the set V_OPT for which

$$V_{OPT} = \arg\max_{V \subseteq X} E\{D_{AB}^{2}(V)\}. \qquad (8)$$

Definition 2. The discriminant function optimal with respect to the number and composition of the components is defined as the Fisher discriminant function constructed on the set of components V_OPT.
We have proved that the optimal set of components exists and have formulated the conditions under which the optimal discriminant function is simplified in the number of features included in it. For this purpose, E{D²_AB(V)} was investigated as a function of the composition of the set V.
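Definition 1 can be realized by complete sorting-out: every nonempty subset V ⊆ X is scored by the train/test criterion and the maximizer is kept. The sketch below is a hypothetical illustration (function names are ours, not the paper's; it is practical only for small m, since 2^m − 1 subsets are scored):

```python
import numpy as np
from itertools import combinations

def d2_ab(XA1, XA2, XB1, XB2):
    # criterion (1): Fisher direction from subsamples A, scored on B
    m1, m2 = XA1.mean(axis=0), XA2.mean(axis=0)
    r = len(XA1) + len(XA2) - 2
    S_A = ((XA1 - m1).T @ (XA1 - m1) + (XA2 - m2).T @ (XA2 - m2)) / r
    d = np.linalg.solve(S_A, m1 - m2)
    mb1, mb2 = XB1.mean(axis=0), XB2.mean(axis=0)
    r_B = len(XB1) + len(XB2) - 2
    S_B = ((XB1 - mb1).T @ (XB1 - mb1) + (XB2 - mb2).T @ (XB2 - mb2)) / r_B
    return (d @ (mb1 - mb2)) ** 2 / (d @ S_B @ d)

def optimal_feature_set(XA1, XA2, XB1, XB2):
    """Complete sorting-out of all feature subsets, definition (8)."""
    m = XA1.shape[1]
    best_val, best_set = -np.inf, ()
    for s in range(1, m + 1):              # subset size, s = 1, ..., m
        for V in combinations(range(m), s):
            idx = list(V)
            val = d2_ab(XA1[:, idx], XA2[:, idx], XB1[:, idx], XB2[:, idx])
            if val > best_val:
                best_val, best_set = val, V
    return best_set, best_val
```

With one strongly informative feature and several noise features, the maximizing subset reliably contains the informative one.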
The set of components X can be divided into the nonintersecting subsets X = X° ∪ R° ∪ R̃ = V° ∪ R̃ such that: 1) X° ≠ ∅ (where ∅ is the empty set) is the set of components whose mathematical expectations satisfy χ°_Ih ≠ χ°_IIh, h = 1, 2, ..., m°, where m° is their number; 2) R° is the set of components whose mathematical expectations satisfy ρ°_Ih = ρ°_IIh, h = 1, 2, ..., l°, where l° is their number, and each component in R° depends statistically on at least one component of the set X° (the set R° may be empty); 3) R̃ is the set of components whose mathematical expectations satisfy ρ̃_Ih = ρ̃_IIh, h = 1, 2, ..., l̃, where l̃ is their number, and each component in R̃ is statistically independent of the set X° (the set R̃ may be empty). The relationship between the Mahalanobis distance for the set of components V° = X° ∪ R° and the Mahalanobis distance for a currently analyzed set of components V ⊆ X is formulated in the form of lemmas [1-4].
In the case of known parameters of the general populations P_I and P_II, it follows from the stated lemmas that: 1) every component from the set X° is necessary in the sense that its inclusion into the current set of components V increases the Mahalanobis distance τ²_V; 2) every component from the set R° is necessary in the same sense; 3) every component from the set R̃ is redundant in the sense that its inclusion into the current set V does not increase the Mahalanobis distance τ²_V.
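The three cases can be checked numerically from known population parameters. In the sketch below the means and covariances are hand-picked illustrative values (not from the paper): a redundant independent feature leaves τ²_V unchanged, while an equal-mean feature correlated with X° increases it.

```python
import numpy as np

def mahalanobis2(nu_I, nu_II, Sigma):
    # tau^2_V = (nu_I - nu_II)^T Sigma_V^{-1} (nu_I - nu_II)
    d = np.asarray(nu_I, float) - np.asarray(nu_II, float)
    return d @ np.linalg.solve(Sigma, d)

# X°: a single informative feature whose means differ by 1
t_base = mahalanobis2([1.0], [0.0], np.array([[1.0]]))

# adding a feature from R~ (equal means, independent of X°):
# the distance does not change
t_redundant = mahalanobis2([1.0, 0.0], [0.0, 0.0], np.eye(2))

# adding a feature from R° (equal means, correlated with X°):
# the distance increases
Sigma_corr = np.array([[1.0, 0.6],
                       [0.6, 1.0]])
t_correlated = mahalanobis2([1.0, 0.0], [0.0, 0.0], Sigma_corr)
```

Here t_base = 1, t_redundant = 1, and t_correlated = 1/(1 − 0.6²) ≈ 1.5625.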
2. Conditions of reduction (simplification) of the optimal discriminant function
As a rule, in practical applications the parameters of the general populations are unknown; however, they can be estimated statistically on training samples of observations of limited volume. It is known that if the constructed classification rule is applied to the training sample itself, the estimate of recognition quality is overstated in mathematical expectation in comparison with the same quality estimate obtained on data independent of the training data.
The way of comparing discriminant functions based on dividing the initial data sample into training and testing subsamples gives estimates of recognition quality that are not overstated. Experience of practical applications and test investigations of this way by the method of statistical trials shows that: 1) as the volume of the observation samples increases, the number of components in the set on which the best recognition quality is attained increases, and as the volume decreases, the number of components in such a set decreases; 2) as the Mahalanobis distance τ²_X between the general populations (from which the observation samples were obtained) increases, the number of components in the set on which the best recognition quality is attained increases, and as this distance decreases, the number of components in such a set decreases.
Our analytical investigations confirm these empirically determined regularities concerning the existence of the discriminant function optimal by the number and composition of components. Let us formulate the conditions of reduction (simplification) of the optimal discriminant function for the special case of an independent feature. Let the set V be such that X° = V ∪ x°, where x° ∈ X° (one feature is missed).
Taking into account (7), we obtain

Δ(V) = E{D²_AB(X°)} − E{D²_AB(V)},   (9)

where the first term is the expression (7) taken with s = m° and the distance τ²_X°, and the second term is the expression (7) taken with s = m° − 1 and the distance τ²_V.
According to the above-mentioned lemmas, the Mahalanobis distances of the sets V and X° satisfy the relation τ²_V = τ²_X° − γ², where γ² = σ⁻²_x°(χ°_I − χ°_II)² is the component of the Mahalanobis distance contributed by the missed independent feature x° ∈ X°. In view of this, restricting ourselves to accuracy of order 1/n and neglecting terms of order 1/n², we obtain
for Δ(V) an expansion whose part in braces is a quadratic trinomial with respect to γ², with coefficients that depend on τ²_X°, m°, r and c_A.   (10)
The value Δ(V) can be both positive and negative. If Δ(V) > 0, the feature x° should be included in the discriminant function. If Δ(V) < 0, then x° should not be included in the discriminant function, since its inclusion would decrease the value of D²_AB; i.e., the addition of the feature x° does not improve the quality of the discriminant function by the considered criterion. The condition Δ(V) < 0 is the condition of reduction (simplification) of the discriminant function that is optimal by the number and composition of features. It is the condition of negativity of the quadratic trinomial with respect to γ² in the braces of (10). Reduction of the discriminant function is possible when the value γ² is below the threshold value
when value 2γ below then threshold value
1
o
o
2
1
2
22
1
1
1
)(
o
o
o
−
−
+
⎟⎟
⎟
⎠
⎞
⎜⎜
⎜
⎝
⎛
−
+−
τ
+
⎟
⎟
⎟
⎠
⎞
⎜
⎜
⎜
⎝
⎛
−
τ
⋅τ=γ
A
X
A
X
X
por
cm
r
mr
c
r
. (11)
The figure shows the dependences of the threshold value (11) on the sample volume n for a set of Mahalanobis distances τ²_X° (τ²_X° = 6, 8, ..., 18) at the fixed value m° = 6.

Fig. – Dependences of the threshold value (γ²)_por on the volume of subsamples n
Let us note that in the asymptotics, as n → ∞ (r → ∞, c_A^{-1} → 0), the reduction condition is not satisfied, i.e. V_OPT = X°.
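The decomposition τ²_X° = τ²_V + γ² used above for an independent missed feature can also be verified numerically. The parameters below (Σ_V, the mean differences, σ²_x°) are hand-picked for illustration, not taken from the paper:

```python
import numpy as np

def mahalanobis2(delta, Sigma):
    delta = np.asarray(delta, float)
    return delta @ np.linalg.solve(Sigma, delta)

# the retained set V: two correlated features
Sigma_V = np.array([[1.0, 0.3],
                    [0.3, 2.0]])
delta_V = np.array([1.0, 0.5])

# the missed feature x°: independent of V, mean difference
# (chi_I° - chi_II°) = 1.5 and variance sigma² = 4
sigma2, chi_diff = 4.0, 1.5
gamma2 = chi_diff ** 2 / sigma2      # gamma² = sigma⁻²(chi_I° - chi_II°)²

# the full set X° = V ∪ x° (block-diagonal covariance)
Sigma_X = np.block([[Sigma_V, np.zeros((2, 1))],
                    [np.zeros((1, 2)), np.array([[sigma2]])]])
delta_X = np.append(delta_V, chi_diff)

tau2_V = mahalanobis2(delta_V, Sigma_V)
tau2_X = mahalanobis2(delta_X, Sigma_X)  # equals tau2_V + gamma2
```

Because x° is independent of V, the covariance matrix of X° is block-diagonal, and the full Mahalanobis distance splits additively into τ²_V and γ².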
3. Conclusions
The method of comparing discriminant functions based on dividing the initial data sample into training and testing subsamples is substantiated. In spite of the successful use of this way in practice and the repeated confirmation of its efficiency by the method of statistical trials, it has traditionally been regarded as a heuristic method. The conditions of reduction (simplification) of the discriminant function that is optimal by the composition of features are revealed. It is obtained that these conditions depend on the sample volumes and on the parameters of the general populations, i.e. on the mathematical expectations and covariance matrices of the features. It is shown that under structural uncertainty and in the absence of a priori estimates of the parameters of the general populations this method makes it possible to solve the problem of searching for the discriminant function of optimal complexity.
References
1. Sarychev A.P. Circuit of Discriminant Analysis with Learning and Checking Subsamplings of Observations // Automatica (Ukraine). – 1990. – No. 1. – P. 32–41 (Journal of Automation and Information Sciences. – Scripta Technica Inc. – 1990. – Vol. 23, No. 1. – P. 29–39).
2. Miroshnichenko L.V., Sarychev A.P. Scheme of the Sliding Exam for Search of the Optimal Set of Characters in the Problem of Discriminant Analysis // Automatica (Ukraine). – 1992. – No. 1. – P. 35–44 (Journal of Automation and Information Sciences. – Scripta Technica Inc. – 1992. – Vol. 25, No. 1. – P. 33–42).
3. Sarychev A.P. The Solution of the Discriminant Analysis Task in Conditions of Structural Uncertainty on Basis of the Group Method of Data Handling // Problemy Upravlenia i Informatiki (Ukraine). – 2008. – No. 3. – P. 100–112 (Journal of Automation and Information Sciences. – Begell House Inc. – 2008. – Vol. 40, No. 6. – P. 27–40).
4. Sarychev A.P. Identification of Structural-Uncertain Systems States. – Dnipropetrovs'k: Institute of Technical Mechanics, NAS Ukraine and NSA Ukraine, 2008. – 268 p.
5. Sarychev A.P., Sarycheva L.V. The Optimal Set Features Determination in Discriminant Analysis by the Group Method of Data Handling // Systems Analysis and Modelling Simulation (SAMS), Overseas Publishers Association. – 1998. – Vol. 31. – P. 153–167.
6. Sarychev A.P. S-Scheme of Sliding Examination for Optimal Set Features Determination in Discriminant Analysis by the Group Method of Data Handling // System Analysis and Modelling Simulation (SAMS), Taylor & Francis. – 2003. – Vol. 43, No. 10. – P. 1351–1362.
7. Sarychev A.P., Sarycheva L.V. Quality Estimation of Discriminant Functions by Sliding Examination // The Fourth International Workshop on Inductive Modelling (IWIM-2011), July 4–8, 2011, Kyiv, Ukraine: Proc. – Kyiv: IRTC ITS, 2011. – P. 104–108. – ISBN 978-966-02-6078-8.