Discriminant Functions Quality Estimation on the Basis of Training and Testing Samples

A method for comparing discriminant functions by dividing the observation samples into training and testing subsamples is substantiated. Conditions for the existence of an optimal set of features, which depend on the parameters of the general populations and on the sample sizes, are obtained. Laws of simplification of optimum discrimin...

Bibliographic details
Date: 2012
Authors: Sarychev, A., Sarycheva, L.
Format: Article
Language: English
Published: Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України, 2012
Publication title: Індуктивне моделювання складних систем
Online access: http://dspace.nbuv.gov.ua/handle/123456789/45954
Journal title: Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite this: Discriminant Functions Quality Estimation on the Basis of Training and Testing Samples / A. Sarychev, L. Sarycheva // Індуктивне моделювання складних систем: Зб. наук. пр. — К.: МННЦ ІТС НАН та МОН України, 2012. — Вип. 4. — С. 21-27. — Бібліогр.: 7 назв. — англ.

Repositories

Digital Library of Periodicals of National Academy of Sciences of Ukraine
id irk-123456789-45954
record_format dspace
institution Digital Library of Periodicals of National Academy of Sciences of Ukraine
collection DSpace DC
language English
description A method for comparing discriminant functions by dividing the observation samples into training and testing subsamples is substantiated. Conditions for the existence of an optimal set of features, which depend on the parameters of the general populations and on the sample sizes, are obtained. Regularities in the simplification of the optimal discriminant function as the sample sizes decrease and as the variances of the features increase are revealed.
format Article
author Sarychev, A.
Sarycheva, L.
title Discriminant Functions Quality Estimation on the Basis of Training and Testing Samples
publisher Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України
publishDate 2012
url http://dspace.nbuv.gov.ua/handle/123456789/45954
series Індуктивне моделювання складних систем
fulltext Sarychev A.P., Sarycheva L.V. Індуктивне моделювання складних систем, випуск 4, 2012 21

UDC 519.25

DISCRIMINANT FUNCTIONS QUALITY ESTIMATION ON THE BASIS OF TRAINING AND TESTING SAMPLES

Alexander Sarychev (1) and Lyudmyla Sarycheva (2)

(1) Institute of Technical Mechanics of the National Academy of Sciences of Ukraine, 15 Leshko-Popel St., Dnipropetrovs’k, 49005, Ukraine
(2) National Mining University of Ukraine, 19 K. Marks Ave., Dnipropetrovs’k, 49027, Ukraine
Sarychev@prognoz.dp.ua, Sarycheval@nmu.org.ua

A method for comparing discriminant functions by dividing the observation samples into training and testing subsamples is substantiated. Conditions for the existence of an optimal set of features, which depend on the parameters of the general populations and on the sample sizes, are obtained. Regularities in the simplification of the optimal discriminant function as the sample sizes decrease and as the variances of the features increase are revealed.

Keywords: Group Method of Data Handling, uncertainty in the composition of features, quality criterion of a linear discriminant function.
Introduction

Solving a discriminant analysis problem under structural uncertainty about the composition of features requires adopting some way of comparing discriminant functions constructed on different sets of features. Two ways of comparison are popular in practice. The first is based on dividing the observations into training and testing subsamples: the training subsamples are used to estimate the coefficients of the discriminant functions, and the testing subsamples are used to estimate their classification quality. The second is sliding examination, in which the observations excluded in turn from the training subsamples serve as testing observations. In the literature both ways are traditionally treated as heuristic methods, although the existence of an optimal set of features under them has been confirmed repeatedly by the method of statistical tests. Within the Group Method of Data Handling (GMDH), these two ways have been studied analytically [1-4].

Besides a way of comparing discriminant functions, solving a discriminant analysis problem under structural uncertainty requires an algorithm that generates the various combinations of features included in the discriminant functions. We assume that an exhaustive search over all possible combinations of features is used for this purpose.

1. Way of comparison of discriminant functions on the basis of training and testing subsamples

Suppose that at step $s$ $(s = 1, 2, \ldots, m)$ of the exhaustive search over all possible sets of features, only $s$ components of the set $X$ can be included in the discriminant function, and these features form the current set $V$.
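The exhaustive search described above can be sketched in a few lines. This is only an illustration under stated assumptions: features are represented by integer indices, and the helper name `feature_subsets` is hypothetical, not taken from the paper.

```python
from itertools import combinations


def feature_subsets(m):
    """Yield (s, V) for every non-empty subset V of the feature indices 0..m-1.

    Subsets are generated step by step: all subsets of size s = 1 first,
    then s = 2, and so on up to s = m (hypothetical helper for illustration).
    """
    for s in range(1, m + 1):
        for subset in combinations(range(m), s):
            yield s, subset


# With m = 4 features there are 2**4 - 1 = 15 candidate sets V.
candidates = list(feature_subsets(4))
print(len(candidates))  # 15
```

The number of candidate sets grows as $2^m - 1$, so a search of this kind is practical only for a moderate number of features.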
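In what follows we suppose that $\mathbf{V}_{\mathrm{I}}$ and $\mathbf{V}_{\mathrm{II}}$ are $(s \times n_{\mathrm{I}})$ and $(s \times n_{\mathrm{II}})$ matrices of observations from the general populations $P_{\mathrm{I}}$ and $P_{\mathrm{II}}$, that $\boldsymbol{\nu}_{\mathrm{I}}$ and $\boldsymbol{\nu}_{\mathrm{II}}$ are the $s$-dimensional column vectors of mathematical expectations in $P_{\mathrm{I}}$ and $P_{\mathrm{II}}$, and that $\boldsymbol{\Sigma}_V$ is the $(s \times s)$ covariance matrix common to $P_{\mathrm{I}}$ and $P_{\mathrm{II}}$.

Consider an estimate of the Mahalanobis distance constructed with the observations divided into training and testing subsamples. We estimate the coefficients of the discriminant function for the set of components $V$ on the training subsample $A$ and use them to estimate the Mahalanobis distance as the ratio of the intergroup variation to the intragroup variation on the testing subsample $B$:

\[
D_{AB}^{2}(V) = \frac{\hat{\mathbf{d}}_A^{\mathrm{T}} (\tilde{\mathbf{v}}_{\mathrm{I}B} - \tilde{\mathbf{v}}_{\mathrm{II}B}) (\tilde{\mathbf{v}}_{\mathrm{I}B} - \tilde{\mathbf{v}}_{\mathrm{II}B})^{\mathrm{T}} \hat{\mathbf{d}}_A}{\hat{\mathbf{d}}_A^{\mathrm{T}} \mathbf{S}_B \hat{\mathbf{d}}_A}. \tag{1}
\]

In formula (1), the vector $\hat{\mathbf{d}}_A$ is the estimate of the coefficients of the Fisher function calculated on the training subsamples $A$:

\[
\hat{\mathbf{d}}_A = \mathbf{S}_A^{-1} (\tilde{\mathbf{v}}_{\mathrm{I}A} - \tilde{\mathbf{v}}_{\mathrm{II}A}), \tag{2}
\]

where $\tilde{\mathbf{v}}_{\mathrm{I}A}$ and $\tilde{\mathbf{v}}_{\mathrm{II}A}$ are estimates of the mathematical expectations $\boldsymbol{\nu}_{\mathrm{I}}$ and $\boldsymbol{\nu}_{\mathrm{II}}$:

\[
\tilde{\mathbf{v}}_{kA} = n_{kA}^{-1} \sum_{i=1}^{n_{kA}} \mathbf{V}_{kiA}, \qquad k = \mathrm{I}, \mathrm{II}; \tag{3}
\]

the matrix $\mathbf{S}_A$ is an unbiased estimate of the covariance matrix $\boldsymbol{\Sigma}_V$:

\[
\mathbf{S}_A = (n_{\mathrm{I}A} + n_{\mathrm{II}A} - 2)^{-1} \left[ \mathbf{v}_{\mathrm{I}A} \mathbf{v}_{\mathrm{I}A}^{\mathrm{T}} + \mathbf{v}_{\mathrm{II}A} \mathbf{v}_{\mathrm{II}A}^{\mathrm{T}} \right]. \tag{4}
\]

In formula (4), $\mathbf{v}_{kA}$ $(k = \mathrm{I}, \mathrm{II})$ are the matrices of deviations of the observations $\mathbf{V}_{kA}$ from the estimate $\tilde{\mathbf{v}}_{kA}$:

\[
\mathbf{v}_{kA} = \left[ \mathbf{V}_{k1A} - \tilde{\mathbf{v}}_{kA},\; \mathbf{V}_{k2A} - \tilde{\mathbf{v}}_{kA},\; \ldots,\; \mathbf{V}_{k n_{kA} A} - \tilde{\mathbf{v}}_{kA} \right]. \tag{5}
\]

In formula (1), the vectors $\tilde{\mathbf{v}}_{\mathrm{I}B}$ and $\tilde{\mathbf{v}}_{\mathrm{II}B}$ are calculated analogously to (3), and the matrix $\mathbf{S}_B$ analogously to (4)–(5). Here $n_{\mathrm{I}A}$, $n_{\mathrm{II}A}$ and $n_{\mathrm{I}B}$, $n_{\mathrm{II}B}$ are the sizes of the training and testing subsamples respectively, with $n_{\mathrm{I}A} + n_{\mathrm{I}B} = n_{\mathrm{I}}$ and $n_{\mathrm{II}A} + n_{\mathrm{II}B} = n_{\mathrm{II}}$. Using (2), we obtain for $D_{AB}^{2}(V)$:

\[
D_{AB}^{2}(V) = \frac{(\tilde{\mathbf{v}}_{\mathrm{I}A} - \tilde{\mathbf{v}}_{\mathrm{II}A})^{\mathrm{T}} \mathbf{S}_A^{-1} (\tilde{\mathbf{v}}_{\mathrm{I}B} - \tilde{\mathbf{v}}_{\mathrm{II}B}) (\tilde{\mathbf{v}}_{\mathrm{I}B} - \tilde{\mathbf{v}}_{\mathrm{II}B})^{\mathrm{T}} \mathbf{S}_A^{-1} (\tilde{\mathbf{v}}_{\mathrm{I}A} - \tilde{\mathbf{v}}_{\mathrm{II}A})}{(\tilde{\mathbf{v}}_{\mathrm{I}A} - \tilde{\mathbf{v}}_{\mathrm{II}A})^{\mathrm{T}} \mathbf{S}_A^{-1} \mathbf{S}_B \mathbf{S}_A^{-1} (\tilde{\mathbf{v}}_{\mathrm{I}A} - \tilde{\mathbf{v}}_{\mathrm{II}A})}. \tag{6}
\]

Let $\tau_V^2 = (\boldsymbol{\nu}_{\mathrm{I}} - \boldsymbol{\nu}_{\mathrm{II}})^{\mathrm{T}} \boldsymbol{\Sigma}_V^{-1} (\boldsymbol{\nu}_{\mathrm{I}} - \boldsymbol{\nu}_{\mathrm{II}})$ be the Mahalanobis distance for the set $V$, and let $r = n_{\mathrm{I}A} + n_{\mathrm{II}A} - 2$, $c_A^{-1} = n_{\mathrm{I}A}^{-1} + n_{\mathrm{II}A}^{-1}$, $c_B^{-1} = n_{\mathrm{I}B}^{-1} + n_{\mathrm{II}B}^{-1}$.

Theorem. For the mathematical expectation of the random variable $D_{AB}^{2}(V)$ we have

\[
E\{D_{AB}^{2}(V)\} = \left( \tau_V^2 - \frac{\tau_V^2 \left[ s - (r-1)/(r-s) \right] c_A^{-1}}{\tau_V^2 + s\, c_A^{-1}} + c_B^{-1}\, \frac{r-1}{r-s} \right) \frac{r-s}{r-1}. \tag{7}
\]

The validity of the theorem follows from the following facts: 1) the estimates obtained on subsamples $A$ and $B$ are independent; 2) the estimates (3) and (4) are independent; 3) $\mathbf{S}_A$ is a random $(s \times s)$ matrix that has the Wishart distribution with $r$ degrees of freedom.

Definition 1. The optimal set of components (set of features) is the set $V_{OPT}$ for which

\[
V_{OPT} = \arg\max_{V \subseteq X} E\{D_{AB}^{2}(V)\}. \tag{8}
\]

Definition 2. The discriminant function that is optimal with respect to the number and composition of components is the Fisher discriminant function constructed on the set of components $V_{OPT}$.

We proved that the optimal set of components exists and formulated the conditions under which the optimal discriminant function is simplified in the number of features included in it. For this purpose, the dependence of $E\{D_{AB}^{2}(V)\}$ on the composition of the set $V$ was investigated.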
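The sample criterion (1), with the coefficients (2) and the estimates (3)–(5), can be computed as in the sketch below. It assumes NumPy, observations stored as columns of $(s \times n)$ arrays as in the text, and hypothetical function names (`pooled_cov`, `d2_ab`); the simulated data are purely illustrative and are not from the paper.

```python
import numpy as np


def pooled_cov(V1, V2):
    """Unbiased pooled covariance estimate, formulas (4)-(5).

    V1 and V2 are (s x n) arrays with one observation per column.
    """
    dev1 = V1 - V1.mean(axis=1, keepdims=True)
    dev2 = V2 - V2.mean(axis=1, keepdims=True)
    r = V1.shape[1] + V2.shape[1] - 2
    return (dev1 @ dev1.T + dev2 @ dev2.T) / r


def d2_ab(V1_A, V2_A, V1_B, V2_B):
    """Criterion (1): coefficients from subsample A, quality measured on B."""
    diff_A = V1_A.mean(axis=1) - V2_A.mean(axis=1)
    d_A = np.linalg.solve(pooled_cov(V1_A, V2_A), diff_A)  # formula (2)
    diff_B = V1_B.mean(axis=1) - V2_B.mean(axis=1)
    S_B = pooled_cov(V1_B, V2_B)
    return float((d_A @ diff_B) ** 2 / (d_A @ S_B @ d_A))


# Illustrative data: s = 3 features, 20 training and 20 testing columns per class.
rng = np.random.default_rng(0)
shift = np.array([[1.0], [0.5], [0.0]])
V1_A, V1_B = rng.normal(size=(3, 20)) + shift, rng.normal(size=(3, 20)) + shift
V2_A, V2_B = rng.normal(size=(3, 20)), rng.normal(size=(3, 20))

print(d2_ab(V1_A, V2_A, V1_B, V2_B))
```

Each subsample needs enough observations for the pooled covariance matrix to be invertible, which is why the sketch uses many more columns than features.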
The set of components $X$ can be divided into the following nonintersecting subsets, $X = \mathring{X} \cup \mathring{R} \cup \tilde{R} = \mathring{V} \cup \tilde{R}$, where $\mathring{V} = \mathring{X} \cup \mathring{R}$, so that:

1) $\mathring{X} \neq \varnothing$ (where $\varnothing$ is the empty set) is the set of components whose mathematical expectations satisfy $\mathring{\chi}_{\mathrm{I}h} \neq \mathring{\chi}_{\mathrm{II}h}$, $h = 1, 2, \ldots, \mathring{m}$, where $\mathring{m}$ is their number;

2) $\mathring{R}$ is the set of components whose mathematical expectations satisfy $\mathring{\rho}_{\mathrm{I}h} = \mathring{\rho}_{\mathrm{II}h}$, $h = 1, 2, \ldots, \mathring{l}$, where $\mathring{l}$ is their number, and each component in $\mathring{R}$ depends statistically on at least one component in the set $\mathring{X}$ (the set $\mathring{R}$ may be empty);

3) $\tilde{R}$ is the set of components whose mathematical expectations satisfy $\tilde{\rho}_{\mathrm{I}h} = \tilde{\rho}_{\mathrm{II}h}$, $h = 1, 2, \ldots, \tilde{l}$, where $\tilde{l}$ is their number, and each component in $\tilde{R}$ is statistically independent of every component of the set $\mathring{X}$ (the set $\tilde{R}$ may be empty).

The relationship between the Mahalanobis distance for the set of components $\mathring{V} = \mathring{X} \cup \mathring{R}$ and the Mahalanobis distance for the currently analyzed set of components $V \subseteq X$ is formulated in the form of lemmas [1-4]. When the parameters of the general populations $P_{\mathrm{I}}$ and $P_{\mathrm{II}}$ are known, it follows from these lemmas that:

1) every component from the set $\mathring{X}$ is necessary in the sense that its inclusion into the current set of components $V$ increases the Mahalanobis distance $\tau_V^2$;

2) every component from the set $\mathring{R}$ is necessary in the sense that its inclusion into the current set of components $V$ increases the Mahalanobis distance $\tau_V^2$;

3) every component from the set $\tilde{R}$ is redundant in the sense that its inclusion into the current set $V$ does not increase the Mahalanobis distance $\tau_V^2$.

2. Conditions of reduction (simplification) of optimum discriminant function

As a rule, in practical applications the parameters of the general populations are unknown; however, they can be estimated from training samples of observations of limited size.
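For known population parameters, the three statements that close Section 1 can be checked numerically. The sketch below assumes NumPy; the means and covariance matrices are made-up illustrative values, and `mahalanobis2` is a hypothetical helper that evaluates $\tau_V^2$ from its definition.

```python
import numpy as np


def mahalanobis2(mu1, mu2, sigma):
    """Population Mahalanobis distance: (mu1 - mu2)^T Sigma^{-1} (mu1 - mu2)."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return float(diff @ np.linalg.solve(np.asarray(sigma, dtype=float), diff))


# Two informative features: their means differ between the populations.
mu1, mu2 = [1.0, 0.5], [0.0, 0.0]
sigma = [[1.0, 0.3], [0.3, 1.0]]
tau2_informative = mahalanobis2(mu1, mu2, sigma)

# Dropping an informative feature lowers the distance.
tau2_one = mahalanobis2(mu1[:1], mu2[:1], [[1.0]])

# Equal-mean feature that is independent of the informative ones: no gain.
sigma_indep = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
tau2_indep = mahalanobis2(mu1 + [2.0], mu2 + [2.0], sigma_indep)

# Equal-mean feature that is correlated with an informative one: the distance grows.
sigma_dep = [[1.0, 0.3, 0.5], [0.3, 1.0, 0.0], [0.5, 0.0, 1.0]]
tau2_dep = mahalanobis2(mu1 + [2.0], mu2 + [2.0], sigma_dep)

print(tau2_one, tau2_informative, tau2_indep, tau2_dep)
```

With estimated rather than known parameters these statements no longer decide which features to keep, which is the situation Section 2 addresses.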
It is known that if the constructed classification rule is applied to the training sample itself, the estimate of recognition quality is overstated in mathematical expectation compared with the same estimate obtained on data independent of the training data. Comparing discriminant functions by dividing the initial data sample into training and testing subsamples gives estimates of recognition quality that are not overstated. Practical experience and test investigations of this way by the method of statistical tests show that:

1) as the size of the observation samples increases, the number of components in the set on which the best recognition quality is attained increases, and as the size of the observation samples decreases, the number of components in this set decreases;

2) as the Mahalanobis distance $\tau_X^2$ between the general populations (from which the observation samples were obtained) increases, the number of components in the set on which the best recognition quality is attained increases, and as this distance decreases, the number of components in this set decreases.

Our analytical investigations confirm these empirically determined regularities concerning the existence of a discriminant function that is optimal in the number and composition of components.

Let us formulate the conditions of reduction (simplification) of the optimal discriminant function for the special case of an independent feature. Let the set $V$ be such that $\mathring{X} = V \cup \mathring{x}$, where $\mathring{x} \in \mathring{X}$ (one feature is missing from $V$). Taking (7) into account, we obtain

\[
\begin{aligned}
\Delta(V) &= E\{D_{AB}^{2}(\mathring{X})\} - E\{D_{AB}^{2}(V)\} \\
&= \left( \tau_{\mathring{X}}^2 - \frac{\tau_{\mathring{X}}^2 \left[ \mathring{m} - (r-1)/(r-\mathring{m}) \right] c_A^{-1}}{\tau_{\mathring{X}}^2 + \mathring{m}\, c_A^{-1}} + c_B^{-1}\, \frac{r-1}{r-\mathring{m}} \right) \frac{r-\mathring{m}}{r-1} \\
&\quad - \left( \tau_V^2 - \frac{\tau_V^2 \left[ (\mathring{m}-1) - (r-1)/(r-\mathring{m}+1) \right] c_A^{-1}}{\tau_V^2 + (\mathring{m}-1)\, c_A^{-1}} + c_B^{-1}\, \frac{r-1}{r-\mathring{m}+1} \right) \frac{r-\mathring{m}+1}{r-1}.
\end{aligned} \tag{9}
\]

According to the lemmas mentioned above, the Mahalanobis distances of the sets $V$ and $\mathring{X}$ satisfy the relation $\tau_V^2 = \tau_{\mathring{X}}^2 - \gamma^2$, where $\gamma^2 = \sigma_{\mathring{x}}^{-2} (\mathring{\chi}_{\mathrm{I}} - \mathring{\chi}_{\mathrm{II}})^2$ is the component of the Mahalanobis distance caused by the missing independent feature $\mathring{x} \in \mathring{X}$. Taking this into account, keeping terms of order $1/n$ and neglecting terms of order $1/n^2$, we obtain

\[
\begin{aligned}
\Delta(V) = {} & \frac{1}{\left( \tau_{\mathring{X}}^2 + \mathring{m}\, c_A^{-1} \right) \left[ \left( \tau_{\mathring{X}}^2 - \gamma^2 \right) + (\mathring{m}-1)\, c_A^{-1} \right]} \\
& \times \Biggl\{ -\left( \tau_{\mathring{X}}^2\, \frac{r-\mathring{m}+1}{r-1} + \frac{r-\mathring{m}}{r-1}\, \mathring{m}\, c_A^{-1} \right) (\gamma^2)^2 \\
& \quad + \tau_{\mathring{X}}^2 \left( \tau_{\mathring{X}}^2\, \frac{r-\mathring{m}+2}{r-1} + 2\, \frac{r-\mathring{m}}{r-1}\, \mathring{m}\, c_A^{-1} \right) \gamma^2 \\
& \quad - \left( \tau_{\mathring{X}}^2 \right)^2 \left( \tau_{\mathring{X}}^2\, \frac{1}{r-1} + \frac{r-\mathring{m}}{r-1}\, c_A^{-1} \right) \Biggr\}.
\end{aligned} \tag{10}
\]

The value $\Delta(V)$ can be either positive or negative. If $\Delta(V) > 0$, the feature $\mathring{x}$ must be included in the discriminant function. If $\Delta(V) < 0$, the feature $\mathring{x}$ should not be included in the discriminant function, because including it decreases the value of $D_{AB}^{2}$; that is, adding the feature $\mathring{x}$ does not improve the quality of the discriminant function by the criterion considered. The condition $\Delta(V) < 0$ is the condition of reduction (simplification) of the discriminant function that is optimal in the number and composition of features. It is the condition that the quadratic trinomial in $\gamma^2$ in the braces of (10) be negative. Reduction of the discriminant function is possible when the value of $\gamma^2$ is below the threshold value

\[
(\gamma^2)_{por} = \tau_{\mathring{X}}^2 \cdot \frac{\tau_{\mathring{X}}^2\, \dfrac{1}{r-1} + c_A^{-1}}{\tau_{\mathring{X}}^2\, \dfrac{r-\mathring{m}+1}{r-1} + \mathring{m}\, c_A^{-1}}. \tag{11}
\]

The figure shows the dependence of the threshold value (11) on the sample size $n$ for a set of Mahalanobis distances $\tau_{\mathring{X}}^2$ $(\tau_{\mathring{X}}^2 = 6, 8, \ldots, 18)$ at fixed $\mathring{m} = 6$.

Fig. Dependence of the threshold value $(\gamma^2)_{por}$ on the size of the subsamples $n$

Note that asymptotically, as $n \to \infty$ ($r \to \infty$, $c_A^{-1} \to 0$), the condition of reduction does not hold, i.e. $V_{OPT} = \mathring{X}$.

3. Conclusions

The method of comparing discriminant functions by dividing the initial data sample into training and testing subsamples is substantiated. In spite of its successful use in practice and the repeated confirmation of its efficiency by the method of statistical tests, it has traditionally been regarded as a heuristic method. Conditions of reduction (simplification) of the discriminant function that is optimal in the composition of features are revealed. These conditions depend on the sample sizes and on the parameters of the general populations, i.e. on the mathematical expectations and covariance matrices of the features. It is shown that, under structural uncertainty and in the absence of a priori estimates of the parameters of the general populations, this method makes it possible to solve the problem of searching for the discriminant function of optimal complexity.

References

1. Sarychev A. P. Circuit of Discriminant Analysis with Learning and Checking Subsamplings of Observations. Automatica (Ukraine), 1: 32–41, 1990 (Journal of Automation and Information Sciences, Scripta Technika Inc., Vol. 23, 1: 29–39, 1990).
2. Miroshnichenko L. V., Sarychev A. P. Scheme of the Sliding Exam for Search of the Optimal Set of Characters in the Problem of Discriminant Analysis. Automatica (Ukraine), 1: 35–44, 1992 (Journal of Automation and Information Sciences, Scripta Technika Inc., Vol. 25, 1: 33–42, 1992).
3. Sarychev A. P. The Solution of the Discriminant Analysis Task in Conditions of Structural Uncertainty on Basis of the Group Method of Data Handling. Problemy Upravlenia i Informatiki (Ukraine), 3: 100–112, 2008 (Journal of Automation and Information Sciences, Begell House Inc., Vol. 40, 6: 27–40, 2008).
4. Sarychev A. P. Identification of Structural-Uncertain Systems States. The book. Dnipropetrovs’k, Ukraine, NAS Ukraine and NSA Ukraine, Institute of Technical Mechanics, 268 p., 2008.
5. Sarychev A. P., Sarycheva L. V. The Optimal Set Features Determination in Discriminant Analysis by the Group Method of Data Handling. Systems Analysis and Modeling Simulation (SAMS), Overseas Publishers Association, Vol. 31: 153–167, 1998.
6. Sarychev A. P. S-Scheme of Sliding Examination for Optimal Set Features Determination in Discriminant Analysis by the Group Method of Data Handling. System Analysis and Modelling Simulation (SAMS), Taylor & Francis, Vol. 43, No. 10: 1351–1362, 2003.
7. Sarychev A. P., Sarycheva L. V. Quality Estimation of Discriminant Functions by Sliding Examination. The Fourth Workshop on Inductive Modelling (IWIM–2011), July 4–8 2011, Kyiv, Ukraine, Proc. Kyiv: IRTC ITS, 2011, ISBN 978-966-02-6078-8: 104–108.