Automatic speech recognition system for controlling a robotic system using Romanian

In this paper, we describe a proposed Automatic Speech Recognition system that can be used to give vocal commands to a robotic system. The system was specially created to use Romanian as the main language with respect to its particularities: specific phonetic rules and large variety of accents. The...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Datum:2008
Hauptverfasser: Goloca, A., Pentiuc, S.G.
Format: Artikel
Sprache:English
Veröffentlicht: Інститут фізики напівпровідників імені В.Є. Лашкарьова НАН України 2008
Schriftenreihe:Оптико-електронні інформаційно-енергетичні технології
Schlagworte:
Online Zugang:http://dspace.nbuv.gov.ua/handle/123456789/32189
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Назва журналу:Digital Library of Periodicals of National Academy of Sciences of Ukraine
Zitieren:Automatic speech recognition system for controlling a robotic system using Romanian / A. Goloca, S.G. Pentiuc // Оптико-електронні інформаційно-енергетичні технології. — 2008. — № 2 (16). — С. 109-113. — Бібліогр.: 12 назв. — англ.

Institution

Digital Library of Periodicals of National Academy of Sciences of Ukraine
id irk-123456789-32189
record_format dspace
spelling irk-123456789-321892012-04-14T12:18:23Z Automatic speech recognition system for controlling a robotic system using Romanian Goloca, A. Pentiuc, S.G. Системи технічного зору і штучного інтелекту з обробкою та розпізнаванням зображень In this paper, we describe a proposed Automatic Speech Recognition system that can be used to give vocal commands to a robotic system. The system was specially created to use Romanian as the main language with respect to its particularities: specific phonetic rules and large variety of accents. The main purpose of the system is to act as a touch-less interface for a demonstrative robotic system. This interface was designed to allow a remote human user to give simple commands to the robotic system without any physical contact with the robot. There are more ways to create an Automatic Speech Recognition System but, given the complexity of the task, the methods that were used had to be reliable and this is the reason while the Hidden Markov Models approach was chosen. The current status of the project allows a remote user to give simple commands to the experimental robot using a microphone, a laptop or a PDA and wireless connection. 2008 Article Automatic speech recognition system for controlling a robotic system using Romanian / A. Goloca, S.G. Pentiuc // Оптико-електронні інформаційно-енергетичні технології. — 2008. — № 2 (16). — С. 109-113. — Бібліогр.: 12 назв. — англ. 1681-7893 http://dspace.nbuv.gov.ua/handle/123456789/32189 en Оптико-електронні інформаційно-енергетичні технології Інститут фізики напівпровідників імені В.Є. Лашкарьова НАН України
institution Digital Library of Periodicals of National Academy of Sciences of Ukraine
collection DSpace DC
language English
topic Системи технічного зору і штучного інтелекту з обробкою та розпізнаванням зображень
Системи технічного зору і штучного інтелекту з обробкою та розпізнаванням зображень
spellingShingle Системи технічного зору і штучного інтелекту з обробкою та розпізнаванням зображень
Системи технічного зору і штучного інтелекту з обробкою та розпізнаванням зображень
Goloca, A.
Pentiuc, S.G.
Automatic speech recognition system for controlling a robotic system using Romanian
Оптико-електронні інформаційно-енергетичні технології
description In this paper, we describe a proposed Automatic Speech Recognition system that can be used to give vocal commands to a robotic system. The system was specially created to use Romanian as the main language with respect to its particularities: specific phonetic rules and large variety of accents. The main purpose of the system is to act as a touch-less interface for a demonstrative robotic system. This interface was designed to allow a remote human user to give simple commands to the robotic system without any physical contact with the robot. There are more ways to create an Automatic Speech Recognition System but, given the complexity of the task, the methods that were used had to be reliable and this is the reason while the Hidden Markov Models approach was chosen. The current status of the project allows a remote user to give simple commands to the experimental robot using a microphone, a laptop or a PDA and wireless connection.
format Article
author Goloca, A.
Pentiuc, S.G.
author_facet Goloca, A.
Pentiuc, S.G.
author_sort Goloca, A.
title Automatic speech recognition system for controlling a robotic system using Romanian
title_short Automatic speech recognition system for controlling a robotic system using Romanian
title_full Automatic speech recognition system for controlling a robotic system using Romanian
title_fullStr Automatic speech recognition system for controlling a robotic system using Romanian
title_full_unstemmed Automatic speech recognition system for controlling a robotic system using Romanian
title_sort automatic speech recognition system for controlling a robotic system using romanian
publisher Інститут фізики напівпровідників імені В.Є. Лашкарьова НАН України
publishDate 2008
topic_facet Системи технічного зору і штучного інтелекту з обробкою та розпізнаванням зображень
url http://dspace.nbuv.gov.ua/handle/123456789/32189
citation_txt Automatic speech recognition system for controlling a robotic system using Romanian / A. Goloca, S.G. Pentiuc // Оптико-електронні інформаційно-енергетичні технології. — 2008. — № 2 (16). — С. 109-113. — Бібліогр.: 12 назв. — англ.
series Оптико-електронні інформаційно-енергетичні технології
work_keys_str_mv AT golocaa automaticspeechrecognitionsystemforcontrollingaroboticsystemusingromanian
AT pentiucsg automaticspeechrecognitionsystemforcontrollingaroboticsystemusingromanian
first_indexed 2025-07-03T12:43:07Z
last_indexed 2025-07-03T12:43:07Z
_version_ 1836629715401572352
fulltext 5 ALEXANDRU GOLOCA, PENTIUC STEFAN-GHEORGHE AUTOMATIC SPEECH RECOGNITION SYSTEM FOR CONTROLLING A ROBOTIC SYSTEM USING ROMANIAN The “Stefan cel Mare” University, 13 th , University Street, Romania, tel.: +40-230-216147, E-mail: alexg@eed.usv.ro Abstract. In this paper, we describe a proposed Automatic Speech Recognition system that can be used to give vocal commands to a robotic system. The system was specially created to use Romanian as the main language with respect to its particularities: specific phonetic rules and large variety of accents. The main purpose of the system is to act as a touch-less interface for a demonstrative robotic system. This interface was designed to allow a remote human user to give simple commands to the robotic system without any physical contact with the robot. There are more ways to create an Automatic Speech Recognition System but, given the complexity of the task, the methods that were used had to be reliable and this is the reason while the Hidden Markov Models approach was chosen. The current status of the project allows a remote user to give simple commands to the experimental robot using a microphone, a laptop or a PDA and wireless connection. Key words: Automatic Speech Recognition System, Hidden Markov Models approach. INTRODUCTION Computer Speech recognition, also known as Automatic Speech recognition (ASR) enables a computer, or a computer guided device to recognize spoken words in an automated way, with no human aid. ASR offers the possibility to create new touch-less interfaces for computers and this further opens the door for new developments in many areas of interest. One particular area is the medical field, where speech recognition techniques can be implemented on high-end surgical equipment and another is robotics, where a human can control a robot using nothing but the voice. THE CURRENT STATUS All Speech recognition consists of two problems. They are not completely separated but have different degree of difficulty, as presented in 5 and 9. The first and “simple” problem is the isolated word recognition. The problem is not really that simple. It assumes that there is an utterance which is known to contain a single word. The requirement is to identify the one word contained in the spoken data. The second and most complex problem is contextual speech recognition. This involves recognizing words inside a proposition, propositions inside a phrase and even phrases. This “complex problem” is not discussed in this paper but it should be mentioned that this approach is used by complex commercial ASRs. The Accuracy of the system is one key factor and it is measured in word error rate.  ALEXANDRU GOLOCA, PENTIUC STEFAN-GHEORGHE, 2008 ПРИНЦИПОВІ КОНЦЕПЦІЇ ТА СТРУКТУРУВАННЯ РІЗНИХ РІВНІВ ОСВІТИ З ОПТИКО-ЕЛЕКТРОННИХ ІНФОРМАЦІЙНО- ЕНЕРГЕТИЧНИХ ТЕХНОЛОГІЙ 6 Another classification is made from the degree of independence that the system possesses, or how much do the accuracy of the system depends on the user: • User dependent speech recognition systems: they offer good results for a particular speaker (the one that was used for training the system) but less good results if another person is using the system. If the other person has very different voice parameters, the results are poor. • User independent speech recognition systems: they offer good results for any category of speakers (from a global point of view) but not very good results for particular categories of speakers. There are three main speech recognition techniques (as presented in 4) that have been developed during the past years and some of them are used to develop hybrid models. These main techniques are: 1. Dynamic Time Warping (DTW); 2. Artificial Neural Networks (ANN); 3. The Hidden Markov Model (HMM); 4. Hybrid models. DYNAMIC TIME WARPING This is, according to 4, the first technique used but it offers poor results in ASR. It is important only from a historical point of view and is not used in nowadays large scale applications. ARTIFICIAL NEURAL NETWORKS If Artificial Neural Networks are to be used as a tool for Automatic Speech Recognition, the input dataset consists of a series of speech vectors. These vectors contain relevant information about a pronounced word. A training set is presented to the system and the system is adjusting itself in such a way that would generate a correct classification of the input data (the word that was pronounced is already known for the training set). After that was done, the system is able to receive test data (the words that need classification) and identify the word that was pronounced. HIDDEN MARKOV MODEL The approach based on Hidden Markov Models represents the main choice for speech recognition applications in our days (according to 4, 5 and 6) and it seems that future developments are still possible. HMMs are used in many types of applications where there is a need to classify objects. These applications can vary from speech recognition to handwriting recognition, posture recognition and so on. A HMM represents a statistic model in which the modeled system is assumed to be a “Markov Process” with unknown parameters. The main challenge is to compute the hidden parameters knowing only the visible parameters (those that can be observed). The extracted parameters can be used after that for several different purposes, but the main goal is to build a system that is able to classify objects. From the ASR point of view spoken words are objects that have to be associated with a class and Hidden Markov Models are able to perform that. ПРИНЦИПОВІ КОНЦЕПЦІЇ ТА СТРУКТУРУВАННЯ РІЗНИХ РІВНІВ ОСВІТИ З ОПТИКО-ЕЛЕКТРОННИХ ІНФОРМАЦІЙНО- ЕНЕРГЕТИЧНИХ ТЕХНОЛОГІЙ 7 A typical HMM was illustrated in Fig. 1. The meaning of its parameters is: • x - Hidden states: these cannot be directly observed but their effect exists and can be observed directly; • y - System outputs: these can be directly observed and represent the effect of internal state transitions; • a - Transition probabilities: these represent the probabilities for the system to pass from state to state; • b - Output probabilities: these represent the probabilities to obtain certain values for system outputs. Fig. 1. A HMM and the meaning of its parameters There are three main problems (challenges) involving HMMs: 1. The model parameters are given and the requirement is to compute the probability to obtain a certain sequence as system output. This problem is solved using an algorithm called “The forward-backward procedure”, an advanced algorithm which comes from the “dynamic programming” family of algorithms. 2. The model parameters are given and the requirement is to compute the most plausible sequence of states that generated the observed output sequence. This problem is solved using the Viterbi algorithm (presented in 10 and 11). This algorithm allows retracing the most plausible sequence of states that has been used in the training stage. This is the “recognition problem” or the classification. 3. One output sequence is given, or even a set of sequences and the requirement is to compute the most likely internal transitions and output probabilities that allowed for the output sequence to occur. This has the meaning of “training the model”. This problem is solved by the Baum-Welch algorithm (as described in 12). This algorithm adjusts the requested probabilities until the output sequence coincides with the desired one. HYBRID MODELS These are the “top models” and they offer probably the best results, being used in commercial applications. They are based on both ANN and HMM, according to 4 and 5. An Artificial Neural Network is used to identify parts of a word (called phonemes). After the phonemes have been identified, they feed multiple HMMs. These HMMs which will compute the most probable word made from those phonemes. More detailed: several HMMs are used: there is one for every word that could possibly be recognized. All models offer a probability (the probability that the observed word was generated by the current model) and the maximum value from all those models will indicate the most likely word (the model that really produces that word is expected to offer a very high value for that probability). A PROPOSED SYSTEM BASED ON HMM This proposed system uses the following steps in order to accomplish classification of the spoken words: 1. Sound acquisition: this step is made of more sub-stages, like: converting the sound waves into electric analogue signal (done with a microphone), followed by the conversion of analogue signal to discrete numeric values using an ADC (Analogue to Digital Converter) circuit. At this step is very important that the surrounding environment has a low noise level; 2. Preprocessing raw data: also consists of sub-stages: ПРИНЦИПОВІ КОНЦЕПЦІЇ ТА СТРУКТУРУВАННЯ РІЗНИХ РІВНІВ ОСВІТИ З ОПТИКО-ЕЛЕКТРОННИХ ІНФОРМАЦІЙНО- ЕНЕРГЕТИЧНИХ ТЕХНОЛОГІЙ 8 • Signal normalization and segmentation: the level of noise is detected using different methods, and considered to represent “silence signal”. The signal then is sliced in small portions called windows or frames. These may contain data representing from 20 to 50 milliseconds from the digital signal. • Digital filtering makes sure that all spikes appearing at the beginning and the end of each frame will not affect the final result. In order to obtain that effect, a multiplication with a Hamming window is used. After the Hamming filtering cut the unwanted spikes from the beginning and the end of each frame, it’s time for another filter: a low-pass filter. • Word boundaries detection: At this step words are being separated from silence in order to be extracted for analysis. 3. Feature extraction: at this very important stage one must choose what is meaningful from the preprocessed signal. This is needed because using the entire signal from a frame is not going to provide enough useful information and hence an extraction method is needed. There are some relevant parameters that could be used but the most popular approaches are: • Linear Prediction Coding (LPC): uses some polynomial approximations but is not a very popular method; • Cepstral Processing: a very popular method which uses Mel Fourier Cepstral Coefficients (MFCC), described in 13 and was used in 14. Obtaining these coefficients means using many complicated operations like Discrete Fourier Transform (as presented in 8), Logarithmical operations and Discrete Cosine Transform. The required steps are presented in Fig. 2 Fig. 2. Obtaining the MFCCs Training the HMM is achieved using the Baum-Welch algorithm. Inner parameters of the HMM are adjusted until it will generate the desired output sequence. A codebook can be used at this step and it ensures a very fast operating when it will come to actually recognizing the word (see next step). In order to build the codebook Vector Quantization can be performed on the speech vectors extracted at the Feature Extraction step. A codebook contains a number of points from a n- size space and a number of centroids (weight centers, also n-sized points); Classification using HMMs. This is done using the Viterbi algorithm. THE CURRENT PROJECT. PROPOSED EXTENSIONS The goal of the current project is to create a system capable of offering a similarity measure for a spoken word. More precise, a person is supposed to pronounce a certain command word and the system is supposed to evaluate the pronunciation and decide what the given command was. Further, the command is sent to a robot for execution (e.g. change of direction, acceleration and so on). ПРИНЦИПОВІ КОНЦЕПЦІЇ ТА СТРУКТУРУВАННЯ РІЗНИХ РІВНІВ ОСВІТИ З ОПТИКО-ЕЛЕКТРОННИХ ІНФОРМАЦІЙНО- ЕНЕРГЕТИЧНИХ ТЕХНОЛОГІЙ 9 Most of the methods already presented are being used in this project. The chosen software platform was Java, due to its high portability (can work on PCs as well as on embedded systems that offer Java support) and very rich libraries (known as packages). The implemented training stages (up until now) are: • Sound acquisition; • Windowing is the action of slicing the signal in equal sized frames. This is done using overlapping windows. • Digital filtering: a Hamming window and a Low-pass filter are used for each window in order to ensure a clean signal for the following steps. • Word boundary detection is done using the zero-crossing rate with threshold method since it offers good results and is able to deal with both low-frequency and medium-frequency noise. • Feature extraction is performed using Cepstral Processing. After a complex succession of steps, 12 Mel Fourier Cepstral Coefficients are extracted from a frame of 512 elements. • A codebook is built for every trained word using Vector Quantization (VQ) applied on the MFCCs already obtained. In order to build the codebook two algorithms had to be used K-means and Binary Split; • Codebooks can be stored on files after they have been built and they can be loaded whenever they are needed (training has to be done just once for a training set, not each time recognition is needed). The implemented recognition steps are as follows (in italics those that are the same as for training): • Sound acquisition; • Windowing; • Digital filtering; • Word boundary detection; • Feature extraction; • Distance computing and recognition. At this final step a vector containing relevant speech data (the Mel Fourier Cepstral Coefficients) is confronted with all existing codebooks, previously loaded from files. A distortion is calculated for each codebook. This is the sum of Euclidian Distances between the points belonging to the current word and those belonging to the centroids in the codebook. After computing all the distortions, the minimum distortion is found and the codebook that featured that distortion is assumed to indicate the most likely word. Using other kinds of distances, experiments have been performed, in order to find whether they would provide a better measure of similarity. Some distances used were: Manhattan Distance, Cebisev Distance or Mahalanobis Distance. CONCLUSIONS There are some differences between the classical speech recognition problem and the current project. The most important ones could be: • The need for an accurate recognition: Unlike dictation software that would simply write text, this system is supposed to control the actions of a robot using commands. This gives it great responsibilities since its actions might affect people (e.g. the robot could hit people while moving as a result of deficient understanding of commands). • There is the need to expand the number of possible commands in the future when the complexity of the robot might increase. This should be done without rewriting the speech recognition software. • There are a number of difficulties that arise from the fact that the main used language is not English but Romanian, as it had been shown in 15. This means that phonetics and phonetic rules are not the same as English ones and the available English libraries can’t be used. ACKNOLEDGEMENTS ПРИНЦИПОВІ КОНЦЕПЦІЇ ТА СТРУКТУРУВАННЯ РІЗНИХ РІВНІВ ОСВІТИ З ОПТИКО-ЕЛЕКТРОННИХ ІНФОРМАЦІЙНО- ЕНЕРГЕТИЧНИХ ТЕХНОЛОГІЙ 10 This research was financed by the 56-CEEX (TERAPERS) and 131-CEEX (INTEROB) research grants. REFERENCES 4. “Speech Recognition”, http://en.wikipedia.org/wiki/Speech_ recognition. 5. Young Steve, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying (Andrew) Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, Phil Woodland, The HTK Book. COPYRIGHT 2001-2006 Cambridge University Engineering Department. 6. “Hidden Markov Model”, http://en.wikipedia.org/wiki/ Hidden_Markov_model. 7. Smith, Steven W., “The Scientist and Engineer's Guide to Digital Signal Processing”, Second Edition, California Technical Publishing, San Diego, California. 8. “Discrete Fourier Transform”, http://en.wikipedia.org/wiki/Discrete_Fourier_ transform. 9. Rabiner L. R., “A tutorial on hidden Markov models and selected applications in speech recognition”. 10. Viterbi Andrew J., “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm”, IEEE Transactions on Information Theory 13(2):260–269, April 1967. (Section IV.). 11. “Viterbi Algorithm”, http://en.wikipedia.org/wiki/Viterbi_ algorithm. 12. “Baum-Welch Algorithm”, http://en.wikipedia.org/ wiki/Baum_Welch_algorithm. 13. “Cepstrum”, http://en.wikipedia.org/wiki/Cepstrum. 14. Iwanashi, Naoto: Active and Unsupervised Learning for Spoken Word Acquisition Through a Multimodal Interface. 15. Chivu C. 2007. Applications of Speech Recognition for Romanian Language. Advances in Electrical and Computer Engineering, Suceava, Romania 1/2007, volume 7 (14), pp. 29-3. Надійшла до редакції: 29.11.2008р. ALEXANDRU GOLOCA - assistant in Electrical Engineering and Computer Science at the “Stefan cel Mare” University of Suceava, рhone: +40-727-713078, E-mail: alexg@eed.usv.ro STEFAN-GHEORGHE PENTIUC - professor on the Faculty of Electrical Engineering and Computer Science University “Stefan cel Mare” of Suceava, Romania, phone: +40.230.524.801, http://www.eed.usv.ro/~pentiuc , E-mail: pentiuc@usv.ro