This paper describes the development of multilevel, multidecision systems for speech-to-text conversion based on phonemes and syllables. The full variety of phonemes is covered when selecting the training and control sets used to estimate the parameters of the acoustic recognition models. The acoustic model parameters are estimated on a single-speaker speech corpus. The factors that compensate for the mismatch between the scales of the acoustic and linguistic model scores are analyzed and their values explored. A method for converting the phonetic decoder output into word sequences is described. Experimental results and plans for future work are discussed.

Keywords: multilevel speech recognition, syllable, control sets, continuous speech.
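To illustrate the scale-compensation idea mentioned above, decoders commonly combine acoustic and language-model log-probabilities, which live on different numeric scales, using a language-model weight and often a word-insertion penalty. The following is a minimal sketch under that assumption; all names and numbers (`lm_weight`, `word_penalty`, the toy scores) are illustrative, not the paper's actual values.

```python
def combined_score(acoustic_logp, lm_logp, n_words,
                   lm_weight=10.0, word_penalty=-0.5):
    """Weighted combination of acoustic and LM log-scores for one hypothesis.

    The LM weight compensates for the scale mismatch between the two models;
    the word-insertion penalty discourages spuriously long word sequences.
    """
    return acoustic_logp + lm_weight * lm_logp + word_penalty * n_words

def best_hypothesis(hypotheses, lm_weight=10.0, word_penalty=-0.5):
    """Pick the word sequence with the highest combined score.

    `hypotheses` is a list of (words, acoustic_logp, lm_logp) tuples.
    """
    return max(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2], len(h[0]),
                                     lm_weight, word_penalty),
    )[0]

if __name__ == "__main__":
    # Two toy hypotheses: the LM score decides which one wins
    # once it is scaled up to match the acoustic score.
    hyps = [
        (["recognize", "speech"], -120.0, -4.0),
        (["wreck", "a", "nice", "beach"], -118.0, -9.0),
    ]
    print(best_hypothesis(hyps))  # → ['recognize', 'speech']
```

In practice such weights are tuned on a held-out set, which matches the abstract's statement that the compensation factors are analyzed and their values explored experimentally.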