This paper describes the development of multilevel, multidecision systems for speech-to-text conversion based on phonemes and syllables. The full variety of phonemes is covered when selecting the training and control sets used to estimate the parameters of the acoustic recognition models. The acoustic model parameters are estimated on a single-speaker speech corpus. The factors that compensate for the mismatch between the scales of the acoustic and linguistic model scores are analyzed and their values explored. A method for converting the phonetic decoder output into word sequences is described. Experimental results and plans for future work are discussed.

Keywords: multilevel speech recognition, syllable, control sets, continuous speech.
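To illustrate the scale-compensation idea mentioned above, decoders commonly combine acoustic and language-model log-probabilities, which live on different numeric scales, using a language-model weight and often a word-insertion penalty. The following is a minimal sketch under that assumption; all names and numbers (`lm_weight`, `word_penalty`, the toy scores) are illustrative, not the paper's actual values.

```python
def combined_score(acoustic_logp, lm_logp, n_words,
                   lm_weight=10.0, word_penalty=-0.5):
    """Weighted combination of acoustic and LM log-scores for one hypothesis.

    The LM weight compensates for the scale mismatch between the two models;
    the word-insertion penalty discourages spuriously long word sequences.
    """
    return acoustic_logp + lm_weight * lm_logp + word_penalty * n_words

def best_hypothesis(hypotheses, lm_weight=10.0, word_penalty=-0.5):
    """Pick the word sequence with the highest combined score.

    `hypotheses` is a list of (words, acoustic_logp, lm_logp) tuples.
    """
    return max(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2], len(h[0]),
                                     lm_weight, word_penalty),
    )[0]

if __name__ == "__main__":
    # Two toy hypotheses: the LM score decides which one wins
    # once it is scaled up to match the acoustic score.
    hyps = [
        (["recognize", "speech"], -120.0, -4.0),
        (["wreck", "a", "nice", "beach"], -118.0, -9.0),
    ]
    print(best_hypothesis(hyps))  # → ['recognize', 'speech']
```

In practice such weights are tuned on a held-out set, which matches the abstract's statement that the compensation factors are analyzed and their values explored experimentally.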