This paper presents an automatic continuous speech recognition system with a statistical language model. 
To choose documents for training the language model, we propose downloading a large amount of text 
from the Internet and using a modified K-means clustering algorithm to select the useful documents. 
First, experts form a small corpus of documents whose subjects are known. 
An agglomerative algorithm then builds the initial clusters for the K-means algorithm. 
A threshold is introduced to reject documents that lie far from the clusters.
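
The selection procedure outlined above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the vector representation of documents, the cosine distance, the centroid-linkage merging rule, the threshold value, and all function names (`agglomerate`, `kmeans_with_rejection`) are assumptions made for illustration only.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def agglomerate(seed_vectors, k):
    """Merge the two closest clusters (by centroid distance) until k remain.
    Returns the k centroids used as initial clusters for K-means."""
    clusters = [[v] for v in seed_vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans_with_rejection(docs, centroids, threshold, max_iter=20):
    """K-means in which a document farther than `threshold` from every
    centroid is rejected (assigned None) and ignored in centroid updates."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        new_assignment = []
        for d in docs:
            dists = [cosine_distance(d, c) for c in centroids]
            nearest = min(range(len(centroids)), key=dists.__getitem__)
            new_assignment.append(nearest if dists[nearest] <= threshold else None)
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are too
        assignment = new_assignment
        for idx in range(len(centroids)):
            members = [d for d, a in zip(docs, assignment) if a == idx]
            if members:
                centroids[idx] = centroid(members)
    return assignment, centroids

# Toy data (hypothetical): four expert-chosen seed documents on two subjects,
# and three downloaded documents, one of which is off-topic.
seeds = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
downloaded = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]

initial = agglomerate(seeds, k=2)
assignment, final_centroids = kmeans_with_rejection(downloaded, initial, threshold=0.3)
print(assignment)
```

In this sketch a rejected document is marked `None` and contributes to no centroid, so off-topic texts cannot pull the clusters away from the subjects fixed by the expert corpus.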