In this paper we investigate the extraction of speech data from an audio stream. Our method relies on unsupervised, optimal self-segmentation of the audio stream into small, homogeneous segments. Homogeneity is defined on the basis of the average amplitude and the zero crossings in a frame, and it is measured by entropy. In our approach we calculate the relative ratio between the average amplitudes of neighboring homogeneous segments. For a speech signal this ratio is below a threshold that is determined from a short sample of pure speech. As the discriminative feature we use the percentage of homogeneous segments within a 1-second interval that have a high relative amplitude ratio. During classification, each 1-second interval is labeled incrementally as either speech or non-speech. The discrimination technique shows high performance on more than six hours of data that include different types of audio.
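To make the discriminative feature concrete, the following is a minimal Python sketch, not the implementation evaluated in this paper. It makes several assumptions: the entropy-based segmentation has already produced one average amplitude per homogeneous segment; the relative ratio is taken as the larger amplitude divided by the smaller one; and the percentage is counted over pairs of neighboring segments. The function names, the `eps` guard, and the threshold values are hypothetical and chosen only for illustration.

```python
def frame_features(samples, frame_len):
    """Return (average absolute amplitude, zero-crossing count) for each full frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        avg_amp = sum(abs(x) for x in frame) / frame_len
        # A zero crossing is counted when consecutive samples differ in sign.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        features.append((avg_amp, crossings))
    return features


def relative_ratio(amp_a, amp_b, eps=1e-12):
    """Assumed definition: larger average amplitude divided by the smaller one."""
    hi, lo = max(amp_a, amp_b), min(amp_a, amp_b)
    return hi / (lo + eps)  # eps avoids division by zero for silent segments


def high_ratio_percent(segment_amps, threshold):
    """Percentage of neighboring segment pairs whose ratio exceeds the threshold.

    segment_amps holds the average amplitudes of the homogeneous segments
    that fall inside one 1-second interval (segmentation is assumed done).
    """
    pairs = list(zip(segment_amps, segment_amps[1:]))
    if not pairs:
        return 0.0
    high = sum(1 for a, b in pairs if relative_ratio(a, b) > threshold)
    return 100.0 * high / len(pairs)
```

In use, `high_ratio_percent` would be called once per 1-second interval with a threshold estimated from a short pure speech sample, and the resulting percentage would be passed to the classifier that labels the interval as speech or non-speech.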