 Acoustic characteristics of speech


    The speech signal has a dual nature. On the one hand, it is an ordinary acoustic signal, that is, a process in which the energy of acoustic vibrations propagates through an elastic medium. Like any acoustic signal, it can be represented as sound waves: propagating regions of compression and rarefaction of the particles of the medium, whose wavefront shapes depend on the properties of the source and on the propagation conditions.

    Therefore, like other acoustic signals, speech is characterized by a set of objective characteristics: the dependence of sound pressure on time (the temporal structure of the sound wave), the duration of the sound, its spectral composition, and so on.

    On the other hand, speech as a physical phenomenon produces certain subjective auditory sensations (loudness, pitch, timbre, localization, masking, etc.); the relationship between the objective characteristics and these sensations was the subject of the previous articles on psychoacoustics.

    A speech signal undergoes the same processing in the auditory system as any other acoustic signal, i.e. the same auditory sensations are formed from its analysis. For example, the perception of speech in a completely unfamiliar language is no different from the perception of other surrounding acoustic information such as noise, whistles or clicks.

    However, if a person hears speech in a language he or she already knows, then, along with the processing of purely acoustic information (loudness, pitch, timbre, etc.), phonetic and then semantic decoding of the information takes place, for which special parts of the brain are engaged.

    For many decades, and especially intensively in recent years with the development of automatic speech recognition and synthesis systems, the acoustic characteristics of speech signals have been studied, and attempts have been made to establish a connection between the acoustic parameters and the phonetic features of speech signals, i.e. to understand how the brain, having received information about the change in sound pressure over time, extracts the semantic content of speech. Many results have already been obtained in this direction: the books and articles on these problems number in the thousands; as an example I can cite one of the latest books by the well-known scientist M.

    However, the study of the purely acoustic characteristics of speech signals is of considerable independent value for sound recording systems, radio broadcasting, computer speech processing, etc., that is, for all the processes of recording, processing, transmitting and reproducing speech signals that are fundamentally important for the work of a sound engineer. Therefore, let us start with the analysis of the acoustic characteristics of speech signals, and then try to dwell on their connection with phonetic features and on the currently existing theories of auditory perception and speech processing.

    Fig. 1. Levelgram of a speech signal

    Analysis of the acoustic characteristics of a speech signal begins with recording the change in sound pressure over time with a microphone; this dependence of the instantaneous sound pressure on time is displayed as an oscillogram (waveform). In technical applications, in particular in computer processing, what is usually recorded as a function of time is the sound pressure level averaged over a certain time interval; this dependence is called a levelgram. An example of a levelgram for the word "welcome" is shown in Figure 1.

    The appearance of the levelgram depends significantly on the averaging time and the averaging method. Every audio program asks the user to choose them (though, as practice shows, users do not always realize this).

    The averaging method can be uniform or exponential (for example, the uniform and exponential options in Sound Forge).

    The averaging time is usually chosen as 1 ... 2 ms for the peak level, 15 ... 20 ms for the objective level, and 150 ... 200 ms for the subjective level. In the first case an accurate record of the peak values of the signal is obtained; in the second, unnecessary fine detail is removed (this time is usually used in computer speech processing); in the third, the time corresponds to the interval over which the auditory system recognizes timbre.
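    As a rough illustration of the two averaging methods, the sketch below computes a levelgram from a list of samples in plain Python. The function names, the synthetic 200 Hz test tone, the 8 kHz sample rate and the 0 dB reference (a mean square of 1.0) are assumptions made for the example; they are not taken from Sound Forge or any other program.

```python
import math

def levelgram_uniform(samples, sample_rate, window_s):
    """Level in dB for consecutive non-overlapping windows (uniform averaging)."""
    n = max(1, int(round(window_s * sample_rate)))
    levels = []
    for start in range(0, len(samples) - n + 1, n):
        block = samples[start:start + n]
        mean_square = sum(x * x for x in block) / n
        # 1e-12 is a floor that avoids log10(0) on digital silence
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels

def levelgram_exponential(samples, sample_rate, time_constant_s):
    """Running level in dB with exponential averaging of the squared signal."""
    alpha = 1.0 - math.exp(-1.0 / (time_constant_s * sample_rate))
    mean_square = 0.0
    levels = []
    for x in samples:
        mean_square += alpha * (x * x - mean_square)
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels

# Assumed test signal: 0.5 s of a 200 Hz tone with amplitude 1.0 at 8 kHz
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(4000)]

# 20 ms is the "objective" averaging time mentioned in the text
uniform_levels = levelgram_uniform(tone, sample_rate, 0.020)
exponential_levels = levelgram_exponential(tone, sample_rate, 0.020)
```

    For this steady tone both methods settle near -3 dB (a mean square of 0.5). On real speech, shortening the averaging time to 1 ... 2 ms makes the curve follow individual peaks, and lengthening it to 150 ... 200 ms smooths it considerably.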

    If the averaged values of a signal remain the same over certain periods of time, the signal is called stationary. Sound signals (speech and music) are quasi-random and non-stationary, although for speech one can indicate approximate time intervals (about 2 ... 3 minutes) over which speech signals can be considered quasi-stationary.

    The levelgrams obtained make it possible to carry out statistical, correlation and spectral analysis of the speech signal. This can be done with conventional audio programs as well as with special programs designed for speech signals and their specific features: Ultrasound (Australia), CSRE (England), Viper (Germany), Praat (Holland), Phonograph (Russia), etc.

    Since a speech signal, like a music signal, is quasi-random, i.e. its future values can be predicted only with a certain probability, all the known methods of statistical analysis can be used to analyze its characteristics.

    In this analysis, the distributions of the following quantities are investigated:

    instantaneous values and levels of the speech signal;
    durations of continuous existence of different levels;
    durations of pauses;
    distribution of maximum levels by frequency;
    distribution of current and average power;
    power spectral density.

    In addition, parameters that are important for sound recording practice can be determined, such as the dynamic range and crest factor, the distribution of the fundamental frequency of phonation, the spectral distribution of formants, etc.

    Knowledge of the statistical characteristics of speech signals is necessary for the optimal organization of sound broadcasting systems, sound recording systems, modern speech compression systems, etc. 

    These characteristics were studied for Russian speech in the works of Furduev, Rimsky-Korsakov, Sapozhkov, Belkin, Shitov and others.

    First of all, the levelgram of the speech signal directly yields information on the distribution of the instantaneous values and levels of the signal in time and on how long they exceed a given value. This makes it possible to determine the dynamic range and crest factor of a speech signal, as well as to establish the distribution of the durations of pauses and of segments of continuous speech, the distribution of the current and average signal power in time, etc.

    Fig. 2. Probability density distribution of the instantaneous values of the speech signal: 1 - voice-over text; 2, 3, 4 - artistic reading

    To dwell on these data very briefly: the probability density distribution of the instantaneous values of the speech signal shown in Figure 2 is exponential and differs significantly from the normal distribution, which is followed, for example, by jazz or choral music. Statistical analysis of the durations of continuous existence of different levels in the speech signal shows that the most probable peaks last 12 ... 17 ms, from which it follows that the maximum signal levels are reached only for short periods of time.

    The durations of pauses in speech signals are also randomly distributed; their average duration for speech is 0.4 s, and the total duration of pauses reaches 5% of the transmission time.

    The most important information that can be obtained from analysis of the levelgram is the dynamic range of the speech signal and its crest factor.

    The dynamic range of an audio signal is the difference between its quasi-maximum and quasi-minimum levels: D = Lmax - Lmin. The quasi-maximum level Lmax is the signal level that is exceeded by peaks for 1% (for speech) or 2% (for music) of the total duration of the signal segment.

    The quasi-minimum level Lmin is defined similarly (the relative durations are 99% and 98% respectively). The crest factor is defined as the difference between the quasi-maximum and the average signal level: P = Lmax - Lav.

    The dynamic range of speech signals is 35 ... 45 dB, and the crest factor is 10 ... 12 dB.
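    The definitions of the quasi-maximum level, quasi-minimum level and crest factor can be sketched as follows. This is only an illustration: the function names are made up, the input is assumed to be a list of levelgram values in dB taken at equal time steps, the average level is assumed to be the level of the mean power, and the test data are synthetic rather than measured speech.

```python
import math

def level_exceeded(levels_db, fraction):
    """Level that is exceeded for roughly the given fraction of the time."""
    ordered = sorted(levels_db, reverse=True)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]

def dynamic_range_and_crest(levels_db, peak_fraction=0.01):
    """Return (dynamic range, crest factor) in dB; peak_fraction is 0.01 for speech."""
    l_max = level_exceeded(levels_db, peak_fraction)        # quasi-maximum
    l_min = level_exceeded(levels_db, 1.0 - peak_fraction)  # quasi-minimum
    # Assumption: the average level is the level of the mean power
    mean_power = sum(10.0 ** (l / 10.0) for l in levels_db) / len(levels_db)
    l_av = 10.0 * math.log10(mean_power)
    return l_max - l_min, l_max - l_av

# Synthetic levelgram: 2% loud peaks, 96% mid-level sound, 2% near-silence
levels = [70.0] * 20 + [50.0] * 960 + [20.0] * 20
dynamic_range, crest_factor = dynamic_range_and_crest(levels)
```

    For music, the same functions would be called with peak_fraction=0.02, following the 2% and 98% figures given above.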

    Conditions          Distance (cm)   Average sound pressure, Pa (dB)   Peak power (mW)   Crest factor (dB)   Region of maximum levels (Hz)
    Telephone speech    2.5
      average level                     2 (100)                           0.24              12                  250-500
      loud                              4 (106)                           4                 18                  500-1000
      quiet                             1 (94)                            0.025             8                   250-500
    Conversation        100             0.05 (68)                         0.5               10                  250-500
    Public speaker      100             0.1 (74)                          2.0               12                  250-500

    Some data on the sound pressure and power levels developed by a speech signal are given in the table.

    If the sound pressure levels for telephone speech are recalculated to a distance of 100 cm, the following values are obtained: 68, 74 and 62 dB.
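    The recalculation can be checked with a few lines of code. The sketch assumes simple spherical spreading, in which the level falls by 20·log10 of the distance ratio; the text does not state which law was used, and so close to the mouth the real behaviour may deviate from it. The function and variable names are illustrative only.

```python
import math

def level_at_distance(level_db, d_from_cm, d_to_cm):
    """Level after moving from d_from_cm to d_to_cm, assuming spherical spreading."""
    return level_db - 20.0 * math.log10(d_to_cm / d_from_cm)

# Telephone speech levels at 2.5 cm, taken from the table
telephone_levels = {"average": 100, "loud": 106, "quiet": 94}

at_one_metre = {
    name: round(level_at_distance(level, 2.5, 100))
    for name, level in telephone_levels.items()
}
print(at_one_metre)  # {'average': 68, 'loud': 74, 'quiet': 62}
```

    The distance ratio of 40 corresponds to a drop of about 32 dB, which reproduces the three values quoted above.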

    It should be noted that for vocal speech (singing) these levels are significantly higher and can reach 115 dB at 1 m. ... sing at La Scala; if it is below 100 dB, then in a chamber ensemble; and if it is below 90 dB, then there is no need to sing at all... I wonder how many people would be left singing on stage today under such a criterion?

    Correlation analysis of a speech signal makes it possible to calculate the running autocorrelation function and to establish the homogeneity limit, which is determined by the time over which the autocorrelation function reaches a certain limiting value that no longer depends on the delay time. For speech, this limit is 3 ... 5 s.
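    For reference, a normalized autocorrelation function can be computed directly from its definition, as in the sketch below. It shows only the autocorrelation itself, not the procedure for finding the homogeneity limit, which the text does not describe in detail. The function name and the periodic test signal are assumptions made for the example.

```python
import math

def autocorrelation(samples, max_lag):
    """Normalized autocorrelation r[0..max_lag] of a mean-removed sequence."""
    mean = sum(samples) / len(samples)
    centred = [x - mean for x in samples]
    energy = sum(c * c for c in centred)
    result = []
    for lag in range(max_lag + 1):
        # Sum of products of the sequence with a copy of itself delayed by `lag`
        acc = sum(centred[i] * centred[i + lag] for i in range(len(centred) - lag))
        result.append(acc / energy)
    return result

# Assumed test signal: a sine wave with a period of 40 samples
signal = [math.sin(2 * math.pi * i / 40) for i in range(400)]
r = autocorrelation(signal, 60)
```

    For this periodic signal the function equals 1 at zero lag, is strongly negative at half a period (lag 20) and returns close to 1 at a full period (lag 40).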

    Spectral analysis of a speech signal, as of any acoustic signal that varies continuously in time, can be performed on the recorded signal using the Fourier transform.

    Any audio editor provides a fast Fourier transform (FFT) operation, which calculates the spectrum of a selected segment of the recording.
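    What such an operation computes can be illustrated with a direct discrete Fourier transform. The sketch below is much slower than a real FFT but gives the same spectrum; the Hann window, the function name and the synthetic 500 Hz test tone are assumptions chosen for the example, not the settings of any particular editor.

```python
import cmath
import math

def magnitude_spectrum_db(segment, sample_rate):
    """Return (frequency in Hz, magnitude in dB) pairs for a windowed segment."""
    n = len(segment)
    # Hann window to reduce leakage from the segment edges
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    windowed = [s * w for s, w in zip(segment, window)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        freq = k * sample_rate / n
        spectrum.append((freq, 20.0 * math.log10(max(abs(acc), 1e-12))))
    return spectrum

# Assumed test segment: 256 samples of a 500 Hz tone at 8 kHz
sample_rate = 8000
segment = [math.sin(2 * math.pi * 500 * i / sample_rate) for i in range(256)]

spectrum = magnitude_spectrum_db(segment, sample_rate)
peak_frequency = max(spectrum, key=lambda pair: pair[1])[0]
```

    With 256 samples at 8 kHz the analysis frequencies are spaced 31.25 Hz apart, so the 500 Hz tone falls exactly on one of them and the spectrum peaks there.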

    Analysis of the spectra of speech signals makes it possible to establish the shape of the envelope and to identify the formant frequency regions. Since the position and width of the formant regions are fundamentally important for speech recognition, special programs based on linear prediction or cepstral analysis have been created to determine the formant bands in a speech signal accurately and to recognize them automatically.

    In addition, since the intonation of an utterance is determined by changes in the phonation frequency, extracting the fundamental frequency of phonation from the recorded signal and establishing how it varies with time are of fundamental importance.

    Fig. 3. Spectral distribution of the average power of a speech signal

    For an integral assessment of the properties of a speech signal, the power spectrum can be calculated and the power spectral density distribution constructed. For a speech signal it is shown in Figure 3, which makes it possible to establish that the main energy of the speech signal is concentrated in the 250 ... 1000 Hz band and that above 500 Hz the spectrum falls off towards high frequencies at a rate of 7 dB/oct.
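    To give a feel for what a slope of 7 dB per octave means, the sketch below evaluates an idealized straight-line approximation of the spectrum. It is a simplification made for illustration: the curve is assumed flat up to 500 Hz and to fall at exactly 7 dB per octave above it, whereas the measured curve in Figure 3 is smoother. The function name and its parameters are hypothetical.

```python
import math

def relative_level_db(freq_hz, corner_hz=500.0, slope_db_per_octave=7.0):
    """Level relative to the corner frequency for an idealized roll-off."""
    if freq_hz <= corner_hz:
        return 0.0
    octaves_above_corner = math.log2(freq_hz / corner_hz)
    return -slope_db_per_octave * octaves_above_corner

levels = {f: relative_level_db(f) for f in (500, 1000, 2000, 4000, 8000)}
print(levels)  # {500: 0.0, 1000: -7.0, 2000: -14.0, 4000: -21.0, 8000: -28.0}
```

    Under this approximation the spectral density at 4 kHz lies about 21 dB below its value at 500 Hz.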

    Spectral analysis also makes it possible to construct a distribution curve of the amplitude composition of speech, which is very important for sound recording practice. An example for the 1000 ... 1400 Hz band is shown in Figure 4 (for other bands the distributions are similar).

    The distribution curve shows that amplitudes with a level of 45 dB make up more than 80% of the speech stream, while amplitudes with levels of 70 dB and above make up less than 10%.

    This means that when processing speech recordings, the desire to "clean up the noise" can lead to the loss of a significant part of the information, since low amplitude levels are mainly associated with consonants, and it is the consonants that carry the main semantic load in speech.

    Fig. 4. Amplitude composition of speech in the 1000 ... 1400 Hz band

    In addition to one-dimensional (amplitude-frequency) spectra, modern algorithms make it possible to construct three-dimensional (cumulative) spectra of any speech signal (for example, 3D Frequency Analysis in the WaveLab editor, etc.), in which time is plotted along one axis, frequency along the second and amplitude along the third (Figure 5). Such spectra provide much more information, not only about the spectral composition of the signal but also about how it changes in time.

    Three-dimensional spectra are widely used in the study of various acoustic signals; for the analysis of speech signals, however, the most widespread are three-dimensional spectra of a special form, spectrograms.

    In 1940, Bell Labs (USA) built a device called the "visible speech spectrograph", which made it possible to represent the speech spectrum in three-dimensional form, though constructed somewhat differently from the usual three-dimensional spectrum.

    It is a kind of "top view" of a three-dimensional spectrum: time is plotted on the abscissa, frequency on the ordinate, and amplitude is shown by the intensity of the color (the more intense the color, the greater the amplitude).

    Figure 6 shows a spectrogram of the same speech signal whose three-dimensional spectrum is given in Figure 5.

    Fig. 5. Three-dimensional (cumulative) spectrum of the speech signal

    Spectrograms can be narrowband, broadband or auditory. The choice of the number of samples, i.e. of the duration of the analyzed signal segment, determines the frequency resolution (i.e. the spacing between analysis frequencies).

    It is impossible to obtain "good" resolution in both frequency and time simultaneously, since the two are related by Δf · Δt = const (by analogy with quantum mechanics, this relation is called the "uncertainty principle").

    The better the frequency resolution, the worse the time resolution, and vice versa. The frequency resolution is therefore inversely proportional to the duration of the time window of the Fourier transform (for example, with a frequency resolution of 100 Hz, the time resolution is 1/100 s = 10 ms).

    Fig. 6. Speech signal spectrogram

    In the practice of analyzing speech signals, two types of spectrogram are mainly used: broadband and narrowband (Figures 7a, 7b). Narrowband spectrograms use an analysis bandwidth of 45 Hz, which is lower than the lowest phonation frequencies of the voice; with such fine frequency resolution, the harmonics of the voice source are clearly visible along the vertical axis.

    As mentioned in previous articles, the speech signal is the result of the "convolution" (in the spectrum, the multiplication) of the sound signal created by the vocal source, for example by the modulation of the air flow as the vocal cords vibrate, with the envelope produced by the resonant properties of the vocal tract (which determines its formant structure).

    Fig. 7. a) Broadband spectrogram; b) narrowband spectrogram

    On broadband spectrograms, which usually use an analysis bandwidth of 300 Hz, vertical striations along the time axis are clearly visible; they are associated with the individual air pressure pulses produced as the vocal cords vibrate. The dark horizontal bands corresponding to the formants are also strongly emphasized.

    Therefore, depending on the goals of the analysis, either broadband spectrograms are used (individual air pulses and formants are emphasized) or narrowband ones, in which the overtones of the voice source are highlighted.
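    The difference between the two types comes down to the length of the analysis window. The sketch below builds both from the same signal, taking the window duration as 1 / bandwidth. This is an approximation: the Hann window, the 50% overlap, the 8 kHz sample rate and the 100 Hz pulse train used as a crude stand-in for a voiced sound are all assumptions, and a slow direct transform is used instead of an FFT to keep the example self-contained.

```python
import cmath
import math

def spectrogram(samples, sample_rate, bandwidth_hz):
    """Short-time magnitude spectra with a window of about 1 / bandwidth_hz seconds.

    Returns (frames, frequency step in Hz, time step in seconds).
    """
    n = max(2, round(sample_rate / bandwidth_hz))   # window length in samples
    hop = max(1, n // 2)                            # 50% overlap (arbitrary choice)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    # Precomputed complex exponentials for each analysis frequency
    kernels = [[cmath.exp(-2j * math.pi * k * i / n) for i in range(n)]
               for k in range(n // 2 + 1)]
    frames = []
    for start in range(0, len(samples) - n + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(n)]
        frames.append([abs(sum(c * e for c, e in zip(chunk, row)))
                       for row in kernels])
    return frames, sample_rate / n, hop / sample_rate

# Assumed test signal: 0.25 s pulse train with a 100 Hz repetition rate
sample_rate = 8000
signal = [1.0 if i % 80 == 0 else 0.0 for i in range(2000)]

narrow, narrow_df, narrow_dt = spectrogram(signal, sample_rate, 45)    # narrowband
broad, broad_df, broad_dt = spectrogram(signal, sample_rate, 300)      # broadband
```

    The narrowband version has many frequency points per frame but few frames, so the harmonics at multiples of 100 Hz can be separated. The broadband version has few frequency points but many frames, so the individual pulses become visible in time.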

    In this case it is also possible to trace the change in the fundamental frequency of phonation over time, which, as noted above, is of great importance in assessing the melodic pattern of speech.

    In addition, the spectra obtained make it possible to estimate the distribution of energy in time.

    However, neither broadband nor narrowband spectrograms take into account the specific features of the spectral analysis of the signal that is performed on the basilar membrane, in the inner part of the peripheral auditory system.

    Therefore, in recent years, a technique for constructing "auditory" spectrograms has been developed that takes the latest results in psychoacoustics into account.

    These spectrograms are constructed with filters of different passbands, whose widths correspond to the widths of the "critical bands" of hearing (that is, the widths of the auditory filters that perform the spectral analysis of sounds on the basilar membrane).

    The width of the critical bands depends on frequency; this dependence roughly corresponds to the width of one-third octave bands.

    At low frequencies (the first 4 ... 5 critical bands), such a spectrogram therefore processes the signal with narrowband frequency resolution.

    At high frequencies the critical bands become much wider, which corresponds to a broadband spectrogram, i.e. to very fine resolution in time.
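    How the filter width, and with it the time resolution, changes with frequency can be estimated from the one-third octave approximation mentioned above. The sketch uses only that approximation, not a measured critical-band scale, and takes the time resolution as 1 / bandwidth; the function names and the list of centre frequencies are assumptions made for the example.

```python
def third_octave_bandwidth(centre_hz):
    """Width in Hz of a one-third octave band around centre_hz."""
    return centre_hz * (2 ** (1 / 6) - 2 ** (-1 / 6))

def time_resolution_ms(bandwidth_hz):
    """Approximate time resolution of a filter, taken as 1 / bandwidth."""
    return 1000.0 / bandwidth_hz

for centre in (125, 250, 500, 1000, 2000, 4000, 8000):
    width = third_octave_bandwidth(centre)
    print(f"{centre:5d} Hz: band {width:7.1f} Hz, "
          f"time resolution {time_resolution_ms(width):5.1f} ms")
```

    In this approximation the band around 200 Hz is about 46 Hz wide, comparable to the 45 Hz of a narrowband spectrogram, while around 1300 Hz it reaches about 300 Hz, the value used for broadband spectrograms.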

    Thus, the auditory spectrogram reflects the perception and processing of the speech signal in the auditory system much more accurately: at low frequencies attention is concentrated on the individual harmonics, while at high frequencies the harmonics are assessed as a whole but the dynamics of their envelope over time are tracked precisely, just as happens when pitch is evaluated.

    As a result, in the low-frequency region the ear estimates the fundamental frequency of phonation and its first overtones and determines the pitch of the voice from them; in the upper region the ear accurately follows the change of the envelope over time, which allows it to pick out the formant pattern. This pattern serves as the basic information for the higher parts of the brain when they determine the phonetic meaning of individual phonemes, syllables, etc.

    Thus, when the acoustic parameters of a speech signal are analyzed in modern specialized programs, the following characteristics are assessed:

    the levelgram and all related parameters (dynamic range, distribution of instantaneous signal values, current power, etc.);
    the one-dimensional spectrum (distribution of formant regions);
    the three-dimensional spectrum (change in the shape of the envelope over time);
    spectrograms (broadband, narrowband, auditory), from which one can obtain such characteristics as the change of the fundamental phonation frequency in time, changes in the formant regions, the distribution of the harmonics of the voice source, the temporal structure of the sound pressure pulses, etc.

    In addition, a number of programs can calculate the nonlinear masking of speech signal components, remove inaudible components, calculate the distribution of formant bands taking into account their width and quality factor, etc.

    The general picture of speech signal analysis as usually performed in modern computer programs is shown in Figure 8.

    Fig. 8. An example of speech signal analysis
