Speech-To-Text Model using Deep Learning with Spectrograms

Sanjeeva Rao Palla
Apr 19, 2020

Objective: Extract text from a given speech signal.

Fig 1 : Speech-To-Text

A Brief History of Speech Recognition through the Decades

You must be quite familiar with speech recognition systems. They are ubiquitous these days, from Apple's Siri to Google Assistant. These systems are relatively recent arrivals, though, brought about by rapid advances in technology.

Did you know that the exploration of speech recognition goes all the way back to the 1950s? That's right: these systems have been around for roughly seven decades! We have prepared a neat illustrated timeline for you to quickly understand how speech recognition systems have evolved over the decades:

Fig 2 : History of Speech-To-Text through the Decades

Introduction to Speech Processing

What is Speech Processing?

Speech processing is the study of speech signals and the methods used to process them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals.

What is an Audio Signal?

An audio signal is a representation of sound, typically as a varying electrical voltage for analog signals or as a sequence of binary numbers for digital signals.

Different Feature Extraction Techniques for an Audio Signal

The first step in speech recognition is to extract features from the audio signal; these features are what we will later feed into our model. Let's walk through the different ways of extracting features from an audio signal.

Spectrogram

A common step in speech feature extraction is frequency (spectral) analysis. Human speech can be considered fairly stationary over an interval of 20-25 ms, so the signal is analyzed in successive narrow time frames of 20-25 ms width. The spectral analysis of each frame is carried out by computing the Discrete Fourier Transform (DFT) of its samples; stacking these per-frame spectra over time yields the spectrogram.

Fig 3 : Spectrogram
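As a quick illustration, here is a minimal sketch of computing a spectrogram in Python with librosa. The file path is a placeholder, and the 25 ms window with a 10 ms hop is just a typical choice at a 16 kHz sampling rate:

import numpy as np
import librosa

# Load a clip at 16 kHz (the path is a placeholder)
y, sr = librosa.load("audio.wav", sr=16000)

# Frame the signal into 25 ms windows (400 samples) with a 10 ms hop
# (160 samples) and compute the DFT of each frame
stft = librosa.stft(y, n_fft=400, win_length=400, hop_length=160)

# The magnitude of the DFT per frame, stacked over time, is the spectrogram
spectrogram = np.abs(stft)
log_spectrogram = librosa.amplitude_to_db(spectrogram)  # usually viewed in dB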

Mel Frequency Cepstral Coefficients (MFCC)

Mel Frequency Cepstral Coefficients (MFCCs) are the best-known and most popular features. They are based on the known variation of the human ear's critical bandwidths with frequency, captured by the Mel-frequency scale, which is linear below 1000 Hz and logarithmic above 1000 Hz.

Fig 4 : MFCC feature extraction flow
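Here is a similar minimal sketch of extracting MFCC features with librosa; 13 coefficients per frame is a common choice for speech, and the file path is again a placeholder:

import librosa

y, sr = librosa.load("audio.wav", sr=16000)

# 13 MFCCs per frame; librosa applies the mel filterbank internally
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)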

Time Domain

In the time domain, the audio signal is represented by amplitude as a function of time. In simple words, it is a plot of amplitude against time, and the features are the amplitude values recorded at successive time instants.

Fig 5 : Time vs Amplitude
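A minimal sketch of this time-domain view, plotting the raw waveform (the file path is a placeholder):

import numpy as np
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("audio.wav", sr=16000)

t = np.arange(len(y)) / sr  # time axis in seconds
plt.plot(t, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()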

Frequency Domain

In the frequency domain, the audio signal is represented by amplitude as a function of frequency. Simply put — it is a plot between frequency and amplitude. The features are the amplitudes recorded at different frequencies.

Fig 6 : Frequency vs Amplitude
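And a minimal sketch of the frequency-domain view, using NumPy's real FFT to get one amplitude per frequency bin (the file path is a placeholder):

import numpy as np
import librosa

y, sr = librosa.load("audio.wav", sr=16000)

# The real FFT gives one amplitude per discrete frequency bin
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)  # frequency axis in Hz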

Data Set

TensorFlow recently released the Speech Commands Dataset. It includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people. We'll build a speech recognition system that understands simple spoken commands.

You can download the dataset from the link below: https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data
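After extracting the Kaggle archive, the training audio sits in one sub-folder per word. Here is a small sketch that counts the clips per command; the train/audio path assumes the competition's folder layout:

import os

data_dir = "train/audio"  # assumes the Kaggle archive is extracted here

for label in sorted(os.listdir(data_dir)):
    label_dir = os.path.join(data_dir, label)
    if os.path.isdir(label_dir):
        wavs = [f for f in os.listdir(label_dir) if f.endswith(".wav")]
        print(label, len(wavs))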

Speech-To-Text Model

Architecture

Fig 7 : Speech-To-Text Model Architecture

Procedure:

This experiment was done with spectrogram features extracted from each audio file, using only 10 of the 30 words:

words = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

We used 23,455 audio files for the Speech-To-Text model.

Fig 8 : Number of samples for each command
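The exact STFT/mel parameters that produced the (128, 48) spectrograms reported below are not stated here, so this sketch picks parameters that reproduce that shape (128 mel bands, a 334-sample hop on 1-second 16 kHz clips); the data_dir path is a placeholder:

import os
import numpy as np
import librosa

words = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
data_dir = "train/audio"  # placeholder path to the extracted dataset

X, labels = [], []
for idx, word in enumerate(words):
    word_dir = os.path.join(data_dir, word)
    for fname in os.listdir(word_dir):
        if not fname.endswith(".wav"):
            continue
        y, sr = librosa.load(os.path.join(word_dir, fname), sr=16000)
        y = np.pad(y, (0, max(0, 16000 - len(y))))[:16000]  # force 1 s clips
        # 128 mel bands and a 334-sample hop give a (128, 48) log spectrogram
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=334)
        X.append(librosa.power_to_db(S)[:, :48])
        labels.append(idx)

X = np.array(X)           # shape: (number_of_files, 128, 48)
labels = np.array(labels)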

Data Splitting: Train, Cross-Validation, and Test

The data was divided into three parts, train, cross-validation, and test, with 60%, 20%, and 20% of the data respectively.

Each spectrogram size: (128, 48)

Train data size: (14073, 128, 48)

Cross-Validation data size: (4691, 128, 48)

Test data size: (4691, 128, 48)
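A minimal sketch of this 60/20/20 split with scikit-learn, reusing the X and labels arrays from the sketch above; stratifying by command and fixing the random seed are assumptions, not something the article specifies:

from sklearn.model_selection import train_test_split

# First carve out 60% for training, then split the rest 50/50 into CV and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, labels, test_size=0.4, stratify=labels, random_state=42)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)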

Fig 9 : Speech-To-Text Model Summary
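The actual layer configuration appears only in the model summary above, so the following is just one plausible Keras sketch of a CNN classifier over (128, 48, 1) spectrogram inputs with a 10-way softmax output, not the exact architecture used here:

import tensorflow as tf

num_classes = 10  # the 10 command words

# A plausible CNN; the real layers are shown in the model summary (Fig 9)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 48, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Spectrograms need a trailing channel axis before training
history = model.fit(X_train[..., None], y_train,
                    validation_data=(X_cv[..., None], y_cv),
                    epochs=20, batch_size=64)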

Results

Fig 10 : Train and Cross-Validation Loss
Fig 11 : Train and Cross-Validation Accuracy
Fig 12 : Model Results
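To close the loop, here is a hedged sketch of predicting a command for a new recording, reusing the same (assumed) preprocessing as the training sketch above; test.wav is a placeholder:

import numpy as np
import librosa

def predict_command(path, model, words):
    # Apply the same preprocessing used for training
    y, sr = librosa.load(path, sr=16000)
    y = np.pad(y, (0, max(0, 16000 - len(y))))[:16000]
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=334)
    S = librosa.power_to_db(S)[:, :48]
    probs = model.predict(S[None, ..., None])[0]  # add batch and channel axes
    return words[int(np.argmax(probs))]

print(predict_command("test.wav", model, words))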

Conclusion

Finally, we built our own speech-to-text model that can identify simple spoken commands. With more data and greater computational resources, we could build a much better model with better results.

Source Code

You can download the source code from the link below: https://github.com/sanjeevpalla/Speech-To-Text

