A simple introduction to transfer learning with audio detection

We humans solve a new challenge by applying the knowledge we acquired while solving similar tasks in the past. The more similar the new task is to a previous one, the more easily we can solve it.

Transfer learning is a machine learning technique built on the same idea: a model trained on one task is used as the starting point for another, similar task.

Transfer learning is widely used in computer vision, where it typically relies on a pre-trained model: a model already trained on a large dataset similar to the problem we want to solve. Training a model on such a large dataset is computationally expensive, and transfer learning lets us avoid most of that cost.

What is transfer learning?

Before moving further, let us understand what neural networks and deep neural networks are, in simple terms.

What is a deep neural network?


Neural networks are algorithms loosely inspired by the human brain. A neural network consists of layers, and each layer has a certain number of nodes. Nodes in adjacent layers are connected by weighted links, and each node applies a mathematical function (an activation function) to the data passing through it. The first and last layers are known as the input and output layers respectively, and all the layers in between are known as hidden layers. When the network is deep, i.e., when there are many hidden layers between the input and output layers, we call it a deep neural network.
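
To make this concrete, here is a minimal sketch of a deep neural network in Keras. The input size, the hidden-layer sizes, and the ten output classes are all arbitrary choices for illustration:

```python
# A minimal sketch of a deep neural network in Keras. The input size,
# hidden-layer sizes, and class count below are arbitrary illustrations.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),            # input layer, e.g. a flattened 28x28 image
    layers.Dense(128, activation="relu"),    # hidden layer 1
    layers.Dense(64, activation="relu"),     # hidden layer 2
    layers.Dense(32, activation="relu"),     # hidden layer 3
    layers.Dense(10, activation="softmax"),  # output layer, e.g. 10 classes
])
model.summary()  # lists the layers and the weights connecting them
```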

Regardless of the dataset, most deep neural networks learn similar features in their first few layers, and these learned features tend to carry over to many similar tasks.

For example, in image classification, irrespective of the input images, the first few layers always learn to detect curves, dots, and other minute details. Such first-layer features apply to virtually any similar task, which is why they are known as generic features.


In transfer learning, we transfer the knowledge of these generic features to a second, target network that has to solve the target task. This technique works well only if the features learned on the base data are also relevant to the target data.

In practice, very few people train a model from scratch, since doing so requires a large dataset and a high computational budget.
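
As an illustration, here is one hedged sketch of the transfer itself in Keras. It assumes a hypothetical `base_model.keras` file: a functional-API Keras model already trained on a large base dataset; the five target classes are also made up:

```python
# A hedged sketch of transferring generic features, assuming a hypothetical
# `base_model.keras` file: a functional-API Keras model already trained
# on a large base dataset.
import tensorflow as tf
from tensorflow.keras import layers

base_model = tf.keras.models.load_model("base_model.keras")  # hypothetical path

# Reuse everything except the base model's task-specific output layer
# (here assumed to be the last layer).
feature_extractor = tf.keras.Model(
    inputs=base_model.input,
    outputs=base_model.layers[-2].output,
)
feature_extractor.trainable = False  # freeze the transferred generic features

# Attach a fresh output layer for the target task (5 classes, made up).
target_model = tf.keras.Sequential([
    feature_extractor,
    layers.Dense(5, activation="softmax"),
])
target_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
```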

When to use transfer learning?

How we apply transfer learning depends on the size of the target dataset and on how similar the target data is to the base data. Four cases are commonly distinguished.

Case 1: Large dataset, different from the base data

Since the dataset is large, we can train the model from scratch. But if computational cost is a concern, we can still start from a pre-trained model despite the data dissimilarity, freezing the base layers and training the rest.

Case 2: Small dataset, different from the base data

This is the most difficult situation, because it is hard to decide how many layers to freeze and how many to train. If we train the deeper layers, the model may overfit; if we train only the layers at the end, the model may not learn all the features it needs. We can also try other techniques such as data augmentation (see the sketch below) and ensemble modeling.
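
For example, a small target dataset can be stretched with on-the-fly augmentation. The sketch below uses Keras preprocessing layers; the chosen transformations and their parameter values are illustrative, not prescriptive:

```python
# A hedged sketch of data augmentation with Keras preprocessing layers;
# the transformations and their parameters are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),  # rotate up to +/-10% of a full circle
    layers.RandomZoom(0.1),
])

# Applied on the fly during training, e.g. as the first block of the model:
# model = tf.keras.Sequential([augmentation, pretrained_base, new_head])
```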

Case 3: Large dataset, similar to the base data

Here we can use any strategy we want. Since the dataset is large, we can train the model from scratch, or we can use the pre-trained model and freeze its base layers.

Case 4: Small dataset, similar to the base data

Here we can freeze the base layers of the pre-trained model and only train the last few layers.
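
Across all four cases, the practical knob is the same: how many layers to freeze versus train. Here is a minimal sketch of that knob, assuming `base_model` is a pre-trained Keras model and `num_trainable` is a hypothetical parameter tuned per case:

```python
# A minimal sketch of the freeze/train knob behind the four cases above,
# assuming `base_model` is a pre-trained Keras model; `num_trainable`
# is a hypothetical parameter you would tune per case.
import tensorflow as tf

def freeze_base_layers(base_model: tf.keras.Model, num_trainable: int) -> None:
    """Freeze every layer except the last `num_trainable` ones."""
    for layer in base_model.layers[:-num_trainable]:
        layer.trainable = False   # keep the generic features as they are
    for layer in base_model.layers[-num_trainable:]:
        layer.trainable = True    # adapt only these layers to the target task

# Case 4 (small, similar dataset): train only the last few layers, e.g.
# freeze_base_layers(base_model, num_trainable=3)
```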

Why transfer learning?

  1. There is no need to train a model from scratch, so the computational cost is lower.
  2. It improves the accuracy of the target model.
  3. Fewer parameters need to be trained.

Transfer learning in a real-world application:

Let us look at a real-world application of transfer learning. Suppose we have recordings of snoring and of random other sounds, and our task is to classify each recording as snore or not-snore.

Audio data is usually converted into spectrograms before it is fed into a model for training; in other words, we turn the sound classification problem into an image classification problem for convenience.
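
As a concrete sketch, one common way to do this conversion uses librosa to compute a log-mel spectrogram and save it as an image; the file name and all parameter values here are hypothetical:

```python
# A hedged sketch of the audio-to-spectrogram step using librosa;
# "snore_001.wav" and every parameter value are hypothetical choices.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio, sr = librosa.load("snore_001.wav", sr=16000)   # load and resample
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)        # log scale suits hearing

librosa.display.specshow(log_mel, sr=sr, x_axis="time", y_axis="mel")
plt.savefig("snore_001.png")  # the image the classifier will see
```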

As we saw earlier, the features learned at the beginning of an image classification model are generic, so we can take a pre-trained image classification model such as MobileNet or Inception and apply transfer learning. Depending on the dataset size, we can then decide how many layers to train and how many to freeze.
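
Putting it all together, here is a minimal sketch of the snore detector, assuming the spectrograms were saved as 224x224 RGB images and that the dataset is small enough to warrant freezing the whole base (Case 4); the MobileNetV2 base and the simple head are illustrative choices:

```python
# A minimal sketch of the snore detector, assuming the spectrograms were
# saved as 224x224 RGB images and the dataset is small (Case 4 above);
# the MobileNetV2 base and the simple head are illustrative choices.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,       # drop the ImageNet-specific output layer
    weights="imagenet",      # start from pre-trained generic features
)
base.trainable = False       # small dataset: freeze the whole base

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # snore vs. not-snore
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(spectrogram_images, labels, epochs=5)  # hypothetical data
```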
