Bird Voice Recognition Using CNN and MFCC
Academic paper — multi-class bird species identification from field audio recordings
Author
Rishabh Bhartiya
Date
March 2022
Institution
University of Milan
Supervisor
Prof. Stavros Ntalampiras
Abstract
This paper applies convolutional neural network (CNN) architectures combined with Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to the problem of automatic bird species identification from field recordings. Working with the British Birdsong Dataset (88 species, 264 recordings from the Xeno-Canto collection), the study extracts multiple spectral representations — MFCC, Mel-spectrogram, Chromagram, Spectral Centroid, Spectral Roll-off — and benchmarks classical ML classifiers (KNN, Decision Tree, Random Forest, Logistic Regression, GaussianNB) against a deep learning approach. The 88-class identification task on a severely data-limited corpus demonstrates the challenges of acoustic species classification in real-world bioacoustic settings and motivates future directions including data augmentation, transfer learning, and real-time inference.
88-class bird species identification from field recordings — CNN achieves 40% vs 1.1% random chance
Extracted six spectral features: MFCC, Mel-spectrogram, Spectrogram, Chromagram, Spectral Centroid, Spectral Roll-off
Supervised by Prof. Stavros Ntalampiras — specialist in bioacoustics and audio ML
Identified data scarcity as the core bottleneck, motivating transfer learning future work
Domain Context
Automatic bird species identification has significant applications in ornithology, ecosystem monitoring, and biodiversity conservation. Birds are indicators of ecological health, and their vocalizations provide a non-invasive monitoring channel. This work investigates whether CNN-based audio classification — proven effective in urban sound and speech recognition — transfers to the bioacoustics domain.
The Core Challenge: Data Scarcity
The British Birdsong Dataset contains only 264 recordings across 88 species, an average of just 3 recordings per class. This is an extremely low-data regime for deep learning, which typically requires thousands of samples per class. The experimental results must be interpreted in this context.
Dataset
- Source: British Birdsong Dataset (Xeno-Canto subset)
- Size: 264 audio recordings in .flac format
- Classes: 88 bird species commonly heard in the United Kingdom
- Metadata: species, genus, country, latitude/longitude, recording type, license
- Recording types: call, song, flight call, alarm call, juvenile, drumming, subsong
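For orientation, a minimal loading sketch; the metadata filename and column names here are hypothetical placeholders rather than the dataset's exact schema:
import pandas as pd
import librosa

# Hypothetical file/column names for illustration only.
meta = pd.read_csv('birdsong_metadata.csv')
for _, row in meta.iterrows():
    # Librosa decodes .flac via its soundfile backend.
    audio, sr = librosa.load(f"songs/{row['file_id']}.flac", res_type='kaiser_fast')
    # Per-recording labels: row['species'], row['genus'], row['country'], ...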
Feature Engineering
Six spectral representations were extracted using Librosa:
- MFCC (40 coefficients) — compact spectral representation, primary classification feature
- Mel-Spectrogram — non-linear frequency scale, captures harmonic structure
- Spectrogram — raw time-frequency representation
- Chromagram — 12-bin pitch class distribution
- Spectral Centroid — spectral centre of mass, correlating with perceived brightness
- Spectral Roll-off — frequency below which a fixed fraction (typically 85%) of spectral energy lies
import librosa
import numpy as np

def extract_bird_features(file_path: str) -> np.ndarray:
    # Decode the recording; 'kaiser_fast' trades resampling quality for speed.
    audio, sr = librosa.load(file_path, res_type='kaiser_fast')
    # Primary feature: 40 MFCC coefficients per frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
    # Collapse the time axis to a fixed-length 40-dim vector.
    mfcc_scaled = np.mean(mfcc.T, axis=0)
    return mfcc_scaled
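The remaining representations come from the same Librosa feature API; a minimal sketch, assuming each feature is summarized by its mean over time (the paper's exact aggregation beyond MFCC is not restated here):
import librosa
import numpy as np

def extract_all_features(file_path: str) -> np.ndarray:
    # Hypothetical extension of extract_bird_features covering all six views.
    audio, sr = librosa.load(file_path, res_type='kaiser_fast')
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)        # (40, T)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)          # (128, T)
    spec = np.abs(librosa.stft(audio))                            # linear spectrogram
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)          # (12, T) pitch classes
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)  # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)    # (1, T)
    # Mean over time, mirroring the MFCC pipeline above, then concatenate.
    return np.concatenate([f.mean(axis=1)
                           for f in (mfcc, mel, spec, chroma, centroid, rolloff)])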
Model Architecture
A progressive Conv1D CNN with increasing filter counts was implemented: three Conv1D layers (32 → 64 → 128 filters), each followed by 50% dropout, then Flatten → Dense with softmax over the 88 classes. Training: 100 epochs, batch size 32, Adam optimizer, categorical cross-entropy loss.
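A minimal Keras sketch of this stack; the layer sizes, dropout rate, optimizer, and loss come from the description above, while the kernel size and padding are assumptions the summary does not specify:
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_features: int = 40, n_classes: int = 88) -> keras.Model:
    # Progressive Conv1D stack: 32 -> 64 -> 128 filters, 50% dropout after each.
    model = keras.Sequential([
        layers.Input(shape=(n_features, 1)),  # 40 MFCCs treated as a 1-D sequence
        layers.Conv1D(32, kernel_size=3, padding='same', activation='relu'),
        layers.Dropout(0.5),
        layers.Conv1D(64, kernel_size=3, padding='same', activation='relu'),
        layers.Dropout(0.5),
        layers.Conv1D(128, kernel_size=3, padding='same', activation='relu'),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Training regime from the text: model.fit(X, y, epochs=100, batch_size=32)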
Baseline Classifiers
- K-Nearest Neighbors: 16.98%
- Decision Tree: 32.07%
- Logistic Regression: 3.77%
- Gaussian Naïve Bayes: 5.66%
- Random Forest: 16.98%
- CNN: 40% — best performer on this severely data-limited task
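These baselines can be reproduced with a straightforward scikit-learn loop over the MFCC vectors; a sketch with default hyperparameters and an assumed 80/20 split (the paper's exact split is not restated here):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

def benchmark_baselines(X, y):
    # Hypothetical harness: fit each classical model on the MFCC vectors.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    models = {
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'GaussianNB': GaussianNB(),
        'Random Forest': RandomForestClassifier(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f'{name}: {model.score(X_te, y_te):.2%}')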
Honest Assessment of Results
40% accuracy on an 88-class problem with an average of only 3 samples per class is non-trivial; random chance would give roughly 1.1%. That said, the results fall well short of production-grade bioacoustic systems, which typically rely on far larger datasets (the BirdCLEF competition uses hundreds of thousands of recordings) and transfer learning from pre-trained audio models (VGGish, BirdNET, Wav2Vec).
The primary contribution of this work is demonstrating CNN applicability to bioacoustic classification and identifying data scarcity as the critical bottleneck — motivating future work on few-shot learning and data augmentation for wildlife audio.
Future Directions Identified
- Transfer learning from VGGish or BirdNET pre-trained models
- Audio augmentation: time-shifting, pitch-shifting, background noise injection (sketched after this list)
- Real-time inference for in-field deployment on embedded devices
- Behaviour analysis beyond species identification (self-scratching, flight patterns)
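A minimal sketch of the three augmentations named above, using standard NumPy/Librosa operations; the shift ranges and noise level are illustrative assumptions, not values from the paper:
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int) -> np.ndarray:
    # Apply one randomly chosen augmentation per call.
    choice = np.random.randint(3)
    if choice == 0:
        # Time-shift: rotate the waveform by up to +/- 0.5 s.
        shift = np.random.randint(-sr // 2, sr // 2)
        return np.roll(audio, shift)
    if choice == 1:
        # Pitch-shift by up to +/- 2 semitones.
        steps = np.random.uniform(-2, 2)
        return librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=steps)
    # Background noise injection at a small, fixed amplitude.
    noise = np.random.randn(len(audio)).astype(audio.dtype)
    return audio + 0.005 * noise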
Relevance to Current Work
This study established my foundational expertise in audio signal processing — MFCC extraction, spectrogram interpretation, Librosa workflows — which I apply directly in production at Edza.ai (TTS quality evaluation) and HacktivSpace (multi-speaker voice cloning and MFCC-based evaluation pipelines).