Bird Voice Recognition Using CNN and MFCC
Academic paper — multi-class bird species identification from field audio recordings
Author
Rishabh Bhartiya
Date
March 2022
Institution
University of Milan
Supervisor
Prof. Stavros Ntalampiras
Abstract
This paper applies convolutional neural network (CNN) architectures combined with Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to the problem of automatic bird species identification from field recordings. Working with the British Birdsong Dataset (88 species, 264 recordings from the Xeno-Canto collection), the study extracts multiple spectral representations — MFCC, Mel-spectrogram, Chromagram, Spectral Centroid, Spectral Roll-off — and benchmarks classical ML classifiers (KNN, Decision Tree, Random Forest, Logistic Regression, GaussianNB) against a deep learning approach. The 88-class identification task on a severely data-limited corpus demonstrates the challenges of acoustic species classification in real-world bioacoustic settings and motivates future directions including data augmentation, transfer learning, and real-time inference.
88-class bird species identification from field recordings — CNN achieves 40% vs 1.1% random chance
Extracted six spectral features: MFCC, Mel-spectrogram, Spectrogram, Chromagram, Spectral Centroid, Spectral Roll-off
Supervised by Prof. Stavros Ntalampiras — specialist in bioacoustics and audio ML
Identified data scarcity as the core bottleneck, motivating transfer learning future work
Domain Context
Automatic bird species identification has significant applications in ornithology, ecosystem monitoring, and biodiversity conservation. Birds are indicators of ecological health, and their vocalizations provide a non-invasive monitoring channel. This work investigates whether CNN-based audio classification — proven effective in urban sound and speech recognition — transfers to the bioacoustics domain.
The Core Challenge: Data Scarcity
The British Birdsong Dataset contains only 264 recordings across 88 species, an average of just 3 recordings per class. This is an extremely low-data regime for deep learning, which typically requires thousands of samples per class. The experimental results must be interpreted in this context.
Dataset
- Source: British Birdsong Dataset (Xeno-Canto subset)
- Size: 264 audio recordings in .flac format
- Classes: 88 bird species commonly heard in the United Kingdom
- Metadata: species, genus, country, latitude/longitude, recording type, license
- Recording types: call, song, flight call, alarm call, juvenile, drumming, subsong
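For orientation, a minimal loading sketch; the metadata filename and column names here are hypothetical placeholders rather than the dataset's exact schema:
import pandas as pd
import librosa

# Hypothetical file/column names for illustration only.
meta = pd.read_csv('birdsong_metadata.csv')
for _, row in meta.iterrows():
    # Librosa decodes .flac via its soundfile backend.
    audio, sr = librosa.load(f"songs/{row['file_id']}.flac", res_type='kaiser_fast')
    # Per-recording labels: row['species'], row['genus'], row['country'], ...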
Feature Engineering
Six spectral representations were extracted using Librosa:
- MFCC (40 coefficients) — compact spectral representation, primary classification feature
- Mel-Spectrogram — non-linear frequency scale, captures harmonic structure
- Spectrogram — raw time-frequency representation
- Chromagram — 12-bin pitch class distribution
- Spectral Centroid — spectral centre of mass, correlating with perceived brightness
- Spectral Roll-off — frequency below which a fixed fraction (typically 85%) of spectral energy lies
import librosa
import numpy as np

def extract_bird_features(file_path: str) -> np.ndarray:
    # Decode the recording; 'kaiser_fast' trades resampling quality for speed.
    audio, sr = librosa.load(file_path, res_type='kaiser_fast')
    # Primary feature: 40 MFCC coefficients per frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
    # Collapse the time axis to a fixed-length 40-dim vector.
    mfcc_scaled = np.mean(mfcc.T, axis=0)
    return mfcc_scaled
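The remaining representations come from the same Librosa feature API; a minimal sketch, assuming each feature is summarized by its mean over time (the paper's exact aggregation beyond MFCC is not restated here):
import librosa
import numpy as np

def extract_all_features(file_path: str) -> np.ndarray:
    # Hypothetical extension of extract_bird_features covering all six views.
    audio, sr = librosa.load(file_path, res_type='kaiser_fast')
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)        # (40, T)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)          # (128, T)
    spec = np.abs(librosa.stft(audio))                            # linear spectrogram
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)          # (12, T) pitch classes
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)  # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)    # (1, T)
    # Mean over time, mirroring the MFCC pipeline above, then concatenate.
    return np.concatenate([f.mean(axis=1)
                           for f in (mfcc, mel, spec, chroma, centroid, rolloff)])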
Model Architecture
A progressive Conv1D CNN with increasing filter counts was implemented: three Conv1D layers (32 → 64 → 128 filters), each followed by 50% dropout, then Flatten → Dense with softmax over the 88 classes. Training: 100 epochs, batch size 32, Adam optimizer, categorical cross-entropy loss.
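A minimal Keras sketch of this stack; the layer sizes, dropout rate, optimizer, and loss come from the description above, while the kernel size and padding are assumptions the summary does not specify:
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_features: int = 40, n_classes: int = 88) -> keras.Model:
    # Progressive Conv1D stack: 32 -> 64 -> 128 filters, 50% dropout after each.
    model = keras.Sequential([
        layers.Input(shape=(n_features, 1)),  # 40 MFCCs treated as a 1-D sequence
        layers.Conv1D(32, kernel_size=3, padding='same', activation='relu'),
        layers.Dropout(0.5),
        layers.Conv1D(64, kernel_size=3, padding='same', activation='relu'),
        layers.Dropout(0.5),
        layers.Conv1D(128, kernel_size=3, padding='same', activation='relu'),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Training regime from the text: model.fit(X, y, epochs=100, batch_size=32)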
Baseline Classifiers
- K-Nearest Neighbors: 16.98%
- Decision Tree: 32.07%
- Logistic Regression: 3.77%
- Gaussian Naïve Bayes: 5.66%
- Random Forest: 16.98%
- CNN: 40% — best performer on this severely data-limited task
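These baselines can be reproduced with a straightforward scikit-learn loop over the MFCC vectors; a sketch with default hyperparameters and an assumed 80/20 split (the paper's exact split is not restated here):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

def benchmark_baselines(X, y):
    # Hypothetical harness: fit each classical model on the MFCC vectors.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    models = {
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'GaussianNB': GaussianNB(),
        'Random Forest': RandomForestClassifier(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f'{name}: {model.score(X_te, y_te):.2%}')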
Honest Assessment of Results
40% accuracy on an 88-class problem with an average of only 3 samples per class is non-trivial; random chance would give roughly 1.1%. That said, the results fall well short of production-grade bioacoustic systems, which typically rely on far larger datasets (the BirdCLEF competition uses hundreds of thousands of recordings) and transfer learning from pre-trained audio models (VGGish, BirdNET, Wav2Vec).
The primary contribution of this work is demonstrating CNN applicability to bioacoustic classification and identifying data scarcity as the critical bottleneck — motivating future work on few-shot learning and data augmentation for wildlife audio.
Future Directions Identified
- Transfer learning from VGGish or BirdNET pre-trained models
- Audio augmentation: time-shifting, pitch-shifting, background noise injection (sketched after this list)
- Real-time inference for in-field deployment on embedded devices
- Behaviour analysis beyond species identification (self-scratching, flight patterns)
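A minimal sketch of the three augmentations named above, using standard NumPy/Librosa operations; the shift ranges and noise level are illustrative assumptions, not values from the paper:
import numpy as np
import librosa

def augment(audio: np.ndarray, sr: int) -> np.ndarray:
    # Apply one randomly chosen augmentation per call.
    choice = np.random.randint(3)
    if choice == 0:
        # Time-shift: rotate the waveform by up to +/- 0.5 s.
        shift = np.random.randint(-sr // 2, sr // 2)
        return np.roll(audio, shift)
    if choice == 1:
        # Pitch-shift by up to +/- 2 semitones.
        steps = np.random.uniform(-2, 2)
        return librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=steps)
    # Background noise injection at a small, fixed amplitude.
    noise = np.random.randn(len(audio)).astype(audio.dtype)
    return audio + 0.005 * noise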
Relevance to Current Work
This study established my foundational expertise in audio signal processing — MFCC extraction, spectrogram interpretation, Librosa workflows — which I apply directly in production at Edza.ai (TTS quality evaluation) and HacktivSpace (multi-speaker voice cloning and MFCC-based evaluation pipelines).