Course Project · Audio AI

Bird Voice Recognition Using CNN and MFCC

Academic paper — multi-class bird species identification from field audio recordings

Authors

Rishabh Bhartiya

Date

March 2022

Institution

Univ. Milano

Supervisor

Prof. Stavros Ntalampiras

Abstract

This paper applies convolutional neural network (CNN) architectures combined with Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to the problem of automatic bird species identification from field recordings. Working with the British Birdsong Dataset (88 species, 264 recordings from the Xeno-Canto collection), the study extracts multiple spectral representations — MFCC, Mel-spectrogram, Chromagram, Spectral Centroid, Spectral Roll-off — and benchmarks classical ML classifiers (KNN, Decision Tree, Random Forest, Logistic Regression, GaussianNB) against a deep learning approach. The 88-class identification task on a severely data-limited corpus demonstrates the challenges of acoustic species classification in real-world bioacoustic settings and motivates future directions including data augmentation, transfer learning, and real-time inference.

88-class bird species identification from field recordings — CNN achieves 40% vs 1.1% random chance

Extracted 6 spectral features: MFCC, Mel-spectrogram, Spectrogram, Chromagram, Spectral Centroid, Roll-off

Supervised by Prof. Stavros Ntalampiras — specialist in bioacoustics and audio ML

Identified data scarcity as the core bottleneck, motivating transfer learning future work


Domain Context

Automatic bird species identification has significant applications in ornithology, ecosystem monitoring, and biodiversity conservation. Birds are indicators of ecological health, and their vocalizations provide a non-invasive monitoring channel. This work investigates whether CNN-based audio classification — proven effective in urban sound and speech recognition — transfers to the bioacoustics domain.

The Core Challenge: Data Scarcity

The British Birdsong Dataset contains only 264 recordings across 88 species — fewer than 3 recordings per class on average. This is an extreme low-data regime for deep learning, which typically requires thousands of samples per class. The experimental results must be interpreted in this context.

Dataset

  • Source: British Birdsong Dataset (Xeno-Canto subset)
  • Size: 264 audio recordings in .flac format
  • Classes: 88 bird species commonly heard in the United Kingdom
  • Metadata: species, genus, country, latitude/longitude, recording type, license
  • Recording types: call, song, flight call, alarm call, juvenile, drumming, subsong

Feature Engineering

Six spectral representations were extracted using Librosa:

  • MFCC (40 coefficients) — compact spectral representation, primary classification feature
  • Mel-Spectrogram — non-linear frequency scale, captures harmonic structure
  • Spectrogram — raw time-frequency representation
  • Chromagram — 12-bin pitch class distribution
  • Spectral Centroid — perceptual brightness of the sound
  • Spectral Roll-off — frequency concentration metric

import librosa
import numpy as np

def extract_bird_features(file_path: str) -> np.ndarray:
    # Load and resample; 'kaiser_fast' trades a little resampling quality for speed
    audio, sr = librosa.load(file_path, res_type='kaiser_fast')

    # Primary feature: 40 MFCCs, mean-pooled over time into a fixed-length vector
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
    mfcc_scaled = np.mean(mfcc.T, axis=0)  # shape (40,)

    return mfcc_scaled

Model Architecture

A progressive Conv1D CNN was implemented: 3 × Conv1D layers with increasing filter counts (32 → 64 → 128), each followed by 50% dropout, then Flatten → Dense (softmax over 88 classes). Training: 100 epochs, batch size 32, Adam optimizer, categorical cross-entropy loss.
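The described stack can be sketched in Keras as below. Kernel size and padding are assumptions not stated in the paper; the filter counts, dropout rate, and output layer follow the description:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 88  # one output unit per bird species
NUM_MFCC = 40     # length of the mean-pooled MFCC vector

# Progressive Conv1D stack: 32 -> 64 -> 128 filters, 50% dropout after each
model = keras.Sequential([
    keras.Input(shape=(NUM_MFCC, 1)),
    layers.Conv1D(32, kernel_size=3, activation='relu', padding='same'),
    layers.Dropout(0.5),
    layers.Conv1D(64, kernel_size=3, activation='relu', padding='same'),
    layers.Dropout(0.5),
    layers.Conv1D(128, kernel_size=3, activation='relu', padding='same'),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```

Heavy dropout at every stage is a reasonable regularization choice given fewer than 3 recordings per class.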

Classifier Benchmark

  • K-Nearest Neighbors: 16.98%
  • Decision Tree: 32.07%
  • Logistic Regression: 3.77%
  • Gaussian Naïve Bayes: 5.66%
  • Random Forest: 16.98%
  • CNN: 40% — best performer on this severely data-limited task
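The classical baselines follow the standard scikit-learn fit/score pattern. A minimal sketch of the benchmark loop, using randomly generated stand-in features rather than the actual MFCC vectors (so the scores here will not match the reported numbers):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(264, 40))     # stand-in for 264 mean-pooled MFCC vectors
y = rng.integers(0, 88, size=264)  # stand-in species labels (88 classes)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

baselines = {
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'GaussianNB': GaussianNB(),
}
scores = {name: clf.fit(Xtr, ytr).score(Xte, yte)
          for name, clf in baselines.items()}
```

With only ~3 samples per class, a stratified split is not even possible for every species, which is another symptom of the data-scarcity bottleneck discussed below.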

Honest Assessment of Results

40% accuracy on an 88-class problem with <3 samples per class on average is non-trivial — random chance would give ~1.1%. That said, the results fall significantly short of production-grade bioacoustic systems, which typically use much larger datasets (BirdCLEF competition uses hundreds of thousands of recordings) and transfer learning from pre-trained audio models (VGGish, BirdNET, Wav2Vec).

The primary contribution of this work is demonstrating CNN applicability to bioacoustic classification and identifying data scarcity as the critical bottleneck — motivating future work on few-shot learning and data augmentation for wildlife audio.

Future Directions Identified

  • Transfer learning from VGGish or BirdNET pre-trained models
  • Audio augmentation: time-shifting, pitch-shifting, background noise injection
  • Real-time inference for in-field deployment on embedded devices
  • Behaviour analysis beyond species identification (self-scratching, flight patterns)

Relevance to Current Work

This study established my foundational expertise in audio signal processing — MFCC extraction, spectrogram interpretation, Librosa workflows — which I apply directly in production at Edza.ai (TTS quality evaluation) and HacktivSpace (multi-speaker voice cloning and MFCC-based evaluation pipelines).

CNN · MFCC · Bioacoustics · Bird Sound Classification · Librosa · Spectrogram
