Course Project · Audio AI

Urban Sound Classification with Convolutional Neural Networks

Course project — 82% accuracy on UrbanSound8K using CNN + MFCC feature extraction

Authors

Rishabh Bhartiya

Date

May 2022

Institution

Università degli Studi di Milano

Supervisor

Prof. Nicolò Cesa-Bianchi

Course

Statistical Methods of Machine Learning

Abstract

This project investigates deep learning approaches for environmental audio classification, framing sound recognition as an image classification problem via Mel-Frequency Cepstral Coefficient (MFCC) feature extraction. A 7-layer Conv1D CNN was designed and trained on the UrbanSound8K benchmark dataset (8,732 samples, 10 classes), achieving 82% test accuracy, competitive with state-of-the-art results reported in the literature at the time. The study also benchmarks classical ML baselines (KNN, Decision Tree, Random Forest) against the CNN, confirming that deep feature learning substantially outperforms pipelines that pair handcrafted features with shallow classifiers on audio data. Completed under Prof. Nicolò Cesa-Bianchi for the Statistical Methods of Machine Learning course at Università degli Studi di Milano.

82% test accuracy on UrbanSound8K — competitive with state-of-the-art at submission

7-layer Conv1D CNN processing MFCC features, outperforming KNN, Decision Tree, Random Forest baselines

Supervised by Prof. Nicolò Cesa-Bianchi — one of Europe's most cited ML researchers

Foundation for current production TTS/STT pipelines at Edza.ai


Overview

This project was completed as part of the Statistical Methods of Machine Learning course, supervised by Prof. Nicolò Cesa-Bianchi — a world-renowned researcher in online learning and statistical ML theory. The task: classify 10 categories of urban environmental sounds using deep learning.

Problem Framing

Raw audio waveforms are high-dimensional and temporally complex. The key insight is to transform audio into a 2D time-frequency representation (MFCC / Mel-spectrogram) and treat classification as an image recognition problem, letting CNN architectures exploit local correlations across time and frequency.

Dataset: UrbanSound8K

  • 8,732 labelled audio clips, each ≤4 seconds
  • 10 classes: Air Conditioner, Car Horn, Children Playing, Dog Bark, Drilling, Engine Idling, Gunshot, Jackhammer, Siren, Street Music
  • Sourced from Freesound.org — real-world recordings with background noise
  • Pre-defined 10-fold cross-validation split to prevent data leakage
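
The predefined folds are recorded in the dataset's metadata CSV, so respecting the split is a matter of filtering on the fold column. A minimal sketch (the local path is an assumption; the file and column names follow the standard UrbanSound8K layout):

import pandas as pd

# assumed local path; the UrbanSound8K archive ships this metadata file
metadata = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')

test_fold = 10  # hold out one predefined fold; rotate over 1..10 for full CV
train_meta = metadata[metadata['fold'] != test_fold]
test_meta = metadata[metadata['fold'] == test_fold]

# each clip sits under audio/fold{fold}/{slice_file_name}, labelled by 'classID'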

Feature Extraction

Audio was processed using Librosa. Multiple spectral representations were extracted and compared:

  • MFCC (40 coefficients) — primary feature; captures perceptual frequency characteristics
  • Mel-Spectrogram — non-linear frequency scale matching human auditory perception
  • Chromagram — pitch class distribution, useful for tonal sounds
  • Spectral Centroid — brightness measure, distinguishes impulsive vs sustained sounds
  • Spectral Roll-off — frequency below which 85% of energy is concentrated

import librosa
import numpy as np

def extract_features(file_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load one clip and return a fixed-length, time-averaged MFCC vector."""
    # librosa resamples to 22,050 Hz by default; kaiser_fast trades a little
    # resampling quality for speed across the 8,732 clips
    audio, sr = librosa.load(file_path, res_type='kaiser_fast', mono=True)

    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    # average over the time axis so clips of different lengths map to the
    # same feature dimensionality
    mfccs_scaled = np.mean(mfccs.T, axis=0)

    return mfccs_scaled  # shape: (n_mfcc,) == (40,)
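
The remaining representations come from analogous Librosa calls. A hedged sketch follows; the time-averaging and concatenation mirror the MFCC pipeline above and are illustrative choices, not necessarily the project's exact aggregation:

def extract_spectral_features(file_path: str) -> np.ndarray:
    """Sketch: time-averaged versions of the other spectral representations."""
    audio, sr = librosa.load(file_path, res_type='kaiser_fast', mono=True)

    mel = librosa.feature.melspectrogram(y=audio, sr=sr)          # (n_mels, T)
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)          # (12, T)
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)  # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr,
                                               roll_percent=0.85) # (1, T)

    # average each over time and stack into a single descriptor vector
    return np.concatenate([f.mean(axis=1)
                           for f in (mel, chroma, centroid, rolloff)])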

CNN Architecture

A sequential 7-layer Conv1D model was designed to process 1D MFCC feature vectors:

  • 3× Conv1D layers (32 → 64 → 128 filters, kernel size 2, ReLU activation)
  • Dropout (50%) after each convolutional layer to prevent overfitting
  • Flatten layer to convert feature maps to a 1D vector
  • Dense (64, ReLU) → Dense (10, Softmax) output layer
  • Optimizer: Adam | Loss: Categorical Cross-entropy | Epochs: 100 | Batch: 32
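
A minimal Keras sketch consistent with the bullets above; the input reshaping and the commented fit call use assumed variable names, and any hyperparameter not listed above is a Keras default:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dropout, Flatten, Dense

model = Sequential([
    # the 40-dim MFCC vector is reshaped to (40, 1) so Conv1D slides
    # across adjacent coefficients
    Conv1D(32, kernel_size=2, activation='relu', input_shape=(40, 1)),
    Dropout(0.5),
    Conv1D(64, kernel_size=2, activation='relu'),
    Dropout(0.5),
    Conv1D(128, kernel_size=2, activation='relu'),
    Dropout(0.5),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),
])

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=100, batch_size=32,
#           validation_data=(X_test, y_test))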

Baseline Comparison

Three classical ML classifiers were benchmarked against the CNN using the same MFCC features:

  • K-Nearest Neighbors: ~80% accuracy
  • Decision Tree: ~65% accuracy
  • Random Forest: ~80% accuracy
  • CNN: 82% accuracy — best performer, with smoother training curve
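
A sketch of the baseline setup with scikit-learn. X_train / X_test are the time-averaged MFCC vectors from the extraction step, and the hyperparameters shown (k = 5, 100 trees) are assumptions, not recorded settings:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

baselines = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)  # y here as integer labels, not one-hot
    print(f'{name}: {clf.score(X_test, y_test):.2%}')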

Results

  • Test accuracy: 82% on UrbanSound8K
  • Training accuracy at epoch 100: ~85% (minimal overfitting)
  • Best performing classes: Gunshot, Siren (distinctive spectral signature)
  • Most confused classes: Drilling vs Jackhammer (similar repetitive waveforms)
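
The per-class observations above come from inspecting a confusion matrix; a sketch of that check, with model and data names carried over from the sketches above:

import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test.reshape(-1, 40, 1)).argmax(axis=1)
cm = confusion_matrix(y_test.argmax(axis=1), y_pred)
# rows are true classes, columns predicted; the off-diagonal mass between
# the Drilling and Jackhammer rows is where most errors concentrate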

Code

Full implementation available on GitHub: github.com/monkrishabh/SOUND8K

Connection to Current Work

The MFCC-based audio classification pipeline built here directly informs my current production work on TTS/STT systems at Edza.ai. The same Librosa feature extraction approach — MFCC computation, mel-spectrogram generation, spectral feature analysis — is used daily in production audio pipeline evaluation and model quality assessment.

CNN · MFCC · Audio Classification · Librosa · UrbanSound8K · Deep Learning
