Urban Sound Classification with Convolutional Neural Networks
Course project — 82% accuracy on UrbanSound8K using CNN + MFCC feature extraction
Authors
Rishabh Bhartiya
Date
May 2022
Institution
Università degli Studi di Milano
Supervisor
Prof. Nicolò Cesa-Bianchi
Course
Statistical Methods of Machine Learning
Abstract
This project investigates deep learning approaches for environmental audio classification, framing sound recognition as an image classification problem via Mel-Frequency Cepstral Coefficient (MFCC) feature extraction. A 7-layer Conv1D CNN was designed and trained on the UrbanSound8K benchmark dataset (8,732 clips, 10 classes), achieving 82% test accuracy — competitive with state-of-the-art results reported in the literature at the time. The study also benchmarks classical ML baselines (KNN, Decision Tree, Random Forest) against the CNN, confirming that deep feature learning substantially outperforms handcrafted-feature + shallow-classifier pipelines on audio data. Completed under Prof. Nicolò Cesa-Bianchi for the Statistical Methods of Machine Learning course at Università degli Studi di Milano.
82% test accuracy on UrbanSound8K — competitive with state-of-the-art at submission
7-layer Conv1D CNN processing MFCC features, outperforming KNN, Decision Tree, Random Forest baselines
Supervised by Prof. Nicolò Cesa-Bianchi — one of Europe's most cited ML researchers
Foundation for current production TTS/STT pipelines at Edza.ai
Overview
This project was completed as part of the Statistical Methods of Machine Learning course, supervised by Prof. Nicolò Cesa-Bianchi — a world-renowned researcher in online learning and statistical ML theory. The task: classify 10 categories of urban environmental sounds using deep learning.
Problem Framing
Raw audio waveforms are high-dimensional and temporally complex. The key insight is to transform audio into a compact time-frequency representation (MFCC / Mel-spectrogram) and treat classification the way image models do, letting convolutional architectures exploit local correlation along the frequency axis. In this project, the MFCCs are further time-averaged into fixed-length 40-dimensional vectors, which a Conv1D network then processes.
Dataset: UrbanSound8K
- 8,732 labelled audio clips, each ≤4 seconds
- 10 classes: Air Conditioner, Car Horn, Children Playing, Dog Bark, Drilling, Engine Idling, Gunshot, Jackhammer, Siren, Street Music
- Sourced from Freesound.org — real-world recordings with background noise
- Pre-defined 10-fold cross-validation split to prevent data leakage
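The predefined folds come from the dataset's metadata file, which assigns every clip to one of 10 folds so that slices of the same source recording stay together. A minimal sketch of leave-one-fold-out iteration, using a hypothetical miniature metadata frame in place of the real `UrbanSound8K.csv`:

```python
import pandas as pd

# Hypothetical miniature stand-in for UrbanSound8K.csv, which assigns
# every clip to one of 10 predefined folds.
metadata = pd.DataFrame({
    "slice_file_name": [f"clip_{i}.wav" for i in range(6)],
    "fold": [1, 1, 2, 2, 3, 3],
    "classID": [0, 3, 0, 5, 3, 5],
})

# Leave-one-fold-out evaluation: hold each predefined fold out in turn
# instead of re-shuffling, so clips cut from the same source recording
# never land in both train and test (the data-leakage risk noted above).
for held_out in sorted(metadata["fold"].unique()):
    train = metadata[metadata["fold"] != held_out]
    test = metadata[metadata["fold"] == held_out]
    print(held_out, len(train), len(test))
```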
Feature Extraction
Audio was processed using Librosa. Multiple spectral representations were extracted and compared:
- MFCC (40 coefficients) — primary feature; captures perceptual frequency characteristics
- Mel-Spectrogram — non-linear frequency scale matching human auditory perception
- Chromagram — pitch class distribution, useful for tonal sounds
- Spectral Centroid — brightness measure, distinguishes impulsive vs sustained sounds
- Spectral Roll-off — frequency below which 85% of energy is concentrated
import librosa
import numpy as np

def extract_features(file_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load an audio clip and return a time-averaged MFCC feature vector."""
    audio, sr = librosa.load(file_path, res_type='kaiser_fast', mono=True)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    mfccs_scaled = np.mean(mfccs.T, axis=0)  # average over time frames
    return mfccs_scaled  # shape: (n_mfcc,) — (40,) by default
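The time-averaging step discards temporal ordering but guarantees a fixed-length vector regardless of clip duration, which is what lets variable-length clips share one classifier input shape. A quick check with a synthetic MFCC matrix (the frame count is an assumption, roughly what a 4-second clip yields at Librosa defaults):

```python
import numpy as np

# Synthetic stand-in for an MFCC matrix: 40 coefficients x 173 time frames.
mfccs = np.random.randn(40, 173)

# Same reduction as in extract_features: collapse the time axis so every
# clip, whatever its length, maps to a 40-dimensional vector.
features = np.mean(mfccs.T, axis=0)
print(features.shape)  # (40,)
```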
CNN Architecture
A sequential 7-layer Conv1D model was designed to process 1D MFCC feature vectors:
- 3× Conv1D layers (32 → 64 → 128 filters, kernel size 2, ReLU activation)
- Dropout (50%) after each convolutional layer to prevent overfitting
- Flatten layer to convert feature maps to a 1D vector
- Dense (64, ReLU) → Dense (10, Softmax) output layer
- Optimizer: Adam | Loss: Categorical Cross-entropy | Epochs: 100 | Batch: 32
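To make the convolutional core concrete, here is a minimal NumPy sketch of a 'valid' Conv1D layer with kernel size 2 and ReLU, stacked three times with the filter counts above. This is an illustration of the operation, not the project's actual framework code, and the random weights are placeholders:

```python
import numpy as np

def conv1d(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Minimal 'valid' Conv1D + ReLU. x: (length, in_ch);
    kernels: (n_filters, kernel_size, in_ch)."""
    n_filters, k, _ = kernels.shape
    out_len = x.shape[0] - k + 1
    out = np.empty((out_len, n_filters))
    for t in range(out_len):
        window = x[t:t + k]                          # (k, in_ch)
        out[t] = np.einsum("kc,fkc->f", window, kernels)
    return np.maximum(out, 0.0)                      # ReLU activation

# A 40-dim MFCC vector enters as a length-40, single-channel sequence.
x = np.random.randn(40, 1)
h1 = conv1d(x, np.random.randn(32, 2, 1))      # -> (39, 32)
h2 = conv1d(h1, np.random.randn(64, 2, 32))    # -> (38, 64)
h3 = conv1d(h2, np.random.randn(128, 2, 64))   # -> (37, 128)
print(h3.shape)  # (37, 128)
```

Each kernel-size-2 layer shortens the sequence by one step while widening the channel dimension; the final (37, 128) map is what the Flatten layer unrolls before the dense head.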
Baseline Comparison
Three classical ML classifiers were benchmarked against the CNN using the same MFCC features:
- K-Nearest Neighbors: ~80% accuracy
- Decision Tree: ~65% accuracy
- Random Forest: ~80% accuracy
- CNN: 82% accuracy — best performer, with smoother training curve
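The baseline comparison can be sketched with scikit-learn. The synthetic features below are placeholders standing in for the real time-averaged MFCC vectors, so the printed scores are not the project's reported accuracies:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for 40-dim time-averaged MFCC vectors, 10 classes;
# the real pipeline would feed in the output of extract_features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(10), 50)
X = rng.normal(size=(500, 40)) + 0.5 * y[:, None]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for name, clf in [
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    clf.fit(X_tr, y_tr)                  # same features for every baseline
    scores[name] = clf.score(X_te, y_te)
    print(name, round(scores[name], 3))
```

Holding the features fixed across all classifiers is what isolates the contribution of the model itself, which is the point of the comparison above.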
Results
- Test accuracy: 82% on UrbanSound8K
- Training accuracy at epoch 100: ~85% (minimal overfitting)
- Best performing classes: Gunshot, Siren (distinctive spectral signature)
- Most confused classes: Drilling vs Jackhammer (similar repetitive waveforms)
Code
Full implementation available on GitHub: github.com/monkrishabh/SOUND8K
Connection to Current Work
The MFCC-based audio classification pipeline built here directly informs my current production work on TTS/STT systems at Edza.ai. The same Librosa feature extraction approach — MFCC computation, mel-spectrogram generation, spectral feature analysis — is used daily in production audio pipeline evaluation and model quality assessment.