Course Project · Audio AI

Urban Sound Classification with Convolutional Neural Networks

Course project — 82% accuracy on UrbanSound8K using CNN + MFCC feature extraction

Authors

Rishabh Bhartiya

Date

May 2022

Institution

Università degli Studi di Milano

Supervisor

Prof. Nicolò Cesa-Bianchi

Course

Statistical Methods of Machine Learning

Abstract

This project investigates deep learning approaches for environmental audio classification, framing sound recognition as an image classification problem via Mel-Frequency Cepstral Coefficient (MFCC) feature extraction. A 7-layer Conv1D CNN was designed and trained on the UrbanSound8K benchmark dataset (8,732 samples, 10 classes), achieving 82% test accuracy, competitive with state-of-the-art results reported in the literature at the time. The study also benchmarks classical ML baselines (KNN, Decision Tree, Random Forest) against the CNN, confirming that deep feature learning substantially outperforms pipelines that pair handcrafted features with shallow classifiers on audio data. Completed under Prof. Nicolò Cesa-Bianchi for the Statistical Methods of Machine Learning course at Università degli Studi di Milano.

82% test accuracy on UrbanSound8K — competitive with state-of-the-art at submission

7-layer Conv1D CNN processing MFCC features, outperforming KNN, Decision Tree, Random Forest baselines

Supervised by Prof. Nicolò Cesa-Bianchi — one of Europe's most cited ML researchers

Foundation for current production TTS/STT pipelines at Edza.ai


Overview

This project was completed as part of the Statistical Methods of Machine Learning course, supervised by Prof. Nicolò Cesa-Bianchi — a world-renowned researcher in online learning and statistical ML theory. The task: classify 10 categories of urban environmental sounds using deep learning.

Problem Framing

Raw audio waveforms are high-dimensional and temporally complex. The key insight is to transform audio into a 2D time-frequency representation (MFCC / Mel-spectrogram) and treat classification as an image recognition problem, letting CNN architectures exploit local correlations across time and frequency.

Dataset: UrbanSound8K

  • 8,732 labelled audio clips, each ≤4 seconds
  • 10 classes: Air Conditioner, Car Horn, Children Playing, Dog Bark, Drilling, Engine Idling, Gunshot, Jackhammer, Siren, Street Music
  • Sourced from Freesound.org — real-world recordings with background noise
  • Pre-defined 10-fold cross-validation split to prevent data leakage
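
The predefined folds are recorded in the dataset's metadata CSV, so respecting the split is a matter of filtering on the fold column. A minimal sketch (the local path is an assumption; the file and column names follow the standard UrbanSound8K layout):

import pandas as pd

# assumed local path; the UrbanSound8K archive ships this metadata file
metadata = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')

test_fold = 10  # hold out one predefined fold; rotate over 1..10 for full CV
train_meta = metadata[metadata['fold'] != test_fold]
test_meta = metadata[metadata['fold'] == test_fold]

# each clip sits under audio/fold{fold}/{slice_file_name}, labelled by 'classID'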

Feature Extraction

Audio was processed using Librosa. Multiple spectral representations were extracted and compared:

  • MFCC (40 coefficients) — primary feature; captures perceptual frequency characteristics
  • Mel-Spectrogram — non-linear frequency scale matching human auditory perception
  • Chromagram — pitch class distribution, useful for tonal sounds
  • Spectral Centroid — brightness measure, distinguishes impulsive vs sustained sounds
  • Spectral Roll-off — frequency below which 85% of energy is concentrated

import librosa
import numpy as np

def extract_features(file_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load one clip and return a fixed-length, time-averaged MFCC vector."""
    # librosa resamples to 22,050 Hz by default; kaiser_fast trades a little
    # resampling quality for speed across the 8,732 clips
    audio, sr = librosa.load(file_path, res_type='kaiser_fast', mono=True)

    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    # average over the time axis so clips of different lengths map to the
    # same feature dimensionality
    mfccs_scaled = np.mean(mfccs.T, axis=0)

    return mfccs_scaled  # shape: (n_mfcc,) == (40,)
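
The remaining representations come from analogous Librosa calls. A hedged sketch follows; the time-averaging and concatenation mirror the MFCC pipeline above and are illustrative choices, not necessarily the project's exact aggregation:

def extract_spectral_features(file_path: str) -> np.ndarray:
    """Sketch: time-averaged versions of the other spectral representations."""
    audio, sr = librosa.load(file_path, res_type='kaiser_fast', mono=True)

    mel = librosa.feature.melspectrogram(y=audio, sr=sr)          # (n_mels, T)
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)          # (12, T)
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)  # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr,
                                               roll_percent=0.85) # (1, T)

    # average each over time and stack into a single descriptor vector
    return np.concatenate([f.mean(axis=1)
                           for f in (mel, chroma, centroid, rolloff)])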

CNN Architecture

A sequential 7-layer Conv1D model was designed to process 1D MFCC feature vectors:

  • 3× Conv1D layers (32 → 64 → 128 filters, kernel size 2, ReLU activation)
  • Dropout (50%) after each convolutional layer to prevent overfitting
  • Flatten layer to convert feature maps to a 1D vector
  • Dense (64, ReLU) → Dense (10, Softmax) output layer
  • Optimizer: Adam | Loss: Categorical Cross-entropy | Epochs: 100 | Batch: 32
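
A minimal Keras sketch consistent with the bullets above; the input reshaping and the commented fit call use assumed variable names, and any hyperparameter not listed above is a Keras default:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dropout, Flatten, Dense

model = Sequential([
    # the 40-dim MFCC vector is reshaped to (40, 1) so Conv1D slides
    # across adjacent coefficients
    Conv1D(32, kernel_size=2, activation='relu', input_shape=(40, 1)),
    Dropout(0.5),
    Conv1D(64, kernel_size=2, activation='relu'),
    Dropout(0.5),
    Conv1D(128, kernel_size=2, activation='relu'),
    Dropout(0.5),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),
])

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=100, batch_size=32,
#           validation_data=(X_test, y_test))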

Baseline Comparison

Three classical ML classifiers were benchmarked against the CNN using the same MFCC features:

  • K-Nearest Neighbors: ~80% accuracy
  • Decision Tree: ~65% accuracy
  • Random Forest: ~80% accuracy
  • CNN: 82% accuracy — best performer, with smoother training curve
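
A sketch of the baseline setup with scikit-learn. X_train / X_test are the time-averaged MFCC vectors from the extraction step, and the hyperparameters shown (k = 5, 100 trees) are assumptions, not recorded settings:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

baselines = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)  # y here as integer labels, not one-hot
    print(f'{name}: {clf.score(X_test, y_test):.2%}')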

Results

  • Test accuracy: 82% on UrbanSound8K
  • Training accuracy at epoch 100: ~85% (minimal overfitting)
  • Best performing classes: Gunshot, Siren (distinctive spectral signature)
  • Most confused classes: Drilling vs Jackhammer (similar repetitive waveforms)
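
The per-class observations above come from inspecting a confusion matrix; a sketch of that check, with model and data names carried over from the sketches above:

import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test.reshape(-1, 40, 1)).argmax(axis=1)
cm = confusion_matrix(y_test.argmax(axis=1), y_pred)
# rows are true classes, columns predicted; the off-diagonal mass between
# the Drilling and Jackhammer rows is where most errors concentrate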

Code

Full implementation available on GitHub: github.com/monkrishabh/SOUND8K

Connection to Current Work

The MFCC-based audio classification pipeline built here directly informs my current production work on TTS/STT systems at Edza.ai. The same Librosa feature extraction approach — MFCC computation, mel-spectrogram generation, spectral feature analysis — is used daily in production audio pipeline evaluation and model quality assessment.

CNN · MFCC · Audio Classification · Librosa · UrbanSound8K · Deep Learning
