Literature Review · Computer Vision

Face Recognition in Neural Networks: A Literature Review

Synthesis of 15+ IEEE papers covering CNN architectures, GANs, and heterogeneous face recognition

Authors

Rishabh Bhartiya

Date

December 2021

Institution

Univ. Milano

Supervisor

Prof. Vincenzo Piuri

Abstract

This literature review synthesizes 15+ IEEE-published papers on neural network approaches to face recognition, covering the progression from classical feature-based methods to deep CNN architectures, generative adversarial approaches, and multi-task learning frameworks. The review examines four principal problem settings: standard face recognition (controlled environment), heterogeneous face recognition (cross-modality: visible vs NIR vs thermal), face super-resolution via GANs, and multi-task joint recognition. Key architectural innovations reviewed include FaceNet (triplet loss), DeepFace (3D alignment), ArcFace (additive angular margin loss), and conditional GANs for cross-modal synthesis. The review identifies model robustness under pose, illumination, and occlusion variation as the central open challenge.

Synthesis of 15+ IEEE papers spanning FaceNet, ArcFace, DeepFace, and GAN-based approaches

Covers 4 research threads: standard FR, heterogeneous (cross-modal) FR, super-resolution, multi-task learning

Identifies loss function design (ArcFace margin loss) as more impactful than architecture choice

Bridges to audio AI: connects face embedding techniques to speaker verification literature


Scope and Motivation

Face recognition sits at the intersection of computer vision, deep learning, and human-computer interaction — and is directly relevant to my broader interest in perceptual AI (audio, image, multimodal systems). This review was conducted as part of my MS coursework to develop systematic understanding of how the field evolved from handcrafted features to end-to-end deep learning.

Review Structure

The 15+ papers are organized across four research threads:

Thread 1: Standard Face Recognition (Controlled Environments)

Papers reviewed: FaceNet (Schroff et al., Google), DeepFace (Taigman et al., Facebook), ArcFace (Deng et al., Imperial College London).

  • FaceNet — introduces triplet loss for direct metric learning in embedding space. A key insight: face verification and recognition collapse into a single embedding model when trained with sufficient data (200M images in original paper)
  • ArcFace — additive angular margin loss improves class separability in hyperspherical embedding space. Became the dominant loss function for face recognition post-2019
  • DeepFace — 3D face alignment before CNN feature extraction; demonstrated importance of normalization pre-processing
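The two loss functions above can be sketched concisely. Below is a minimal numpy illustration (not the papers' implementations): FaceNet's triplet loss on squared Euclidean embedding distances, and ArcFace's additive angular margin, which adds the margin to the *angle* between a unit-norm embedding and its class weight vector before rescaling.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embeddings: pull the anchor toward
    the positive (same identity), push it from the negative (different
    identity) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def arcface_logit(embedding, class_center, margin=0.5, scale=64.0):
    """ArcFace additive angular margin for the ground-truth class:
    cos(theta + m) instead of cos(theta), rescaled by `scale` before
    softmax. Inputs are assumed L2-normalized."""
    cos_theta = np.clip(float(np.dot(embedding, class_center)), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return scale * np.cos(theta + margin)
```

Note how the margin penalizes even a perfectly aligned embedding (`theta = 0` still yields `scale * cos(m) < scale`), which is what forces tighter intra-class clustering than plain softmax.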

Thread 2: Heterogeneous Face Recognition (Cross-Modality)

Papers reviewed: NIR-VIS synthesis via cGAN, thermal-to-visible translation, sketch-photo matching.

Heterogeneous FR addresses the domain gap between different imaging modalities (visible light vs. near-infrared vs. thermal vs. sketch). The dominant approach uses conditional GANs to synthesize visible-domain images from the other modality before applying a standard FR model — narrowing the gap even when paired cross-modal training data are scarce.

Thread 3: Face Super-Resolution via GANs

Papers reviewed: SRGAN applications to low-resolution face recognition, identity-preserving super-resolution.

Surveillance and real-world images are often low-resolution (16×16 to 32×32 pixels). Standard SR methods optimize pixel-level PSNR but destroy identity-discriminative features. Identity-preserving SR adds a face recognition loss term to the GAN training objective, forcing the generator to preserve features that matter for recognition rather than perceptual sharpness.
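That combined objective can be written down directly. The sketch below (hypothetical names; `fr_embed` stands in for an assumed frozen FR network mapping an image to a unit-norm embedding) adds an identity term — embedding distance between the super-resolved and ground-truth high-res faces — to the usual adversarial and pixel terms.

```python
import numpy as np

def identity_preserving_sr_loss(sr_image, hr_image, fr_embed,
                                adv_loss, lambda_pix=1.0, lambda_id=1.0):
    """Generator objective for identity-preserving super-resolution.

    pixel:    MSE between SR output and ground truth (the PSNR-style term)
    identity: squared embedding distance under a frozen FR network, which
              forces the generator to keep identity-discriminative features
              rather than only pixel fidelity.
    """
    pixel = np.mean((sr_image - hr_image) ** 2)
    e_sr, e_hr = fr_embed(sr_image), fr_embed(hr_image)
    identity = np.sum((e_sr - e_hr) ** 2)
    return adv_loss + lambda_pix * pixel + lambda_id * identity
```

With `lambda_id = 0` this degrades to a plain SRGAN-style objective, which is exactly the configuration the reviewed papers show destroys recognition accuracy.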

Thread 4: Multi-Task Learning for Face Analysis

Papers reviewed: joint face detection + landmark localization + recognition, attribute prediction combined with identity verification.

Multi-task frameworks share early convolutional layers across related face analysis tasks, using task-specific heads for each objective. The shared representation learns more generalizable face features than single-task training, improving performance on all tasks simultaneously — particularly when individual tasks have limited data.
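Structurally, this is a shared trunk feeding task-specific heads. A toy numpy sketch (illustrative dimensions and names, not any reviewed paper's architecture) makes the weight sharing explicit: one feature extractor, three heads.

```python
import numpy as np

rng = np.random.default_rng(42)

class MultiTaskFaceNet:
    """Toy shared-trunk multi-task model: a single feature extractor
    (here one linear + ReLU layer) feeds separate heads for face
    detection, landmark localization, and identity classification."""

    def __init__(self, in_dim=64, feat_dim=32, n_landmarks=5, n_ids=10):
        self.trunk = rng.standard_normal((in_dim, feat_dim)) * 0.1
        self.heads = {
            "detect": rng.standard_normal((feat_dim, 1)) * 0.1,            # face / no-face score
            "landmarks": rng.standard_normal((feat_dim, n_landmarks * 2)) * 0.1,  # (x, y) per landmark
            "identity": rng.standard_normal((feat_dim, n_ids)) * 0.1,      # identity logits
        }

    def forward(self, x):
        # Shared representation: computed once, reused by every head, so
        # gradients from all tasks shape the same early features.
        feat = np.maximum(x @ self.trunk, 0.0)
        return {name: feat @ w for name, w in self.heads.items()}
```

During training, each task's loss backpropagates through its own head and the shared trunk, which is why data-poor tasks benefit from data-rich ones.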

Key Observations Across Papers

  • Loss function design is more impactful than architecture choice in modern face recognition — ArcFace outperforms larger models trained with softmax loss
  • Data scale dominates for standard FR; for heterogeneous FR, synthetic data from GANs partially compensates for limited paired training data
  • Pose and occlusion remain the two primary failure modes across all reviewed systems, even at state-of-the-art accuracy on benchmark datasets
  • The benchmark gap is real: models achieving 99%+ on LFW (Labeled Faces in the Wild) perform significantly worse on surveillance-grade or cross-modality data

Connection to My Research Direction

Face recognition and audio recognition share more architecturally than their surface domains suggest. Both transform raw perceptual signals (pixels, waveforms) into discriminative embeddings via convolutional feature extraction. The triplet loss approach from FaceNet has been directly applied to speaker verification (x-vectors, d-vectors). The cross-modal synthesis techniques in heterogeneous FR mirror the audio-to-spectrogram transformation I use in TTS quality evaluation.

This review established my systematic understanding of how deep learning solves perceptual recognition problems — a foundation that directly informs my current work on multimodal AI systems spanning audio and vision.

Tags: Face Recognition · CNN · GAN · ArcFace · FaceNet · Heterogeneous FR · Multi-task Learning · Literature Review

