Signal Processing Meets Deep Learning
Spectrograms, CNNs, and when classical methods still win
The promise of deep learning is that you can throw raw data at a neural network and it will learn the features that matter. In computer vision, this works: feed RGB pixels into a ResNet, and it learns edges, textures, and object parts without manual feature engineering. But for signals—audio, vibration, ultrasound, radar—the story is different. Raw waveforms are high-dimensional, temporally sparse, and contain information at multiple scales. Most successful signal classification systems don't start with the raw signal. They start with spectrograms.
This is the intersection where classical signal processing and deep learning meet. You use Fourier transforms and filterbanks to create a time-frequency representation, then treat that representation as an image. The CNN learns to classify defects in ultrasonic data, detect bearings about to fail, or identify vocal pathologies—all from what is essentially a 2D image derived from 1D time-series data.
This approach works because it respects the structure of the problem. Signals live in frequency space as much as they do in time space, and the most discriminative features are often spectral: harmonics, formants, resonances, transient events. Deep learning doesn't replace signal processing. It builds on top of it.
Why Raw Signals Are Hard for Neural Networks
Consider an ultrasonic inspection signal sampled at 10 MHz. One millisecond of data is 10,000 samples. If you feed this directly into a 1D CNN, you're asking the model to learn convolution kernels that can detect patterns across this high-dimensional space. The network would need to learn frequency-selective filters from scratch—essentially reinventing the Fourier transform through backpropagation. Some work has shown that 1D CNNs can learn these filters, but it requires enormous amounts of data and careful architectural choices. In low-data regimes—common in industrial inspection—this rarely works.
The problem is sample efficiency. A defect in a weld might manifest as a 200 microsecond echo in a specific frequency band. In the time domain, this is a needle in a haystack. In the frequency domain, it's a bright spot at a known location. The STFT gives you this for free.
There's also the issue of invariance. Signals shift in time due to triggering jitter, sensor placement, and environmental factors. A 1D convolution has a limited receptive field, so detecting a pattern that can occur anywhere in a 10,000-sample window requires deep stacks of layers. The spectrogram collapses time into fixed-width frames, making the problem more tractable.
This doesn't mean you should never use raw waveforms. For some tasks—like speech synthesis or audio generation—you want to preserve phase information and fine temporal structure. But for classification and detection, spectrograms are almost always better. They give the model a head start by encoding domain knowledge into the representation.
Spectrogram Generation: STFT and Mel-Spectrograms
The Short-Time Fourier Transform is the workhorse of signal-to-image conversion. You window the signal into overlapping chunks, apply an FFT to each chunk, and stack the results to form a 2D array: time on the horizontal axis, frequency on the vertical, and magnitude encoded as intensity or color.
The key parameters are window size and hop length. A longer window gives better frequency resolution but worse time resolution; a shorter window does the opposite. At a 10 MHz sampling rate, for example, a 256-sample window gives a frequency resolution of about 39 kHz and a time resolution of 25.6 microseconds. For speech, a 25 ms window with a 10 ms hop is standard. For ultrasonic inspection, you might use a 100-sample window with 50% overlap, depending on your sampling rate and the bandwidth of interest.
The mel-spectrogram is a perceptually motivated variant used primarily in audio. Instead of a linear frequency scale, it applies a filterbank that mimics the human ear's logarithmic frequency response. Low frequencies get more bins, high frequencies get fewer. This is useful for speech and music, where most of the information is below 8 kHz. For industrial signals, you typically stick with linear frequency scaling unless you're working with acoustic emissions that span multiple octaves.
In Python, this is straightforward with librosa for audio or scipy.signal.spectrogram for general signals. The spectrogram is usually log-scaled (converting power to decibels) to compress the dynamic range and make weak features more visible; this is the same log-compression step used in the log-mel spectrograms common in audio ML. The snippet below uses a synthetic tone as a stand-in for real data.
import numpy as np
from scipy.signal import spectrogram
# Placeholder signal: a 100 kHz tone in noise, sampled at 1 MHz; substitute real data here
sample_rate = 1_000_000
time = np.arange(0, 0.01, 1 / sample_rate)
signal = np.sin(2 * np.pi * 100_000 * time) + 0.5 * np.random.randn(time.size)
# Compute the STFT spectrogram: frequency bins, frame times, and power
f, t, Sxx = spectrogram(signal, fs=sample_rate, nperseg=256, noverlap=128)
# Convert power to dB scale; the small constant avoids log(0)
Sxx_dB = 10 * np.log10(Sxx + 1e-10)
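For audio work, librosa offers the mel-spectrogram route directly. A minimal sketch, using a synthetic tone in place of real audio; the window, hop, and mel settings here are illustrative rather than prescriptive:
import numpy as np
import librosa
# Placeholder audio: a 440 Hz tone at 22,050 Hz; substitute a real recording in practice
sr = 22050
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
# Mel-spectrogram with ~23 ms windows, ~10 ms hop, and 64 mel bands
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=220, n_mels=64)
# Log-compress to decibels, referenced to the peak value
S_dB = librosa.power_to_db(S, ref=np.max)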
Once you have the spectrogram, it's just a 2D array. You can normalize it, resize it, and feed it to any image classification model.
Feature Extraction: Classical and Learned
Before deep learning, signal classification meant hand-crafted features. You would compute statistical summaries—mean, variance, skewness, kurtosis—in both time and frequency domains. For audio, you might extract MFCCs (Mel-Frequency Cepstral Coefficients), zero-crossing rate, or spectral rolloff. For vibration analysis, you'd look at RMS, peak-to-peak amplitude, crest factor, and specific frequency bands corresponding to bearing faults or gear mesh frequencies.
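A sketch of the time-domain part of such a feature set, assuming x is a 1D numpy array of samples (the names and the exact feature list are illustrative):
import numpy as np
from scipy.stats import kurtosis, skew
def time_domain_features(x):
    """Statistical summary features of a 1D signal, in the spirit of classical vibration analysis."""
    rms = np.sqrt(np.mean(x ** 2))
    peak_to_peak = np.max(x) - np.min(x)
    crest_factor = np.max(np.abs(x)) / rms  # peak relative to RMS; sensitive to impulsive faults
    return np.array([np.mean(x), np.var(x), skew(x), kurtosis(x), rms, peak_to_peak, crest_factor])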
These features were domain-specific and required expertise. You had to know that a bearing defect produces sidebands around the shaft rotation frequency, or that a crack in a turbine blade changes the resonance peak. You'd extract these features, feed them to an SVM or random forest, and tune the model until it worked.
Deep learning replaces this with learned features. The CNN's early layers learn edge detectors and texture filters—analogous to Gabor filters in classical signal processing. The middle layers learn combinations of these primitives: patterns of harmonics, modulation structures, transient events. The final layers learn task-specific combinations that map directly to class labels.
This is powerful because the features adapt to the data. You don't need to know in advance which frequency bands matter. The model figures it out. But this requires data, and lots of it. If you have 100 labeled examples, hand-crafted features often outperform end-to-end learning. This is where hybrid approaches shine: extract spectrograms and a few engineered features, concatenate them, and train a model that uses both. You get the domain knowledge encoded in the features and the flexibility of learned representations.
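One way to set this up in a low-data regime (a sketch, with placeholder arrays standing in for real spectrograms, engineered features, and labels) is to pool each spectrogram down to a small grid, concatenate it with the engineered features, and train a conventional classifier:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
def pooled_spectrogram(Sxx_dB, n_freq=8, n_time=8):
    """Mean-pool a dB spectrogram down to an n_freq x n_time grid and flatten it."""
    f_bins = np.array_split(np.arange(Sxx_dB.shape[0]), n_freq)
    t_bins = np.array_split(np.arange(Sxx_dB.shape[1]), n_time)
    grid = [[Sxx_dB[np.ix_(fb, tb)].mean() for tb in t_bins] for fb in f_bins]
    return np.array(grid).ravel()
rng = np.random.default_rng(0)
X_spec = [rng.normal(size=(129, 64)) for _ in range(40)]  # placeholder spectrograms
X_eng = [rng.normal(size=7) for _ in range(40)]           # placeholder engineered feature vectors
y = rng.integers(0, 2, size=40)                           # placeholder labels
X = np.stack([np.concatenate([pooled_spectrogram(s), e]) for s, e in zip(X_spec, X_eng)])
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)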
CNN Architectures for Spectrogram Classification
Once you have spectrograms, the problem becomes image classification. Standard CNN architectures work out of the box. For small datasets, a simple custom architecture is often sufficient: a few convolutional layers with batch normalization and max pooling, followed by dense layers and a softmax output.
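A minimal sketch of such an architecture in PyTorch (the framework choice and layer sizes are assumptions, not prescriptions; the softmax is applied implicitly by the cross-entropy loss during training):
import torch
import torch.nn as nn
class SmallSpectrogramCNN(nn.Module):
    """Conv -> BatchNorm -> ReLU -> MaxPool blocks, then dense layers producing class logits."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, n_classes))
    def forward(self, x):  # x: (batch, 1, freq_bins, time_steps)
        return self.classifier(self.features(x))
model = SmallSpectrogramCNN(n_classes=2)
logits = model(torch.randn(4, 1, 128, 256))  # e.g. 128 frequency bins x 256 time frames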
For larger datasets or when you need better performance, you can use established architectures. ResNet, EfficientNet, and MobileNet all work well. The choice depends on your constraints: latency, model size, accuracy requirements. ResNet-18 is a good starting point—deep enough to learn complex features, but not so deep that it overfits on small datasets.
One consideration is input size. Most ImageNet-pretrained models expect 224x224 RGB images. Your spectrogram might be 128x256 (frequency bins x time steps) and single-channel. You can resize it, pad it, or replicate the channel to create a fake RGB image. This is not elegant, but it works. Alternatively, you can modify the first convolutional layer to accept single-channel input.
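With torchvision, for instance, swapping the first convolution for a single-channel version is a two-line change (a sketch; note that the new first layer loses its pretrained weights, and channel replication is the alternative shown in the last comment):
import torch
import torch.nn as nn
from torchvision import models
# ImageNet-pretrained ResNet-18, adapted to single-channel spectrograms and two classes
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 2)
spec = torch.randn(4, 1, 128, 256)  # batch of single-channel spectrograms
logits = model(spec)
# Alternative: keep conv1 unchanged and fake an RGB image with spec.repeat(1, 3, 1, 1)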
Attention mechanisms—originally from NLP—have also found their way into signal processing. A simple attention layer can learn to focus on specific frequency bands or time intervals, effectively learning which parts of the spectrogram are most discriminative. This is useful when defects are localized in time or frequency, and you want the model to ignore irrelevant background noise.
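A minimal attention-pooling layer, one of many possible forms, that learns a softmax weighting over the time frames of a feature sequence (a sketch; feature dimensions are illustrative):
import torch
import torch.nn as nn
class TemporalAttentionPool(nn.Module):
    """Score each time step, softmax the scores, and return the attention-weighted average."""
    def __init__(self, n_features):
        super().__init__()
        self.score = nn.Linear(n_features, 1)
    def forward(self, x):  # x: (batch, time_steps, n_features)
        weights = torch.softmax(self.score(x), dim=1)  # (batch, time_steps, 1), sums to 1 over time
        return (weights * x).sum(dim=1)                # (batch, n_features)
pool = TemporalAttentionPool(n_features=64)
pooled = pool(torch.randn(4, 100, 64))  # e.g. 100 time frames of 64-dimensional CNN features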
Transfer Learning from Image Models
Transfer learning is the most powerful tool in the low-data regime. Instead of training a CNN from scratch on your 500 labeled spectrograms, you start with a model pretrained on ImageNet—a dataset of 1.2 million natural images—and fine-tune it on your data.
This works because the early layers of a CNN learn generic features: edges, corners, textures. These features transfer across domains. The later layers learn task-specific features, which you replace or fine-tune. In practice, you freeze the early layers, replace the final classification head, and train only the last few layers on your data. This reduces overfitting and training time.
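In PyTorch, the freeze-and-replace recipe is only a few lines (a sketch; how many layers to leave trainable is a judgment call that depends on dataset size):
import torch
import torch.nn as nn
from torchvision import models
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():  # freeze the pretrained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # new head; its parameters are trainable by default
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # only the new head is updated during training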
The surprising result is that transfer learning works even when the source domain (natural images) and target domain (spectrograms) are very different. A model trained to recognize cats and dogs can be fine-tuned to detect cracks in ultrasonic data. The low-level features—edges, gradients, textures—are universal enough to generalize.
There are limits. If your spectrogram is highly structured—say, a narrow-band signal with a few harmonics—the ImageNet features might not help much. In these cases, training from scratch or using a shallower architecture can work better. But for most industrial signals, which are noisy and complex, transfer learning is the default strategy. Some recent work has explored pretraining on large-scale audio datasets like AudioSet or speech corpora. These domain-specific pretrained models can outperform ImageNet models for audio tasks, but the gains are often marginal unless you have a very large target dataset.
Industrial Inspection: NDT and Predictive Maintenance
The real-world applications are in industrial inspection. Nondestructive testing (NDT)—ultrasonic, eddy current, acoustic emission—generates signals that need to be classified as defect or no-defect. Traditional methods rely on manual inspection by trained operators, which is slow, subjective, and expensive.
Deep learning automates this. You collect ultrasonic A-scans from known-good and known-bad welds, convert them to spectrograms, and train a classifier. Deploy it on the production line, and it flags defects in real time. The system isn't perfect—it makes mistakes—but it's consistent, scalable, and can be updated as new defect modes are discovered.
Predictive maintenance is similar. You monitor vibration signals from rotating machinery—motors, pumps, turbines—and train a model to predict bearing failure, imbalance, or misalignment. The key challenge is that failures are rare, so you have extreme class imbalance. Techniques like oversampling, synthetic data generation, and anomaly detection help, but the fundamental problem is that you're training a model to recognize something it has barely seen.
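One common mitigation, sketched below with hypothetical label counts, is to weight the loss inversely to class frequency so the rare fault class is not drowned out:
import torch
import torch.nn as nn
counts = torch.tensor([950.0, 50.0])             # hypothetical counts: healthy vs. fault windows
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights: about [0.53, 10.0]
criterion = nn.CrossEntropyLoss(weight=weights)   # each fault example now carries roughly 19x the weight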
This is where domain knowledge reenters the picture. You don't train a blind classifier. You use physics-based features—bearing fault frequencies, gear mesh harmonics—alongside learned features. You combine the CNN's output with rule-based heuristics: if the model says "bearing fault" but the frequency content doesn't match known fault patterns, you flag it for manual review.
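A post-processing check along those lines might look like the sketch below, where the expected fault frequency and the detection threshold are hypothetical inputs derived from the bearing geometry:
import numpy as np
from scipy.signal import hilbert
def confirm_bearing_fault(x, fs, expected_fault_hz, tol_hz=2.0):
    """Check whether the envelope spectrum of a vibration signal has a peak near the expected fault frequency."""
    envelope = np.abs(hilbert(x))  # amplitude envelope of the raw vibration signal
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    freqs = np.fft.rfftfreq(len(envelope), d=1 / fs)
    band = (freqs > expected_fault_hz - tol_hz) & (freqs < expected_fault_hz + tol_hz)
    peak = spectrum[band].max() if band.any() else 0.0
    return peak > 5 * np.median(spectrum)  # crude threshold: peak well above the spectrum's noise floor
# If the CNN says "bearing fault" but confirm_bearing_fault(...) returns False, route to manual review.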
In practice, these systems are deployed as hybrid pipelines: signal preprocessing, spectrogram generation, CNN-based classification, post-processing, and integration with existing SCADA or MES systems. The model is one component in a larger system, not a drop-in replacement for human expertise.
When Classical Signal Processing Beats Deep Learning
Deep learning is not a universal solution. There are cases where classical methods—matched filters, wavelet transforms, time-frequency analysis, envelope detection—outperform CNNs.
The first case is interpretability. A CNN gives you a probability distribution over classes. A classical method gives you a measurement: "the resonance peak shifted by 200 Hz" or "the envelope kurtosis increased by 3 sigma." For safety-critical applications—aerospace, nuclear, medical devices—you need to explain why the system flagged a defect. A deep learning model is a black box.
The second case is low-data regimes. If you have 20 labeled examples, a handcrafted feature extractor tuned by an expert will outperform a CNN. The expert encodes years of domain knowledge into those features. The CNN has to learn from scratch. In these cases, you don't need deep learning. You need a good feature set and a simple classifier.
The third case is when the signal model is known and parametric. If you're detecting a sinusoid in noise, a matched filter is optimal. If you're tracking a frequency-modulated signal, a phase-locked loop works better than any neural network. Deep learning shines when the problem is complex, high-dimensional, and hard to model analytically. But for well-understood problems with closed-form solutions, classical methods are faster, more reliable, and easier to deploy.
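As a concrete instance, a matched filter for a known pulse shape is just correlation with the template (a sketch with a synthetic 50 kHz pulse buried in noise):
import numpy as np
rng = np.random.default_rng(0)
fs = 1_000_000
template = np.sin(2 * np.pi * 50_000 * np.arange(200) / fs) * np.hanning(200)  # known pulse shape
x = rng.normal(scale=0.5, size=5000)
x[3000:3200] += template  # bury the pulse in noise starting at sample 3000
matched = np.correlate(x, template, mode="valid")  # correlation = convolution with the time-reversed template
print("detected at sample", int(np.argmax(np.abs(matched))))  # prints approximately 3000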
Finally, there's the issue of computational cost. A CNN usually needs a GPU or dedicated accelerator to hit real-time inference throughput; a classical algorithm runs on a microcontroller. For edge devices—wireless sensors, embedded systems, IoT nodes—power and latency constraints often rule out deep learning.
The best approach is usually hybrid. Use signal processing to extract the representation—spectrograms, wavelets, envelope spectra—and use deep learning to classify it. Use domain knowledge to constrain the model: freeze certain layers, add physics-based loss functions, or augment the training data with simulated signals. The goal is not to replace one paradigm with the other, but to combine them in a way that leverages the strengths of both.
Conclusion
Signal processing and deep learning are complementary. Spectrograms bridge the gap, turning raw waveforms into images that CNNs can consume. Transfer learning makes this practical even with small datasets. Industrial applications in NDT and predictive maintenance show that these methods work in production, but they require careful integration with domain knowledge and classical techniques.
The key insight is that deep learning doesn't eliminate the need for signal processing. It shifts the question from "what features should I extract?" to "what representation should I use?" The answer is usually a time-frequency representation—a spectrogram, a wavelet transform, or something domain-specific—that preserves the structure of the signal while making it tractable for a neural network.
This is the state of the field as of 2024. As models get better and datasets get larger, we might see end-to-end learning directly from raw waveforms become practical. But for now, the sweet spot is the intersection: classical signal processing for representation, deep learning for classification, and domain expertise to tie it all together.