
Steadiness or dynamicity of the feature

Time extent of the description provided by the feature

Global

Instantaneous

Abstractness of the feature

i.e. what the feature represents

Is the feature abstract or concrete?

Is it a mathematical process with little concrete meaning, or does it represent something we can understand from experience?

Extraction process of the feature

directly computed from the waveform

i.e. in the time domain

after a transform

spectral characteristics

from a model (e.g. sinusoidal model)

e.g. for harmonicity/noisiness

using psychoacoustic/physical models

Feature groups

Temporal shape

computed from the waveform or amplitude envelope

attack time

temporal increase/decrease

effective duration

Temporal features

auto-correlation coefficients

zero-crossing rate
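The two temporal features listed above can be sketched directly on a frame of time-domain samples. This is a minimal illustration, not a reference implementation: the function names are hypothetical, a sample equal to zero is counted as positive, and the autocorrelation is normalised by the frame energy (other normalisations are in use).

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to 0 is treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def autocorrelation(frame, max_lag):
    """Autocorrelation coefficients for lags 0..max_lag, divided by the frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0:
        return [0.0] * (max_lag + 1)
    return [
        sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / energy
        for lag in range(max_lag + 1)
    ]
```

A noisy or high-frequency frame gives a zero-crossing rate close to 1, while a slowly varying frame gives a value close to 0.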

Energy features

global energy

harmonic energy

noise energy

Spectral shape (timbral texture) features

centroid

spread

skewness

kurtosis

spectral roll-off

the frequency Rn below which 85% of the energy distribution of the magnitude spectrum is concentrated
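The spectral shape features above can be sketched by treating the normalised magnitude spectrum of one frame as a distribution over frequency. The function names and the `freqs`/`mags` inputs (bin frequencies in Hz and bin magnitudes) are illustrative assumptions. The roll-off here accumulates magnitudes; squaring them first would give an energy-based variant.

```python
import math


def spectral_moments(freqs, mags):
    """Return (centroid, spread, skewness, kurtosis) of one magnitude spectrum."""
    total = sum(mags)
    if total == 0:
        raise ValueError("spectrum is all zeros")
    weights = [m / total for m in mags]  # normalised magnitudes act as probabilities
    centroid = sum(f * w for f, w in zip(freqs, weights))
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs, weights))
    spread = math.sqrt(variance)
    if spread == 0:
        # Higher moments are undefined for a single-bin spectrum; reported as 0.0 here.
        return centroid, 0.0, 0.0, 0.0
    skewness = sum((f - centroid) ** 3 * w for f, w in zip(freqs, weights)) / spread ** 3
    kurtosis = sum((f - centroid) ** 4 * w for f, w in zip(freqs, weights)) / spread ** 4
    return centroid, spread, skewness, kurtosis


def spectral_rolloff(freqs, mags, fraction=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches `fraction` of the total."""
    threshold = fraction * sum(mags)
    cumulative = 0.0
    for f, m in zip(freqs, mags):
        cumulative += m
        if cumulative >= threshold:
            return f
    return freqs[-1]
```

The 85% threshold is exposed as the `fraction` parameter because other values are also used in practice.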

MFCC

Mel-Frequency Cepstral Coefficients

Three steps:

(1) Mel-scale filterbank

The most common implementation of MFCC is calculated using 13 linearly spaced filters separated by 133.33 Hz between their center frequencies, followed by 27 log-spaced filters (separated by a factor of 1.0711703 in frequency), resulting in 40 filterbank values for each STFT frame.
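As a rough sketch of the filter layout described in step (1), the centre frequencies can be generated from the spacing values quoted above. The function name is hypothetical, and the first centre frequency (`first_centre_hz`) is an assumption, since the text does not state it. The filter shapes themselves (typically triangular) are not covered.

```python
def filterbank_centres(
    first_centre_hz=133.33,   # assumption: not given in the text
    n_linear=13,
    linear_step_hz=133.33,    # spacing between linear centre frequencies, as quoted
    n_log=27,
    log_factor=1.0711703,     # ratio between successive log-spaced centres, as quoted
):
    """Return the 40 centre frequencies: 13 linearly spaced, then 27 log-spaced."""
    centres = [first_centre_hz + i * linear_step_hz for i in range(n_linear)]
    for _ in range(n_log):
        centres.append(centres[-1] * log_factor)
    return centres
```

All spacing values are parameters, so a different layout can be tried without changing the code.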

(2) Log energy computation

The next step consists of computing the logarithm of the magnitude of each of the filterbank outputs. This can be viewed as a simple step of dynamic compression, making feature extraction less sensitive to variations in dynamics.

(3) Discrete cosine transform

Reducing the dimensionality of the 40 filterbank outputs by performing a discrete cosine transform (DCT).
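Steps (2) and (3) can be sketched as follows, starting from the 40 filterbank magnitudes of one frame. This is a minimal illustration under stated assumptions: the function names are hypothetical, the DCT is an unnormalised DCT-II, the `floor` guard against log(0) is an addition, and keeping 13 coefficients is a common choice that the text does not specify.

```python
import math


def log_compress(filterbank_values, floor=1e-10):
    """Natural log of each filterbank magnitude; `floor` avoids log(0)."""
    return [math.log(max(v, floor)) for v in filterbank_values]


def dct_ii(values, n_coeffs):
    """First `n_coeffs` coefficients of the unnormalised DCT-II of `values`."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


def mfcc_from_filterbank(filterbank_values, n_coeffs=13):
    """Log compression followed by a DCT, keeping only the first coefficients."""
    return dct_ii(log_compress(filterbank_values), n_coeffs)
```

Keeping only the first few DCT coefficients is what reduces the dimensionality: the discarded higher-order coefficients describe fine detail across the filterbank channels.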

From: Li, T., Ogihara, M., & Tzanetakis, G. (Eds.). Music Data Mining. Chapter 2.

Delta and delta-delta (first- and second-order differences of the coefficients over time)
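A simple way to illustrate these two features is a frame-to-frame difference applied once and then twice. This is a sketch only: the function names are hypothetical, and practical systems often use a regression over several neighbouring frames instead of the plain difference assumed here.

```python
def delta(frames):
    """First-order difference along time.

    `frames` is a list of equal-length coefficient lists, one per analysis frame.
    The first output frame is all zeros because it has no predecessor.
    """
    if not frames:
        return []
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def delta_delta(frames):
    """Second-order difference: the delta of the delta."""
    return delta(delta(frames))
```

The resulting vectors are usually appended to the static coefficients of each frame.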

MPEG-7 low-level audio descriptors

spectral flatness

crest factors
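Both descriptors above compare simple statistics of a power spectrum. The sketch below computes them over a whole spectrum, which is a simplification: they are commonly computed per frequency band. The function names and the `floor` guard against log(0) are assumptions.

```python
import math


def spectral_flatness(power, floor=1e-12):
    """Geometric mean divided by arithmetic mean of the power values."""
    values = [max(p, floor) for p in power]
    geometric = math.exp(sum(math.log(v) for v in values) / len(values))
    arithmetic = sum(values) / len(values)
    return geometric / arithmetic


def spectral_crest(power):
    """Maximum power value divided by the mean power value."""
    mean = sum(power) / len(power)
    if mean == 0:
        raise ValueError("spectrum is all zeros")
    return max(power) / mean
```

A flat (noise-like) spectrum gives a flatness near 1 and a crest near 1, while a spectrum dominated by a single peak gives a flatness near 0 and a large crest.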

Harmonic features

Instantaneous features calculated from sinusoidal modeling

harmonic/noise ratio

odd-to-even harmonic energy ratio and tristimulus

harmonic deviation
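Two of the harmonic features above can be sketched from a list of harmonic amplitudes, such as those estimated by a sinusoidal model for one frame. The function names are hypothetical, and conventions vary: this sketch assumes squared amplitudes for the odd-to-even ratio and plain amplitudes for the tristimulus, with `amps[0]` holding the fundamental.

```python
def odd_to_even_ratio(amps):
    """Energy of odd harmonics (1, 3, 5, ...) divided by energy of even harmonics."""
    odd = sum(a * a for a in amps[0::2])   # amps[0] is harmonic 1
    even = sum(a * a for a in amps[1::2])  # amps[1] is harmonic 2
    if even == 0:
        raise ValueError("no energy in even harmonics")
    return odd / even


def tristimulus(amps):
    """Relative weight of harmonic 1, harmonics 2 to 4, and harmonics 5 and above."""
    total = sum(amps)
    if total == 0:
        raise ValueError("all harmonic amplitudes are zero")
    t1 = amps[0] / total
    t2 = sum(amps[1:4]) / total
    t3 = sum(amps[4:]) / total
    return t1, t2, t3
```

The three tristimulus values sum to 1, so they describe how the harmonic content is distributed rather than how loud the sound is.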

Perceptual features

relative specific loudness

sharpness

spread

From: Peeters, G. (2004). A large set of audio features for sound description (similarity and classification) in the CUIDADO project (pp. 1–25).

Other perspective

Statistical properties

Spectral shape

Technical/Signal Properties

Intensity properties

From: Lerch, A. (n.d.). Chapter 3: Instantaneous features. In An Introduction to Audio Content Analysis (pp. 31–69).

Pope

Time-domain low-level

e.g. windowed RMS

Time-domain high-level

e.g. tempo, beat structure

Frequency-domain low-level

e.g. pitch tracking, spectral peaks

Frequency-domain high-level

e.g. instrument identification

From Pope's BigMATBook

Feature classification in the MPEG-7 standard

MPEG-7 is a multimedia content description standard

Part 4 of the standard deals with audio

basic

Instantaneous waveform and power values

basic spectral

Log-frequency power spectrum and spectral features (e.g. spectral centroid, spectral spread, spectral flatness)

signal parameters

fundamental frequency and harmonicity of signals

temporal timbral

Log attack time and temporal centroid

spectral timbral

specialized spectral features in a linear frequency space…

spectral basis representations

a number of features used together for sound recognition, obtained by projecting the spectrum into a low-dimensional space.