Introduction

In this article, I will introduce you to the basic concepts and commonly used terms related to digital audio.

Are you curious about any of the following?

How are sounds stored and transmitted over the Internet?
What is “pitch”? How do we perceive loudness?
What do 48 kHz and 16-bit depth mean in audio?
What are stereo, mono, and surround sound?

If so, this article is a good place to start. I’ll keep it simple and provide an intuitive explanation of these concepts.

Let’s get started without any further ado, shall we?

Real-world Sound

In the real world, sounds are produced when objects vibrate. Human vocal cords, musical instruments, airplanes and cars, sirens, bells, and whistles all create sound when something in them vibrates, causing the air pressure around them to vary periodically in step with the vibrating object. This variation in air pressure propagates through the medium (i.e., air) until it reaches the listener’s ears, which sense it and send signals to the brain, producing the sensation we call “sound.”

Real-world sound mostly travels through air, but it can also travel through any other medium whose molecules can be compressed and released to carry the pressure variation. It cannot travel through a vacuum, which has no molecules.

Two properties of vibration are critical to understanding how sound works: amplitude and frequency.

Amplitude refers to the intensity of vibration, i.e., the extent of the change caused in the air pressure. The larger the pressure variation, the louder the sound.
Frequency refers to the number of times the sound source vibrates in one second. We perceive frequency as pitch, which is associated with the shrillness of sound: the higher the pitch, the shriller the sound. Pitch, together with the overall tonal quality of the sound, helps the brain identify the voice or the source of the sound.

So far, so good.

Now let’s understand how real-world sound gets captured, processed, transmitted, and reproduced by machines.

Capturing Audio

As we all know, processors (in computers, mobile phones, network routers, TVs, set-top boxes, personal audio devices,…) work with numbers and not with real-world signals like sound and light.

So, as the first step, we need to convert sound to a form that processors can understand and process.

Microphones (mics for short) are the devices that take the first step in converting sound to a form understood by processors.

In technical terms, a mic is a transducer that converts sound (i.e., variation in air pressure, as we learned earlier) to a continuous voltage proportional to the air pressure. By recording the voltage variation across time, we get an audio signal waveform that can be processed mathematically to achieve various objectives.

The following figure shows a continuous periodic sine tone audio signal (which is considered fundamental for signal analysis and processing). The x-axis denotes time in seconds, and the y-axis denotes the amplitude.

Mathematically, the signal can be described as

y = sin(2πft)

where t is the time in seconds and f = 1000 Hz is the frequency of the tone.

Notice that the sine wave completes one cycle in 0.001 seconds (or 1 millisecond), and the same waveform keeps repeating. This means that the waveform has 1000 cycles in one second, which corresponds to the frequency or pitch of the sound. The unit of frequency is the hertz (number of cycles in one second; written as Hz). So the frequency of this sound signal is stated as 1000 Hz or 1 kHz.
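As a rough illustration of the link between period and frequency, the following Python sketch evaluates a 1 kHz sine tone at two instants that are exactly one period apart. The function name sine_tone and the chosen instants are assumptions made for this example only.

```python
import math

FREQUENCY_HZ = 1000.0          # 1 kHz tone, as in the figure
PERIOD_S = 1.0 / FREQUENCY_HZ  # duration of one cycle: 0.001 s


def sine_tone(t, frequency_hz=FREQUENCY_HZ):
    """Amplitude of a unit sine tone at time t (in seconds)."""
    return math.sin(2.0 * math.pi * frequency_hz * t)


# A quarter of the way through a cycle, the sine reaches its peak.
peak = sine_tone(PERIOD_S / 4)

# One full period later, the waveform is back at the same value.
repeat = sine_tone(PERIOD_S / 4 + PERIOD_S)

print(f"period = {PERIOD_S * 1000:.1f} ms")
print(f"peak = {peak:.3f}, one period later = {repeat:.3f}")
```

Changing FREQUENCY_HZ to 2000.0 halves the period, which is another way of saying that the tone completes twice as many cycles per second.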

Digitization of Audio by Sampling & Quantization

It is impossible to store a continuous signal or waveform (known as an analog signal) in a computer’s memory. It needs to be digitized, i.e., converted into a finite series of bytes that closely approximates the actual continuous voltage signal, to allow processors to process the signal.

The first step towards achieving this is called sampling. Sampling approximates the analog signal by recording its amplitude at equally spaced instants of time and discarding the amplitude values in between. The recorded values are known as samples. The figure below shows the sampled version of the continuous signal shown in the previous figure.

The number of samples that are taken from each second of audio is known as the sampling rate. If we record 48000 samples per second of audio, the sampling rate is 48000 Hz or 48 kHz. So now you know! By the way, the sampling rate is 48 kHz in the figure above.
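To make the idea concrete, here is a minimal Python sketch that samples one millisecond of a 1 kHz sine tone at 48 kHz. The helper sample_sine is a hypothetical name used only for this illustration. A real recording would read the values from a microphone rather than compute them.

```python
import math

SAMPLE_RATE_HZ = 48000   # samples recorded per second
TONE_HZ = 1000           # frequency of the tone being sampled


def sample_sine(tone_hz, sample_rate_hz, duration_s):
    """Return the amplitudes of a sine tone at equally spaced instants."""
    count = round(sample_rate_hz * duration_s)
    return [math.sin(2.0 * math.pi * tone_hz * n / sample_rate_hz)
            for n in range(count)]


# One millisecond of audio is exactly one cycle of the 1 kHz tone.
samples = sample_sine(TONE_HZ, SAMPLE_RATE_HZ, duration_s=0.001)
samples_per_cycle = SAMPLE_RATE_HZ // TONE_HZ

print(len(samples), samples_per_cycle)
```

Each cycle of the tone is described by 48 numbers here. Everything the waveform did between those instants is simply not stored.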

It turns out that a sampled signal can be fully recovered as long as the sampling rate is more than twice the highest frequency present in the signal. Since human hearing extends to roughly 20 kHz, a sampling rate a little above 40 kHz is enough to capture all the sounds that humans can hear. Understanding why this works requires signal processing theory, which we’ll avoid in this introductory article.
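Without going into the theory, a quick numerical check hints at why the sampling rate matters. The sketch below, which assumes a 48 kHz sampling rate and two arbitrarily chosen test tones, shows that a 30 kHz tone (above half the sampling rate) produces the same sample values as an inverted 18 kHz tone, so the two cannot be told apart after sampling. The helper sample_tone is a hypothetical name.

```python
import math

SAMPLE_RATE_HZ = 48000


def sample_tone(tone_hz, count, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return `count` samples of a unit sine tone."""
    return [math.sin(2.0 * math.pi * tone_hz * n / sample_rate_hz)
            for n in range(count)]


high = sample_tone(30000, 16)   # above half the sampling rate (24 kHz)
alias = sample_tone(18000, 16)  # 48 kHz minus 30 kHz

# Sample for sample, the 30 kHz tone equals the 18 kHz tone with its sign
# flipped, so their sum is zero up to floating-point rounding.
largest_gap = max(abs(h + a) for h, a in zip(high, alias))

print(f"largest difference: {largest_gap:.2e}")
```

Tones below 24 kHz do not suffer from this ambiguity at a 48 kHz sampling rate, which covers the range of human hearing.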

44.1 kHz (common in consumer audio) and 48 kHz (common for audio in video) sampling rates are widely used as they work well for almost all audio signals.

After the signal is sampled, a fixed number of bits (called the bit depth) is allocated for storing the amplitude of each sample. 16, 24, and 32 bits are the bit depths commonly used in digital audio processing. 16 bits per sample is widely used as it is sufficient for end-user applications. Higher bit depths are used in professional audio applications like content creation (capturing), mixing, mastering, and editing.
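Sampling rate and bit depth together determine how much raw data a recording produces. The following back-of-the-envelope sketch assumes uncompressed audio with a single channel by default. The function name pcm_bits_per_second and its channels parameter are illustrative assumptions.

```python
def pcm_bits_per_second(sample_rate_hz, bit_depth, channels=1):
    """Raw data rate of uncompressed audio, in bits per second."""
    return sample_rate_hz * bit_depth * channels


rate_16 = pcm_bits_per_second(48000, 16)   # 16-bit samples
rate_24 = pcm_bits_per_second(48000, 24)   # 24-bit samples

# Storage needed for one minute of 16-bit audio, in bytes.
one_minute_bytes = rate_16 * 60 // 8

print(rate_16, rate_24, one_minute_bytes)
```

At 48 kHz, moving from 16 to 24 bits per sample increases the raw data rate by half, which is one reason the higher bit depths are mostly kept for production work.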

There is an important effect of bit depth that we need to note. Amplitudes of an analog signal are real numbers with infinite possible values. Consider a digital audio stream that uses 16 bits for representing amplitude. Then the maximum number of amplitude values that can be represented is 2^16 = 65,536.

So the amplitude range of the analog signal is divided into 65,536 intervals, and the amplitudes of all samples are mapped to a representative value in their interval. As a result, a small precision loss is introduced in the amplitude values during analog-to-digital conversion.

This step of fitting real-valued analog amplitudes into a fixed number of bits is called quantization, and the precision loss described above is termed quantization error. The good news is that, at the bit depths mentioned above, this loss is largely mathematical and not really perceivable by human listeners.
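A minimal sketch of 16-bit quantization follows. It assumes amplitudes are normalized to the range -1.0 to 1.0 and uses a simple symmetric rounding scheme. Real converters differ in detail, and the names quantize and dequantize are hypothetical.

```python
BIT_DEPTH = 16
LEVELS = 2 ** BIT_DEPTH              # 65,536 representable values
MAX_CODE = 2 ** (BIT_DEPTH - 1) - 1  # 32767, largest signed 16-bit value


def quantize(amplitude, max_code=MAX_CODE):
    """Map an amplitude in [-1.0, 1.0] to the nearest integer code."""
    clipped = max(-1.0, min(1.0, amplitude))
    return round(clipped * max_code)


def dequantize(code, max_code=MAX_CODE):
    """Map an integer code back to an amplitude in [-1.0, 1.0]."""
    return code / max_code


original = 0.300001
code = quantize(original)
restored = dequantize(code)
error = abs(original - restored)   # the quantization error for this sample

print(code, restored, error)
```

With this scheme, the error for any in-range sample is at most half a step, i.e., 0.5 / 32767, or about 0.0015% of full scale.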

What are Mono, Stereo, and Surround Sound in Audio?

When we hear a sound, what are the things that we sense from it? Let me list some of them.

Who is the speaker or the source of sound (e.g., a vehicle or a musical instrument)?
What is the speaker trying to convey?
Loudness (too high, too low, comfortable)
Feelings of pleasure (music), annoyance (noise)
The direction from which the sound is coming
The direction in which the sound source moves if it is a moving object (e.g., coming towards you, going away from you, ascending, descending, moving from right to left, etc.)
The distance of the source from you (near, far,…)

Channels in Digital Audio

The brain needs to process the sounds heard by both ears to judge direction and distance. How do we replicate our real-life experience with digital audio systems?

This is accomplished by using more than one speaker placed (ideally) at acoustically strategic positions and playing a slightly different recording of the same scene on each speaker. These different recordings of (usually) the same scene are called channels. All the channels are combined into a single digital audio file for storage or streaming.
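One common way of combining channels into a single file is to interleave their samples, as the WAV format does. The sketch below uses Python’s standard wave module to write a short two-channel file into memory and read its properties back. The tone, the quieter second channel, and the 10 ms duration are arbitrary choices made for this illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 48000
FRAME_COUNT = 480  # 10 ms of audio at 48 kHz

# Two channels: a 1 kHz tone on the left, a quieter copy on the right.
left = [math.sin(2.0 * math.pi * 1000 * n / SAMPLE_RATE_HZ)
        for n in range(FRAME_COUNT)]
right = [0.5 * value for value in left]

# Interleave as L0, R0, L1, R1, ... and pack as 16-bit signed integers.
frames = bytearray()
for left_value, right_value in zip(left, right):
    frames += struct.pack("<hh",
                          round(left_value * 32767),
                          round(right_value * 32767))

buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(2)               # two channels
    wav_file.setsampwidth(2)               # 2 bytes = 16 bits per sample
    wav_file.setframerate(SAMPLE_RATE_HZ)
    wav_file.writeframes(bytes(frames))

# Read the file back and inspect what was stored.
buffer.seek(0)
with wave.open(buffer, "rb") as wav_file:
    channel_count = wav_file.getnchannels()
    frame_count = wav_file.getnframes()

print(channel_count, frame_count)
```

Each frame in the file holds one sample per channel, so a player can hand the first value of every frame to one speaker and the second value to the other.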

In professional audio production (live telecast, music, movies, TV shows,…), several mics are placed strategically for recording. The feeds from the mics are mixed into different channels by experts to produce the program’s final digital audio.

Mono and Stereo Sound

Terms like mono, stereo, 5.1, 7.1,… refer to the number of channels in the audio. Mono has only one channel, stereo has two channels (left and right), 5.1 has 6 channels (5 main channels and 1 subwoofer channel), and so on.

A stereo speaker system has left and right speakers in front of the listener. A 2.1 system adds a subwoofer (low frequency) channel to stereo. A 3 channel system has left, right, and center speakers.

Surround Sound

“Surround sound” audio (5.1 and 7.1, as used in Dolby Digital and DTS, are common examples) adds channels located to the side and rear of the listener to the standard left, right, and center channels. This creates a sensation of sound coming from any direction around the listener at ear height. For example, a 5.1 system typically has left, right, and center speakers in front, with left and right surround speakers slightly behind the listener.

Immersive Sound

This concept has been taken to the next level through “immersive sound” or “object-based audio,” which allows sound sources to be virtually placed anywhere in 3D space, including above the listener. Many immersive sound systems use height speakers above the listener’s level in addition to ground-level speakers. For example, the 7.1.4 layout recommended for Dolby Atmos has 7 ground-level speakers, 1 subwoofer, and 4 height speakers.
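The layout labels used in this article can be read as counts of main, subwoofer, and height speakers. The small sketch below decodes them under that assumption. The function count_speakers is a hypothetical helper, not part of any audio library, and it only handles numeric labels with up to three parts.

```python
def count_speakers(layout):
    """Split a layout label such as "7.1.4" into its speaker groups."""
    parts = [int(part) for part in layout.split(".")]
    parts += [0] * (3 - len(parts))  # pad missing subwoofer/height counts
    main, subwoofer, height = parts
    return {
        "main": main,
        "subwoofer": subwoofer,
        "height": height,
        "total": main + subwoofer + height,
    }


for label in ("2.1", "5.1", "7.1.4"):
    print(label, count_speakers(label))
```

Reading the labels this way gives 3 speakers for 2.1, 6 for 5.1, and 12 for 7.1.4, matching the counts described above.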

This brings us to the end of this introductory article on digital audio. I hope you found it useful and enjoyable.

Feel free to share your comments on this article below.

Mohammed Harris

Mohammed Harris is a freelance writer on technology, especially multimedia and its applications.

Previously, he worked in the embedded multimedia software industry for more than 15 years, with successful stints in multimedia codec and systems engineering, project management, and technical management. He delivered several successful projects and solutions for Tier-1 customers from semiconductor, consumer electronics OEMs, and multimedia algorithm / IP licensing majors.

He has a keen interest and strong expertise in multimedia signal processing.

He holds a Bachelor’s degree in Electrical and Electronics Engineering from Birla Institute of Technology and Science (BITS), Pilani (India), and a Master’s degree in Electrical Engineering with emphasis on Communications and Signal Processing from the University of California, San Diego (USA).

Fundamentals of Digital Audio – Simplified