We’ve all heard it before: “this is just an MP3, it doesn’t sound good”. We know there’s probably some truth to that, but to what extent? In this article we’ll talk about encoding formats, and specifically about their unique encoding artifacts and why they exist.
Let’s imagine you’re in a grocery store. The shopping baskets fit 10 items each, but you know that to make a great dinner, you have to buy twelve. They can’t all fit in the basket (and because this is an example, you aren’t allowed to stuff them into your pants either), so you have to prioritize and leave two items off your shopping list.
You decide to leave out the saffron and the tomato purée, since you’re confident their absence would likely go unnoticed in the finished dinner. You’ve successfully encoded your dinner in a lossy format: some of it, chosen by a psychoacoustic model, was intentionally left out to fit the constraints.
That, simply put, is what encoding is: leaving out pieces of data to get a smaller file. Each encoder also packs the audio data in its own way, so that only the matching decoder can unpack it. The two ends share an understanding of how audio can first be compacted into, say, two data points, which the decoder then knows how to expand back into audio.
But that alone is not enough. This kind of compact representation only shrinks the file by a small factor; something else needs to be thrown out. This is where psychoacoustics comes into play. Masking is a phenomenon where the human ear will not register sounds that are covered by a more prominent sound in the time domain, the frequency domain, or both.
Let’s imagine ourselves in a court lobby. For some reason they haven’t invested in sound-proof doors, so the conversation can be heard from outside the room. To make the conversation unintelligible, you have to mask it with broadband noise whose frequency spectrum is at least as wide as that of the voice being masked and that plays 5 dB louder than it. To make the speech completely inaudible, the noise has to be 10 dB louder. In the example below, you will hear a track where the door lets the voice through almost unaltered. First the noise rises to 5 dB above the speech, then to 10 dB, and then back down to 5 dB. Compare these results.
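Those dB offsets are easy to put into concrete terms: a level difference in decibels maps to a linear amplitude ratio via 10^(dB/20). A quick sketch (the function name here is ours, just for illustration):

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in decibels to a linear amplitude ratio."""
    return 10 ** (db / 20)

# The +5 dB masker has ~1.78x the amplitude of the speech,
# and the +10 dB masker ~3.16x the amplitude (which is 10x the power).
ratio_5db = db_to_amplitude_ratio(5)
ratio_10db = db_to_amplitude_ratio(10)
```

So “10 dB louder” is not a subtle difference: the noise carries ten times the power of the speech it drowns out.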
In practice, though, wherever masking is used there’s already a noise floor, or room tone: the AC, the water pipes, and every other constant noise source in the room. On top of that there’s often human-generated noise, which by itself can mask quite a lot. Let’s move from the court lobby to a workplace cafeteria. Right next door is a meeting room which, again, has very poor sound insulation; in this example, the door is assumed to have a sound reduction index of 10 dB (which is still very poor). In the audio below, note that the cafeteria conversation alone could be enough to mask the conversation in the meeting room, and that it also masks the masking noise itself. Again, the noise rises from 0 dB to 5, then 10, then back down to 5 and 0.
You can compare auditory masking to painting a picture. First, draw a cat (the choice of animal probably doesn’t matter) in the middle. Then, using the same color (the same frequency spectrum, in the audio analogy), draw a house in front of the cat (since louder = closer). You have now successfully masked the cat in the picture.
This is how masking in the frequency domain works. The examples above were laboratory-grade and don’t translate directly to the real world, mainly because of their clinical nature, but masking is a widely used technique for hiding noises and can sometimes be very cheap compared to structural improvements.
In audio encoding, and specifically in lossy compression, it’s essential to discard the parts we don’t need. As we learned from the example, we cannot hear information that sits just below something louder masking it. Spending bits to encode something whose absence we can’t detect is a waste, so we can safely throw away the bits that represented those unheard parts of the audio.
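As a toy sketch of that idea (the levels and thresholds below are made up for illustration, not taken from any real psychoacoustic model): an encoder compares each frequency bin’s level against the masking threshold computed for that frame, and only spends bits on the bins that rise above it.

```python
import numpy as np

# Made-up per-bin levels (dB) for one short frame, and a masking
# threshold a psychoacoustic model would have computed for that frame.
bin_levels_db = np.array([60.0, 42.0, 55.0, 30.0, 48.0])
mask_threshold_db = np.array([50.0, 50.0, 45.0, 40.0, 40.0])

# Bits are only spent on bins audible above the masking threshold;
# everything below it is inaudible and can be discarded for free.
keep = bin_levels_db > mask_threshold_db
kept_bins = np.flatnonzero(keep)   # bins 0, 2 and 4 survive
dropped = int((~keep).sum())       # 2 bins are thrown away unheard
```

Real encoders do this per critical band and per frame, and the threshold itself moves with the signal, but the bookkeeping is the same: masked content simply never gets bits.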
Another masking phenomenon is forward and backward masking, where a quieter sound that either precedes or follows a louder sound goes unregistered. Demonstrating it, however, requires a dynamic range of nearly 80 dB, so it won’t be included in this article. Stanford University has published a demonstration; see more:
Backward and forward masking
Now that we have two separate masking phenomena, there’s a lot of information we don’t need to encode into the material, which means we’re saving bits.
You may now wonder why this is necessary, and you’re right to ask: it isn’t anymore. But when we started delivering and storing audio digitally, storage space and internet bandwidth weren’t as plentiful as they are today. A dial-up modem in the 90s, for example, topped out at 56 kbps; ISDN brought that up to 128 kbps, and in the late 90s cable internet started offering speeds of a whopping 1.5 Mbps. On a non-cable connection, downloading a single song could take tens of minutes. MP3 was born into that world, where a larger file would have meant the user paid more just for downloading the track.
Imagine what it would have been like in the 90s if PCM audio had been served to the public: one album in PCM could take a day to download. Internet traffic at the start of the 90s was about 0.01 petabytes (ten thousand gigabytes) per month, but by the end of 1999 it had risen to 28 petabytes per month.
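The back-of-the-envelope math behind that “a day” claim, assuming a 45-minute album and a 56 kbit/s modem running at its theoretical maximum:

```python
# CD-quality PCM: 44,100 samples/s * 16 bits * 2 channels = 1,411.2 kbit/s.
pcm_kbps = 44_100 * 16 * 2 / 1000

album_seconds = 45 * 60                    # a 45-minute album
album_kbits = pcm_kbps * album_seconds     # ~3.8 million kbit (~476 MB)

# Over a 56 kbit/s dial-up line, ignoring all protocol overhead:
download_hours = album_kbits / 56 / 3600   # roughly 19 hours
```

And that is the best case; with real-world modem throughput and line drops, “a day” is optimistic.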
Nowadays, however, as storage space is either effectively free or extremely cheap, there’s little point in storing audio in lossy formats at low bitrates. Even though data transmission costs are proportional to the amount transferred, they are so tiny that streaming a PCM file at 1,411 kbps will cost you almost the same as streaming an MP3 at 128 kbps. Even for service providers the bandwidth cost is negligible: Spotify, which streams in lossy formats, costs more than Qobuz, which streams in PCM formats.
In some scenarios, though, compressed audio is still the best choice. Most Bluetooth headphones, for example, still require lower-bitrate audio, mainly because receiving less data consumes less power. Lossless Bluetooth codecs such as aptX Lossless exist (it’s only lossless up to CD quality and is limited by Bluetooth bandwidth), but the vast majority of end devices still use AAC or aptX.
Even though the whole audio chain, from file creation through streaming services and smartphones to the end device, technically allows linear PCM, there’s rarely a good reason to use it. Using high bitrates with lossy codecs means more “space in the shopping basket” for data that would otherwise have been left out. Apple AAC at roughly 280–320 kbps, for example, is considered as good as lossless for listening purposes. MP3 also performs very well at high bitrates, though AAC will always outperform MP3 at the same bitrate. It’s a little like driving a car: you know it will do 160 mph, but driving that fast just because it’s possible is a bit stupid.
Using relatively high bitrates for lossy encoding gives us very good results. For example, here’s the same track encoded with Apple’s AAC encoder at 448 kbit/s.
Just because psychoacoustic theory says some audio will go unheard doesn’t mean it’s always correct, or that the encoder always gets it right. Encoding, especially at low bitrates, introduces sounds that weren’t in the original track. These encoding artifacts sound different depending on the encoder used, but two main artifacts usually appear in all of them. The first and most noticeable is “chirping”, easily heard on cymbals and other high-frequency transient instruments. The other common artifact is “growling”, which is usually masked well enough by the actual music. Below you will hear three separate tracks, all encoded at 96 kbit/s with the MP3, AAC and Opus encoders.
MP3:
AAC:
Opus:
For comparison, this is the original PCM track:
If you haven’t spent your free time listening to encoding artifacts, like some of us apparently have, you might not notice anything, or might just find it “weird”. You’re on the right track: because the psychoacoustic model works with the way we hear, there’s usually nothing you can instantly point to as missing. When listening to the artifacts alone, however, there’s suddenly a lot to hear. Below is a track containing only the encoding artifacts of MP3 at 160 kbps. It is the difference between the original and the encoded track, so it contains only the sounds that were in the original but not in the encoded version (omission artifacts) and the sounds that were not in the original but were added to the encoded track (encoding artifacts). The former come from the psychoacoustic model deciding “this is unnecessary data, leave it out”; the latter from the same model failing to predict the sound’s signature.
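The way such a difference track is produced can be sketched like this. It’s a simplification: with a real codec you would first decode the MP3 back to PCM and compensate for the encoder/decoder delay so the two signals line up sample-for-sample, and the samples below are made up for illustration.

```python
import numpy as np

def artifact_track(original, decoded):
    """Subtract the decoded audio from the original, sample by sample.
    What remains is exactly what the codec removed or invented."""
    return np.asarray(original, dtype=float) - np.asarray(decoded, dtype=float)

# Tiny made-up example. Where the codec dropped content (index 1), the
# difference carries the original sound; where it invented content
# (index 2), the difference carries the inverted addition.
orig = np.array([0.5, -0.2, 0.0, 0.3])
dec = np.array([0.5, 0.0, 0.1, 0.3])
diff = artifact_track(orig, dec)   # array([ 0. , -0.2, -0.1,  0. ])
```

Both kinds of artifact end up in the same residual, which is why the track below mixes “what went missing” and “what got added” into one signal.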
Artifacts: (NOTE! This is quieter than the other tracks. If you turn up the volume, remember to lower it after the track.)