Audio Fingerprinting: A Basic Theory Of How Shazam Works

by masterjosh
Apr 04, 2019


The Scenario

You’re at a coffee shop (or restaurant, or club) and hear a beautiful song playing. This song may be completely new to you or may be one you heard many times as a child. But you don’t know its name. You really like this song and desperately want to find its title so you can play it on repeat later on. Fortunately, you have a phone with music recognition software (Shazam) installed.

So, you quickly launch the Shazam app, point your phone at the music source, and record a few seconds of the music… and you are saved! The software has just given you not just the title of the song, but also the artist, album, release date, and a whole lot of other metadata. Now you can relax, because you have the information you need to add the song to your favorite music player or playlist and hear it again and again until you become tired of it or until it becomes a part of you.

"Audio Fingerprinting - The Theory Of How Shazam Works"

The above scenario is quite common these days in our futuristic world. And most people are satisfied with just being consumers of such “magical” technology. But for the hackers, the algorithm lovers, the software developers, and the unapologetically curious ones amongst us, consuming cool technology is never enough. We want to know more! What is behind how Shazam works? How come it is often so accurate? And what about just humming a song: is there a way to identify songs by humming alone?

The secret to the entire process is known as Audio Fingerprinting (or Acoustic Fingerprinting), which is itself one of the core techniques behind Automatic Content Recognition (ACR). This article will provide a non-scholarly, high-level explanation of Audio Fingerprinting, with Shazam as a case study.

How Shazam Works – An Oversimplified Summary

First, some background. Shazam is not a new service. It was founded in 1999 and is therefore older than what we call a smartphone these days. In the early days of Shazam, users actually had to dial in by phone and hold their handset up to the music. However, like most old technologies that later gain explosive popularity (AI and ML, I’m thinking of you), the fundamental algorithm behind Shazam (and Audio Fingerprinting in general) has remained largely the same.

Shazam works pretty well even in noisy environments like nightclubs and bars, as long as the song in question already exists in Shazam’s database. You can start recording at any point in the song, whether during the intro, chorus, or verse. Shazam will create a fingerprint for the recorded sample, consult its database, and use its music recognition algorithm to tell you exactly which song you are listening to.

For this service to work well, Shazam has a growing music library of more than 8 million songs (11 million according to some sources). If we assume exactly 8 million songs and assume that each song is 3 minutes long, that is 24 million minutes of audio, which works out to roughly 45.6 years of playing each song back to back.

With a song library this huge, how does Shazam manage to search through it all and give you results in mere seconds? This is where Audio Fingerprinting comes in. When you ‘Shazam’ a song, the actual audio files in the database are not what gets searched. Instead, Shazam has used its audio fingerprinting algorithm to generate audio fingerprints for the music files in its database ahead of time. These audio fingerprints consist of collections of numerical data (not audio recordings).

When you hold your phone up to a song you’d like to identify, Shazam quickly turns your clip into an audio fingerprint using the same algorithm. Now it’s just a matter of numerical data search and pattern matching. Shazam searches its library for fingerprints that match the one created from your recording. When it finds a match, it has found your song!

As developers know, numerical (or text, or string) search can be performed at ridiculous speeds.
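
To make that concrete, here is a toy sketch (in Python, and emphatically not Shazam’s actual code) of how fingerprint matching can boil down to fast hash lookups. It assumes the fingerprint hashes have already been computed; the database maps each hash to the songs and time offsets where it occurs, and a clip matches a song when many of its hashes line up at one consistent time offset.

```python
from collections import defaultdict

# Toy fingerprint database: hash -> list of (song_id, offset_in_song).
# In a real system these hashes would be derived from spectrogram peaks.
db = defaultdict(list)

def index_song(song_id, hashes):
    """Store (song_id, offset) for every fingerprint hash in a song."""
    for offset, h in hashes:
        db[h].append((song_id, offset))

def match_clip(clip_hashes):
    """Vote for (song_id, offset_delta) pairs; a true match piles up
    many votes at one consistent delta between song time and clip time."""
    votes = defaultdict(int)
    for clip_offset, h in clip_hashes:
        for song_id, song_offset in db.get(h, []):
            votes[(song_id, song_offset - clip_offset)] += 1
    if not votes:
        return None
    (song_id, _delta), count = max(votes.items(), key=lambda kv: kv[1])
    return song_id, count

# Tiny demo with made-up hash values:
index_song("song_a", [(0, 111), (1, 222), (2, 333), (3, 444)])
index_song("song_b", [(0, 999), (1, 333), (2, 555)])
print(match_clip([(0, 222), (1, 333), (2, 444)]))  # -> ('song_a', 3)
```

Because each per-hash lookup is a constant-time dictionary access, the cost of a query grows roughly with the clip length, not with the size of the song library.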

I will not discuss the full details of any production audio fingerprinting algorithm, so the snippets in this article are only toy sketches meant to make the ideas concrete. But here’s some more information for you curious folks…

In a Scientific American article from 2003 (not linked here since it’s behind a paywall), Avery Li-Chun Wang, co-founder and chief scientist at Shazam, shared some interesting information about how Shazam makes its audio fingerprints. He explained that his company’s audio fingerprinting approach was long considered computationally impractical, since the general consensus was that there was too much information in a song to compile into a simple signature. However, as he struggled with the problem, Wang had a brilliant idea: what if he ignored nearly everything in a song and focused instead on just a few relatively “intense” moments? With this approach, Shazam creates a spectrogram for each song in its database.

In basic terms, a spectrogram is a graph that plots three dimensions of audio: time, frequency, and amplitude.
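
If you want to see one for yourself, here is a minimal sketch using NumPy and SciPy (a common toolchain for this kind of analysis, though certainly not the only one). Rather than loading a real song, it builds a synthetic two-tone signal so it runs without any audio files:

```python
import numpy as np
from scipy import signal

fs = 8000                                   # sample rate in Hz
t = np.arange(0, 2.0, 1 / fs)               # two seconds of "audio"

# Synthetic song: a 440 Hz tone for one second, then an 880 Hz tone.
x = np.where(t < 1.0,
             np.sin(2 * np.pi * 440 * t),
             np.sin(2 * np.pi * 880 * t))

# f: frequency bins, times: time bins, Sxx: energy per (frequency, time) cell.
f, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256)

# Each column of Sxx answers "which frequencies are loud right now?"
loudest = f[Sxx.argmax(axis=0)]
print(loudest[:3], loudest[-3:])            # ~440 Hz early on, ~880 Hz later
```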

Shazam’s audio fingerprinting algorithm then selects only the points in the graph that represent notes containing higher energy than the notes surrounding them, that is, the peaks of the graph. As Avery Wang explained in his academic paper about how Shazam works, this audio fingerprinting technique translates to about 3 data points per second per song. While this is still a lot of data considering the number of songs out there, it is nothing compared to the size of the full audio data. And it is numerical data, which is much easier to process and compare than raw audio data.
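
Continuing the sketch above, here is one simple way (again, not Shazam’s exact method) to keep only those peaks: treat a cell as a peak if it is the loudest cell in a local neighborhood of the spectrogram, a check that `scipy.ndimage.maximum_filter` can do for us:

```python
import numpy as np
from scipy import ndimage

def spectrogram_peaks(Sxx, neighborhood=20, min_power=1e-6):
    """Return (freq_bin, time_bin) pairs where Sxx has a local maximum.

    A cell survives only if it is the loudest cell in the
    neighborhood x neighborhood window around it and sits above a
    noise floor, mimicking the "keep only the intense moments" idea.
    """
    local_max = ndimage.maximum_filter(Sxx, size=neighborhood) == Sxx
    return np.argwhere(local_max & (Sxx > min_power))

# Demo on a fake spectrogram with a few sparse bright spots:
rng = np.random.default_rng(0)
Sxx = rng.random((129, 60)) ** 4
peaks = spectrogram_peaks(Sxx)
print(len(peaks), "peaks kept out of", Sxx.size, "cells")
```

Each surviving peak (a frequency and a time) is the kind of raw material that gets combined into the numerical hashes searched in the matching sketch earlier.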

Despite ignoring nearly all of the information in a song, Shazam is still able to provide very accurate song matches. And as mentioned above, it works pretty well even in noisy environments. Shazam’s former CEO, Andrew Fisher, said the company had also found a way to match music that has been sped up, as radio DJs may do to fit in a song before an ad break, or as club DJs sometimes do to match a certain tempo.

Overall, from a programmer’s perspective, I think the accuracy of Shazam is pretty impressive. However, it does have its failings too – especially when there’s not enough data.

For example, you may be rushing to get your phone out and ‘Shazam’ a song just as the song is ending, and you may not capture enough recorded seconds for Shazam to accurately identify it. There are also errors when people try to look up live performances on TV, since a live rendition rarely matches the fingerprint of the studio recording in the database. And don’t even try humming a song for Shazam to identify, no matter how great a singer you are. Shazam’s mathematical audio fingerprinting algorithm doesn’t exactly care about your vocal prowess.

Oh I can’t hum the song? What about that song stuck in my head that I want to identify?! Read on…

Beyond Shazam – Identify Songs By Humming (“Query By Humming”)

When you have a song stuck in your head that you want to identify, Shazam can’t help you. This can be very frustrating, and it is why you will often see desperate “name that tune” questions posted all over the internet.

You can feel their pain, right?

But you have a number of different options available to you. For example, there are web apps and Facebook and Reddit communities that let you sing a song for other people to identify. And if you know a few lyrics of the song, there are apps that let you type in the lyrics you remember and then attempt to identify the song from them.

However, from my forever-programmatic perspective, the least embarrassing and quickest options are apps that offer a “Query By Humming” service.

Query-by-humming apps let users sing or hum a short piece of a song’s melody into their phone and then retrieve information about the song.

Probably the most notable app that provides this service is SoundHound.
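
How might query by humming work under the hood? One classic textbook approach (far simpler than anything SoundHound actually ships) is the Parsons code: reduce a melody to just the up/down/repeat direction of successive notes, which survives off-key singing, then compare contours with edit distance. The sketch below assumes the pitch sequence has already been extracted from the hum; pitch detection itself is a separate problem:

```python
def parsons_code(pitches):
    """Collapse a pitch sequence (e.g. MIDI note numbers) into a melodic
    contour: 'u' = up, 'd' = down, 'r' = repeat."""
    return "".join(
        "u" if b > a else "d" if b < a else "r"
        for a, b in zip(pitches, pitches[1:])
    )

def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

# Hypothetical contour database built from known melodies:
melodies = {
    "Ode to Joy":      parsons_code([64, 64, 65, 67, 67, 65, 64, 62]),
    "Twinkle Twinkle": parsons_code([60, 60, 67, 67, 69, 69, 67]),
}

# A hum that is off-key but has the same shape as Ode to Joy:
hummed = parsons_code([62, 62, 64, 66, 66, 64, 62, 60])
print(min(melodies, key=lambda name: edit_distance(hummed, melodies[name])))
```

Because only the contour matters, the hum above still matches “Ode to Joy” even though every note is sung two semitones flat.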

Audio Fingerprinting From The Developer’s Perspective – Use Cases, Etc.

This article has mainly focused on explaining Audio Fingerprinting from a user/consumer perspective.

However, picture another scenario… this time, you’re a developer: you have just published your new music player app, and someone downloads and installs it on their device. They then go to add some of their existing music files to your app, and because the uploaded files lack proper metadata, your app has no information to display for album cover, artist, and so on. Even worse, they may be presented with a bunch of garbled and unstructured data that was originally part of the metadata in the files they uploaded.

The user’s poor first experience might make them leave your app and find something better.

But how can you display proper metadata and album covers when the music files your users upload lack this information? Again, the answer lies in Audio Fingerprinting.

The core benefit of Audio Fingerprinting is that a piece of music can be reliably identified from the audio itself. So, as a developer, you do not need to rely on your users to manage their music metadata by themselves. Why throw such a maintenance burden on your users? It is always better to assume that users are lazy (even if they are not). That way, you ship your best product and offer them the greatest user experience possible.
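
As a concrete sketch of what this can look like, here is how a music player might fill in missing metadata using the open-source Chromaprint fingerprinter and the AcoustID web service through the pyacoustid Python package. This is one real-world option among several; the tuple fields below follow pyacoustid’s documented match() helper (treat the exact shape as an assumption to verify against its docs), and you would need your own AcoustID API key:

```python
import acoustid  # pip install pyacoustid; also requires the Chromaprint library

API_KEY = "your-acoustid-api-key"  # placeholder: register at acoustid.org

def lookup_metadata(path):
    """Fingerprint an audio file and return a (title, artist) guess,
    or None if nothing in the AcoustID database matched."""
    try:
        # match() fingerprints the file and queries AcoustID, yielding
        # (score, recording_id, title, artist) tuples, best matches first.
        for score, recording_id, title, artist in acoustid.match(API_KEY, path):
            if score > 0.5:  # arbitrary confidence cutoff for this sketch
                return title, artist
    except acoustid.AcoustidError:
        pass  # e.g. network failure or an unreadable file
    return None

print(lookup_metadata("some_untagged_song.mp3"))  # hypothetical file path
```

With a lookup like this running when files are imported, your app can show the right title, artist, and artwork even for files that arrive with no tags at all.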

Conclusion/Summary

Besides the user and developer scenarios discussed above, the use cases for audio fingerprinting and query by humming are wide-ranging. Here are some popular ones:

  1. Users trying to find out the song they are listening to in a coffee shop
  2. App developers trying to deliver the best music experience
  3. Music copyright holders trying to protect use of their music
  4. Radio airplay monitoring for royalty distribution based on music usage
  5. Acoustic querying when a user can only remember part of the tune of the song
  6. Karaoke applications, where query by humming lets users hum a small part of a melody to find the desired song in a large song database
  7. Scoring singing, where query by humming generates a score based on the similarity between a user’s singing and the original artist’s; this score can then be used to rate a cappella performances on vocal and melody similarity



