Bach Doodle Dataset

Exploring the Bach Doodle

Over 8 million players contributed 21.6 million harmonizations while playing with the Bach Google Doodle. These harmonizations form a unique dataset that can offer insight into how people around the world composed melodies, help developers train new machine learning algorithms, and help artists create musical experiences. That’s why we’re open-sourcing the dataset.

A burst of compositions

Not every melody entered was unique! A lot of you entered your favourite pop songs, or other Bach pieces, to see how the model would harmonize them. These repeated sequences rose to the top of the most frequently entered melodies, and you can visualize them here. We've also split them by the country they came from, to see how compositions changed around the world!

The technical details on how we clustered the data

We noticed early on that people who entered the same melody (like "Ode to Joy") sometimes did so in a different key, or with different note lengths. To get around this, we aggregated the top 2,000 melodies by the "shape" of each composition (the number of semitones between consecutive notes) rather than by its absolute pitches, so that every "Ode to Joy" with the same shape was counted as the same melody.
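To make the idea of a melody's "shape" concrete, here is a minimal Python sketch. It is an illustration, not the code used to build the visualizations: it assumes each melody is a list of MIDI note numbers, ignores note lengths entirely, and the names `melody_shape` and `top_shapes` are hypothetical.

```python
from collections import Counter

def melody_shape(pitches):
    """Return the semitone steps between consecutive MIDI pitches."""
    return tuple(b - a for a, b in zip(pitches, pitches[1:]))

def top_shapes(melodies, n=2000):
    """Group melodies by shape and return the n most common (shape, count) pairs."""
    return Counter(melody_shape(m) for m in melodies).most_common(n)

# Opening of "Ode to Joy" in C major and in D major, as MIDI note numbers.
ode_in_c = [64, 64, 65, 67, 67, 65, 64, 62]
ode_in_d = [66, 66, 67, 69, 69, 67, 66, 64]

# Both transpositions collapse to the same shape, so they are counted together.
same_shape = melody_shape(ode_in_c) == melody_shape(ode_in_d)
```

Because only the differences between neighbouring notes are kept, transposing a melody to another key leaves its shape unchanged.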

As a result, you will notice that all the melodies in the visualizations start on a C, even though the original melodies almost certainly didn't. This is purely to simplify the visualizations, and it also allows us to explore the internal structure of melodies (like starting notes, common melody prefixes, etc.). Unfortunately, this means that some of the in-app harmonizations sound worse than they did originally in the doodle, since shifting a melody to start on middle C can push some of its notes outside the range that the doodle input allowed.
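The effect of restarting every melody on C can be sketched in a few lines of Python. This is only an illustration under stated assumptions: melodies are MIDI note numbers, middle C is MIDI note 60, and the `low`/`high` bounds in the example are hypothetical placeholders, since the doodle's actual input range is not given here.

```python
MIDDLE_C = 60  # MIDI note number for middle C

def melody_from_shape(shape, start=MIDDLE_C):
    """Rebuild absolute pitches from a shape, beginning on the given start note."""
    pitches = [start]
    for step in shape:
        pitches.append(pitches[-1] + step)
    return pitches

def notes_out_of_range(pitches, low, high):
    """Return the pitches that fall outside the inclusive range [low, high]."""
    return [p for p in pitches if not low <= p <= high]

# The "Ode to Joy" opening, expressed as semitone steps between notes.
ode_shape = (0, 1, 2, 0, -2, -1, -2)
rebuilt = melody_from_shape(ode_shape)

# Hypothetical input range (NOT the doodle's real limits), for illustration only.
too_low_or_high = notes_out_of_range(rebuilt, low=60, high=79)
```

With these placeholder bounds, the last note of the rebuilt melody dips below the lower limit, which is the kind of range problem described above.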

What's in the actual dataset?

Of the more than 50 million requests to the doodle itself, the user-contributed dataset contains over 21.6 million miniature compositions, adding up to about 6 years of audio. The compositions are split across 8.5 million sessions; each session represents an anonymous user's interaction with the web app and may contain multiple data points, one for each time the user pressed the "Harmonize" button. Explore 20 random sessions to see what the actual data looks like.