Onsets and Frames

Onsets and Frames is a deep neural network that transcribes audio of piano performances to NoteSequences/MIDI.

From a local audio file

You can use your own piano music file (i.e., actual audio, not MIDI) for transcription:
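A minimal sketch of how a file-based transcription might be wired up with `@magenta/music` in the browser. The checkpoint URL, `transcribeFromAudioFile`, and the global `mm` namespace are assumptions of this sketch, not guarantees about this demo's exact code:

```javascript
// Sketch: transcribe a user-selected audio file with Onsets and Frames.
// Assumes @magenta/music is loaded as `mm`; the checkpoint URL below is
// the public piano-transcription checkpoint (an assumption of this sketch).
const CHECKPOINT =
    'https://storage.googleapis.com/magentadata/js/checkpoints/transcription/onsets_frames_uni';

// Reject obvious non-audio selections (including MIDI files, whose MIME
// type also starts with "audio/") before handing the file to the model.
function isAudioFile(file) {
  return file.type.startsWith('audio/') && !file.type.includes('midi');
}

async function transcribeFile(file) {
  if (!isAudioFile(file)) {
    throw new Error('Please choose an audio file (not MIDI).');
  }
  const model = new mm.OnsetsAndFrames(CHECKPOINT);
  await model.initialize();
  const ns = await model.transcribeFromAudioFile(file);  // a NoteSequence
  model.dispose();  // release the GPU memory held by the model's weights
  return ns;
}
```

Disposing the model after use is what keeps the "Total Leaked Memory" figure reported below at zero.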


Actual Transcription:

It Took:

Total Leaked Memory:

From a microphone recording

You can record piano audio from a microphone and transcribe it. If you record something other than piano (like your voice), it will still be transcribed, but the result will probably be noisy and incorrect.
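One way to capture a microphone recording as a blob the model can consume is via `getUserMedia` and `MediaRecorder`. This is a sketch of that flow, not this demo's exact code; the `model` parameter is assumed to be an already-initialized `OnsetsAndFrames` instance:

```javascript
// Sketch: record from the microphone for a few seconds, then transcribe
// the resulting audio blob with an already-initialized model.

// Pick the first recording MIME type the browser supports. The support
// check is injected as a predicate so the choice is testable outside a
// browser (in the browser, pass MediaRecorder.isTypeSupported).
function pickMimeType(candidates, isSupported) {
  return candidates.find(isSupported) || '';
}

async function recordAndTranscribe(model, seconds) {
  const stream = await navigator.mediaDevices.getUserMedia({audio: true});
  const mimeType = pickMimeType(
      ['audio/webm;codecs=opus', 'audio/webm', 'audio/ogg'],
      (t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, {mimeType});
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  // Resolve with a single blob once recording stops.
  const done = new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, {type: mimeType}));
  });
  recorder.start();
  setTimeout(() => recorder.stop(), seconds * 1000);

  const blob = await done;
  stream.getTracks().forEach((t) => t.stop());  // release the microphone
  return model.transcribeFromAudioFile(blob);
}
```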


Actual Transcription:

It Took:

Total Leaked Memory:

From test audio (250 frames / 8 seconds)

We verify that the model can transcribe a short sequence of piano audio, first computing its mel spectrogram.
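The "250 frames / 8 seconds" figure follows from the mel-spectrogram hop size. The constants below (16 kHz sample rate, 512-sample hop) match the standard Onsets and Frames input settings; treat them as assumptions of this sketch:

```javascript
// Sketch: the frame-to-seconds arithmetic behind "250 frames / 8 seconds".
// Each spectrogram frame advances by HOP_LENGTH samples of 16 kHz audio,
// so 250 frames cover 250 * 512 / 16000 = 8 seconds.
const SAMPLE_RATE = 16000;  // Hz (assumed model input rate)
const HOP_LENGTH = 512;     // samples per spectrogram frame (assumed)

function framesToSeconds(numFrames) {
  return numFrames * HOP_LENGTH / SAMPLE_RATE;
}
```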

Original Audio

Expected Transcription:

Actual Transcription:

Match:

It Took:

Total Leaked Memory:

From Mel Spectrogram (250 frames / 8 seconds)

Below we verify that the model transcribes a mel spectrogram to match the output of the Python TensorFlow implementation. For longer inputs, the convolution must be processed in batches to avoid the GPU timeout that Chrome enforces, so we verify that transcription works correctly with several chunk lengths and batch sizes that exercise different edge cases.
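The chunk/batch combinations tested below can be sketched with a small helper that splits the input frames into fixed-length chunks and groups them into batches. This is an illustration of the batching idea, not the demo's actual implementation (which also has to handle the convolution's receptive-field padding at chunk boundaries):

```javascript
// Sketch: split `numFrames` spectrogram frames into chunks of at most
// `chunkLength` frames, then group chunks into batches of `batchSize`.
// Each batch is convolved in one GPU call, keeping every dispatch short
// enough to stay under Chrome's GPU watchdog timeout.
function chunkFrames(numFrames, chunkLength, batchSize) {
  const chunks = [];
  for (let start = 0; start < numFrames; start += chunkLength) {
    chunks.push({start, length: Math.min(chunkLength, numFrames - start)});
  }
  const batches = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    batches.push(chunks.slice(i, i + batchSize));
  }
  return batches;
}
```

For example, 250 frames with chunk length 150 yield chunks of 150 and 100 frames in a single batch of two, while chunk length 80 yields four chunks (the last only 10 frames long), exercising the ragged-final-chunk edge case.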

Original Audio

Expected Transcription:

Chunk Length 250 / Batch Size 1

Actual Transcription:

Match:

It Took:

Chunk Length 150 / Batch Size 2

Actual Transcription:

Match:

It Took:

Chunk Length 80 / Batch Size 4

Actual Transcription:

Match:

It Took:

Chunk Length 62 / Batch Size 4

Actual Transcription:

Match:

It Took:

Chunk Length 50 / Batch Size 5

Actual Transcription:

Match:

It Took:

Total Leaked Memory: