Building a Voice-Powered Song Search

Love it or hate it, Christmas is a season for music, and with that comes the frustrating scenario of knowing the lyrics but not quite knowing the song. Of course, you could just search the lyrics, but where's the fun in that? In this project, we will warm up our vocal cords and use Deepgram and the Genius Song Lyrics API to build a website that tries to guess the song from spoken or sung lyrics.

While doing this, we'll learn how to stream microphone data to Deepgram via a server, so you don't need to worry about exposing your API Key.

This is what we'll be building:

The green area is one set of steps that gets us to the point of transcripts. The blue area covers searching for and displaying songs. Don't worry if that looks like a lot - we'll take it step by step. If you want to look at the final project code, you can find it at https://github.com/deepgram-devs/song-search.

Before We Start

You will need:

Node.js installed on your machine - download it here.
A Deepgram API Key - get one here.
A Genius API Access Token - get one here.

Create a new directory and navigate to it with your terminal. Run npm init -y to create a package.json file and then install the following packages:

npm install dotenv @deepgram/sdk express socket.io axios

Create a .env file and add the following:

DG_KEY=replace_with_deepgram_api_key
GENIUS_TOKEN=replace_with_genius_access_token

Create an index.js file, a folder called public, and inside of the public folder create an index.html file. In index.html create a boilerplate HTML file:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
  </head>
  <body>
    <!-- Further code goes here -->
  </body>
</html>

Establish a Socket Connection

The socket.io library can establish a two-way connection between our server (index.js) and client (index.html). Once connected, we can push data between the two in real-time. We will use this to send data from the user's microphone to our server to be processed by Deepgram and show results from the server logic.

In the index.html <body> tag:

<script src="/socket.io/socket.io.js"></script>
<script>
  const socket = io()
  // Further code goes here
</script>

In index.js create a combined express and socket.io server and listen for connections:

// Require
const express = require('express')
const app = express()
const http = require('http').createServer(app)
const io = require('socket.io')(http)

// Configure
app.use(express.static('public'))

// Logic
io.on('connection', (socket) => {
  console.log(`Connected at ${new Date().toISOString()}`)
})

// Run
http.listen(3000, () => console.log(`Started at ${new Date().toISOString()}`))

For this tutorial, leave the comments in, as I refer to these sections by name later. Start the server in your terminal by navigating to the directory and running node index.js. Open your browser to http://localhost:3000, and you should see 'Connected at date' in your terminal. Once this connection is established, we can send and listen for events on both the server and the client.

Access and Send Audio

In a blog post last month we covered how to access and retrieve data from a user's mic in a web browser. Each of the steps is covered there, so we'll lift the examples from it without a deep explanation. In the <script> tag in index.html:

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const mediaRecorder = new MediaRecorder(stream)

  mediaRecorder.addEventListener('dataavailable', (event) => {
    if (event.data.size > 0) {
      socket.emit('microphone-stream', event.data)
    }
  })
  mediaRecorder.start(1000)
})

This will immediately ask for access to the microphone and begin capturing data once permitted. When emitting events with socket.io, we can choose an event name, which we can then listen for on the server. Here, we have called it microphone-stream and send it along with the raw mic data.

Listening for Events

In index.js, inside the connection handler and below the console.log() statement:

socket.on('microphone-stream', (data) => {
  console.log('microphone-stream event')
})

Restart your server and then refresh your web page. Once you grant access to your microphone, you should see a steady stream of logs indicating that data is sent from your browser to the server. You may stop your server while we continue with the next step.

Setting Up Deepgram

At the top of the Require section in index.js add dotenv which will allow access to the .env file values.

require('dotenv').config()

At the bottom of the Require section require the Deepgram Node.js SDK which we installed earlier:

const { Deepgram } = require('@deepgram/sdk')

Finally, in the Configure section, initialize the SDK and create a new live transcription service:

const deepgram = new Deepgram(process.env.DG_KEY)
const deepgramLive = deepgram.transcription.live({ utterances: true })
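If the .env file is missing or a key is misspelled, process.env.DG_KEY will be undefined and the failure may only show up later as a confusing authentication error. As a minimal sketch, a small guard can fail fast at startup. The requireEnv name and its second parameter are hypothetical helpers for illustration, not part of dotenv or the Deepgram SDK.

```javascript
// Hypothetical helper: read a required environment variable or fail loudly.
// The `env` parameter defaults to process.env and exists only to make the
// function easy to try out with a plain object.
function requireEnv(name, env = process.env) {
  const value = env[name]
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing environment variable ${name}; check your .env file`)
  }
  return value
}

// Example with a stand-in object instead of the real environment
console.log(requireEnv('DG_KEY', { DG_KEY: 'replace_with_deepgram_api_key' }))
```

With something like this in place, the SDK could be initialized with requireEnv('DG_KEY') instead of reading process.env.DG_KEY directly.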

Getting Live Deepgram Transcripts

Inside the microphone-stream event handler, comment out the console.log(). In its place, take the provided data and send it directly to Deepgram:

socket.on('microphone-stream', (data) => {
  // console.log('microphone-stream event')
  deepgramLive.send(data)
})

// Further code goes here

deepgramLive emits an event when Deepgram has a transcript ready. As in the browser live transcription blog post, we will wait for the final transcript for each of our utterances (phrases).

let transcript = ''
deepgramLive.addListener('transcriptReceived', (data) => {
  const result = JSON.parse(data)
  const utterance = result.channel.alternatives[0].transcript
  if (result.is_final && utterance) {
    transcript += ' ' + utterance
    console.log(transcript)
  }
})

Restart your server, refresh your browser, and speak into your microphone. You should see a transcript appear in your terminal.
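The filtering in that handler can also be pulled out into a pure function, which makes it easier to see which messages are kept. This is only a sketch: appendFinalUtterance is a hypothetical name, and the message shape (is_final and channel.alternatives[0].transcript) is assumed to match what the handler above reads.

```javascript
// Hypothetical helper: returns the transcript extended with the message's
// text, but only when the message is marked final and contains words.
function appendFinalUtterance(transcript, rawMessage) {
  let result
  try {
    result = JSON.parse(rawMessage)
  } catch (err) {
    return transcript // ignore anything that is not valid JSON
  }
  const utterance = result?.channel?.alternatives?.[0]?.transcript
  if (!result?.is_final || !utterance) return transcript
  // Avoid a leading space when this is the first utterance
  return transcript ? `${transcript} ${utterance}` : utterance
}

// Example messages shaped like the ones the handler above parses
const interim = JSON.stringify({
  is_final: false,
  channel: { alternatives: [{ transcript: 'jingle' }] },
})
const final = JSON.stringify({
  is_final: true,
  channel: { alternatives: [{ transcript: 'jingle bells' }] },
})

let demo = ''
demo = appendFinalUtterance(demo, interim) // skipped: not final
demo = appendFinalUtterance(demo, final) // kept
console.log(demo)
```

Interim results are skipped because Deepgram revises them as more audio arrives; appending them would duplicate words in the search query.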

Triggering Song Search

Because a set of lyrics can take up multiple utterances, we need to have a way to indicate that we are finished and the search should take place. We will attach an event listener to a button that, when pressed, will emit an event.

In index.html add a <button> at the top of your <body> tag:

<button>Search Song</button>

Just below mediaRecorder.start(1000) add the following logic:

const button = document.querySelector('button')
button.addEventListener('click', () => {
  button.remove()
  mediaRecorder.stop()
  socket.emit('search')
})

When the button is pressed, three things happen: it is removed from the DOM, so we can only click it once; the mediaRecorder is stopped (which also stops the microphone-stream events); and a new event called search is emitted.

In index.js add a new socket event listener just after the block for microphone-stream is closed:

socket.on('search', async () => {
  console.log('search event', transcript)
  // Further code here
})

Restart your server and refresh the browser. Speak a few phrases and click the button. You should see the search event take place with the final transcript logged.
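Note that the transcript variable is never cleared, so a second search in the same session would still contain the first song's lyrics. One possible way around that, sketched below, is a small store that hands over the text and resets itself in one step. createTranscriptStore, add, and take are hypothetical names, not part of socket.io or the Deepgram SDK.

```javascript
// Hypothetical helper: collects utterances and clears itself when read.
function createTranscriptStore() {
  let parts = []
  return {
    // Store one utterance, ignoring empty or whitespace-only text
    add(utterance) {
      const text = String(utterance).trim()
      if (text) parts.push(text)
    },
    // Return everything collected so far and start fresh
    take() {
      const full = parts.join(' ')
      parts = []
      return full
    },
  }
}

const store = createTranscriptStore()
store.add('last christmas')
store.add('I gave you my heart')
console.log(store.take()) // the full phrase for the first search
console.log(store.take() === '') // the store is empty again
```

In the search handler, calling take() would give the query for the current song while leaving the store ready for the next one.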

Searching for Songs

We will use the Genius API to search for songs based on lyrics. To make this API call, we'll use the axios package. In the Require section of our index.js file, add the package:

const axios = require('axios')

And make the API call when the search event is received:

const { data } = await axios({
  method: 'GET',
  // Encode the transcript so spaces and punctuation are safe in the URL
  url: `https://api.genius.com/search?q=${encodeURIComponent(transcript.trim())}`,
  headers: {
    Authorization: `Bearer ${process.env.GENIUS_TOKEN}`,
  },
})
const topThree = data.response.hits.slice(0, 3)
console.log(topThree)

// Further code here

Restart your server and refresh your browser.

Yay!
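Spoken lyrics contain spaces and sometimes characters such as & or ?, so the query needs to be URL-encoded before it reaches Genius. Here is a small sketch of building the URL with Node's built-in URL class and trimming the response down to the fields this project displays. buildGeniusSearchUrl, pickTopSongs, and the limit parameter are hypothetical names, and the response shape (response.hits, result.full_title, result.song_art_image_url) is assumed from the fields used elsewhere in this post.

```javascript
// Hypothetical helper: build the search URL; searchParams handles encoding.
function buildGeniusSearchUrl(lyrics) {
  const url = new URL('https://api.genius.com/search')
  url.searchParams.set('q', lyrics.trim())
  return url.toString()
}

// Hypothetical helper: keep only the first few hits and the fields we show.
function pickTopSongs(data, limit = 3) {
  const hits = data?.response?.hits ?? []
  return hits.slice(0, limit).map((hit) => ({
    title: hit.result.full_title,
    art: hit.result.song_art_image_url,
  }))
}

console.log(buildGeniusSearchUrl(' last christmas & you '))

// A stand-in response with made-up values, shaped like the fields above
const sample = {
  response: {
    hits: [
      { result: { full_title: 'Song A by Artist A', song_art_image_url: 'https://example.com/a.jpg' } },
      { result: { full_title: 'Song B by Artist B', song_art_image_url: 'https://example.com/b.jpg' } },
    ],
  },
}
console.log(pickTopSongs(sample))
```

Defaulting to an empty array means a response without hits yields an empty list rather than throwing.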

Displaying Results

The final step is to show the output to the user by emitting an event from the server back to the client. Doing this is nearly identical to the other direction. In index.js:

socket.emit('results', topThree)

In index.html add an empty <ul> under the <button>:

<ul></ul>

At the bottom of the <script> tag, below all other code, listen for the results event and add items to the new list:

socket.on('results', (data) => {
  const ul = document.querySelector('ul')
  for (let song of data) {
    const li = `
    <li>
      <img src="${song.result.song_art_image_url}">
      <p>${song.result.full_title}</p>
    </li>
  `
    ul.innerHTML += li
  }
})
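Because the list items are built as HTML strings from data returned by an external API, it is worth escaping the values before assigning them to innerHTML. The following is a minimal sketch rather than a complete sanitizer; escapeHtml and renderSong are hypothetical helper names, and the song fields are the same ones used in the listener above.

```javascript
// Hypothetical helper: escape the characters that matter in HTML text
// and in double-quoted attribute values.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
}

// Hypothetical helper: build one list item from a Genius search hit.
function renderSong(song) {
  const { song_art_image_url: art, full_title: title } = song.result
  return `<li><img src="${escapeHtml(art)}" alt=""><p>${escapeHtml(title)}</p></li>`
}

// Made-up hit for illustration
const hit = {
  result: {
    song_art_image_url: 'https://example.com/art.jpg',
    full_title: 'Rock & Roll <Live>',
  },
}
console.log(renderSong(hit))
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.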

Before trying it out, add this minimal styling inside your <head> tag:

<style>
  ul {
    display: grid;
    grid-template-columns: 1fr 1fr 1fr;
    grid-gap: 4em;
    list-style: none;
  }
  img {
    width: 100%;
  }
</style>

Restart your server, refresh your browser, and try it out! You can display any of the information provided by Genius.

No one ever said I was a good singer.

Wrapping Up

There are quite a lot of improvements you could make here:

Show utterances to users in the browser

Do searches as soon as utterances are available, and update them as more words are said

Allow multiple songs without needing to 'reset' by refreshing

Give it a festive theme

This post has also introduced you to the code required to stream your microphone from the browser to Deepgram via a server, thus protecting your API Key from being exposed.

We'll have some more posts coming out before Christmas, but from me, this is it until January, so please have a wonderful festive period and a wonderful new year. The complete project is available at https://github.com/deepgram-devs/song-search, and if you have any questions, please feel free to reach out on Twitter - we're @DeepgramDevs.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
