Sadboi
Background & Motivation
Sadboi was a collaborative project between Marjorie (Marj) Cuerdo and me. We are both interested in how emotions affect people’s behavior and cognition, so for the final project in CM 202 - Computational Media Research, we decided to create a game that requires the player to express the right emotion at the right time.
With this game, we wanted to explore whether incorporating a player's emotions can enrich and/or better facilitate the game experience. Our system uses a neural network (NN) to detect the dominant emotion in a player’s voice. This emotion label then serves as input to the game. Essentially, we created an emotional voice game controller.
In this project, I trained the NN to accurately detect emotions from a sound file and created an API that could be called from the game side. In the following sections, I will describe the overall system and the steps I took to train the NN and develop the API.
The full project, including a formal write-up and links to all created, publicly accessible Colab notebooks, can be found on its GitHub page.
Tools: Python, Google Colab, Unity Game Engine
System Description
Since our main focus was on the technical aspects of connecting the Unity project to our NN model on the backend, the game mechanics and narrative are kept simple. The game is a 2D side-scroller that leads the player through an emotionally taxing journey in the game character's life. The game progression is inspired by the five stages of grief (denial, anger, bargaining, depression, and acceptance) and follows the player character’s road to realizing they need to start going to therapy and caring for their mental health.
At certain points throughout the game, the player encounters mental blocks, which are represented by physical blocks in the game. To get past those mental blocks, the player needs to respond with the appropriate emotion in their voice. For example, when the character gets fired from work without warning, the player’s “correct” response would be to sound surprised (e.g. “What? Why now, all of a sudden?!”). If the correct emotion is detected, the barrier is removed, and the player progresses.
To record a response, the player clicks the “Record” button in the upper left corner of the screen, speaks their response into their microphone, and presses the “Send” button to end the recording. We decided on this simple approach due to the difficulty of implementing an "always on" audio listener in a short timeframe. The UX implications of this decision are discussed below.
After recording, a .wav file is created on the client’s side, sent to the Emotion Prediction API, and processed through our emotion detection model. The API returns the label of the dominant emotion to Unity, where the result is evaluated against the correct response and displayed to the player. As this project is a proof of concept and not intended to be played in the wild, we did not implement error handling for misclassifications for this version.
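The contract between game and API is deliberately simple: one audio file in, one emotion label out. The sketch below illustrates it in Python with the requests library for readability (the actual client is Unity's web request code); the endpoint URL and form-field name are placeholders.

```python
import requests

# Hypothetical endpoint of the Colab-hosted Emotion Prediction API.
API_URL = "https://<colab-tunnel-url>/predict"

def classify_recording(wav_path: str) -> str:
    """Upload a recorded .wav file and return the predicted emotion label."""
    with open(wav_path, "rb") as f:
        response = requests.post(API_URL, files={"audio": f})
    response.raise_for_status()
    return response.text.strip()  # e.g. "surprised"

# The game compares the returned label against the expected emotion for the
# current mental block and removes the barrier on a match.
```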
Training the Neural Network
In order to predict emotion from a voice sample, we first had to create a computational model that would map the sample’s auditory characteristics to a certain emotion label. We decided to train a neural network for this task. Other machine learning techniques (e.g. Random Forest classifiers) tend to require less computational processing power and excel at making predictions from small, structured datasets, which makes them great for building general models. Our task, however, focused on a very specific domain with large amounts of unstructured audio data available for training, so a NN seemed like an appropriate architecture.
Fortunately, there is a large, publicly available dataset of voice clips with emotion labels. We used the RAVDESS database, which contains 1440 voice clips from 24 actors saying "Kids are talking by the door" or "Dogs are sitting by the door" with eight different emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
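RAVDESS encodes each clip's metadata in its filename as seven hyphen-separated numeric fields, the third of which is the emotion code. A minimal sketch of extracting labels this way (field positions and codes follow the dataset's published naming convention; the directory path is illustrative):

```python
from pathlib import Path

# RAVDESS filenames look like "03-01-06-01-02-01-12.wav"; the third field
# encodes the emotion.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_label(wav_path: Path) -> str:
    """Map a RAVDESS filename to its emotion label."""
    return EMOTIONS[wav_path.stem.split("-")[2]]

# Build a (file, label) list for the whole dataset.
dataset_dir = Path("ravdess")  # illustrative path
samples = [(p, emotion_label(p)) for p in dataset_dir.glob("**/*.wav")]
```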
Those audio files had to be converted into a format that could be used as input to the NN. The librosa library facilitates audio analysis, feature extraction, and visualization. To make the audio data more interpretable, we first converted our files into spectrograms (see example on right), which visualize the frequency content of each sound sample over time. Those images served as input to the NN.
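A minimal sketch of that conversion step, assuming mel spectrograms rendered with librosa and matplotlib (the exact spectrogram parameters we used may have differed):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def wav_to_spectrogram(wav_path: str, out_path: str) -> None:
    """Render a .wav file as a spectrogram image the NN can consume."""
    y, sr = librosa.load(wav_path)                    # waveform and sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr)  # frequency content over time
    mel_db = librosa.power_to_db(mel, ref=np.max)     # power -> decibel scale
    fig, ax = plt.subplots(figsize=(4, 4))
    librosa.display.specshow(mel_db, sr=sr, ax=ax)    # draw the spectrogram
    ax.axis("off")                                    # image only, no axes
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```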
We went through multiple training cycles using fastai’s fit_one_cycle() function. Each cycle had five epochs, and we adjusted the learning rate between cycles by reading and interpreting the learning rate graph (see example on left). The graph shows the loss, i.e. the difference between a model’s expected output and its actual output, as a function of the chosen learning rate. The learning rate, in turn, determines how much the model’s parameter weights change in response to the loss gradient.
To optimize the model, the parameter weights should be set to minimize the loss. Thus, when choosing the learning rate, we look for the region where the loss curve descends most steeply. In the example figure on the left, this would be approximately 10⁻³. After five cycles, our model achieved an accuracy of 0.809, against a chance baseline of 0.125 for correctly identifying the dominant one of the eight emotions in a speech sample.
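In fastai terms, the training loop looked roughly like the sketch below. The architecture (resnet34), image size, validation split, and folder-based labeling are assumptions for illustration; fastai v2 names are used (older versions call vision_learner cnn_learner).

```python
from fastai.vision.all import *

# Spectrogram images organized into one folder per emotion label
# (an assumption -- labels could equally be parsed from the RAVDESS filenames).
dls = ImageDataLoaders.from_folder(
    "spectrograms", valid_pct=0.2, item_tfms=Resize(224)
)

learn = vision_learner(dls, resnet34, metrics=accuracy)

# Plot loss vs. learning rate and pick a value on the steep downward slope.
learn.lr_find()

# One training "cycle" of five epochs at the chosen learning rate; we repeated
# this, re-reading the learning rate graph between cycles.
learn.fit_one_cycle(5, lr_max=1e-3)
```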
Creating the API
Once our emotion recognition model had achieved satisfactory accuracy, we had to figure out how to use this model in our game. At the time of this project, Unity did not support importing external machine learning models, but it did support web API calls from within the game. So, I created an API to run on Google Colab that would receive audio files from Unity, classify their emotional content, and return the emotion label as a text string.
To make the trained model easily and publicly accessible for the API, I uploaded it to archive.org (link to download). In a Colab notebook, I created an API that converts the received .wav file into a spectrogram and feeds this image into the emotion detection model. The predict() function provided by fastai returns the predicted label along with the probabilities of all possible classification categories (i.e. the eight emotions neutral, calm, happy, sad, angry, fearful, disgust, and surprised), so the API simply returns the string value of the top prediction to Unity. For this proof of concept, we did not implement special handling of cases where two or more emotions are ranked very closely.
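As a rough sketch, a minimal version of that endpoint could look like the following. Flask is an assumption about the exact serving stack (any small web framework that Colab can run would do), emotion_model.pkl is a placeholder name for the exported model, and wav_to_spectrogram is the conversion helper sketched in the training section.

```python
import tempfile
from flask import Flask, request
from fastai.vision.all import load_learner, PILImage

app = Flask(__name__)
learn = load_learner("emotion_model.pkl")  # exported fastai model (placeholder name)

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded .wav, render it as a spectrogram, and classify it.
    with tempfile.NamedTemporaryFile(suffix=".wav") as wav, \
         tempfile.NamedTemporaryFile(suffix=".png") as img:
        request.files["audio"].save(wav.name)
        wav_to_spectrogram(wav.name, img.name)  # helper from the training step
        pred, _, probs = learn.predict(PILImage.create(img.name))
        return str(pred)  # e.g. "surprised"

app.run(port=5000)  # in Colab, exposed to the game through a tunnel (e.g. ngrok)
```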
One major caveat of this method is that the Colab notebook with the API needs to run in the background whenever someone plays the game. Otherwise, the API is not live and there is no endpoint to send the player recording to. However, given this project's time and resource restrictions, serving the API from a manually run Colab notebook was a good compromise between functionality and implementation effort.
Reflection & Post-Mortem
This project allowed Marj and me to apply our game-making and web development skills while improving our applied machine-learning skills. While I was primarily responsible for the machine learning and API components, Marj handled the Unity project and API integration.
In the end, we were able to create a proof-of-concept prototype demonstrating that emotion-based voice input for game interaction is possible.
To further expand this project, we would like to improve our model’s accuracy so the game can better respond to the player’s intended input. We would also like to find a better option than locally hosting an API in order to make the game playable, possibly by using an online hosting service. Due to time constraints, we did not explore this option when we worked on the project.
In addition, the user experience of the game could be much improved by implementing error-handling routines and an open audio loop that continuously listens for input. Similar to Google Assistant or Siri, a wake word could signal to the app that user input is beginning. While basic error handling would be relatively easy to put in place, developing an application with an open audio loop is a bigger technical challenge. Tackling this challenge would eliminate the need for the separate “Record” and “Send” buttons the player has to press in the current version.
Lastly, once the backend data flow is improved, we could develop more complex game mechanics that involve the player’s emotional state and assess their impact on the player’s mental state. For example, we could study whether such game mechanics are able to affect a player's emotion regulation skills. Overall, we hope that the future of Sadboi can contribute to research on affect-adaptive gaming and other applications of emotion detection in games.