class: center, middle, inverse, title-slide

# Accuracy of Automatic Emotion Recognition from Voice

## International Society for Research on Emotion

### Damien Dupré & Gary McKeown

### Amsterdam, July 11th 2019

---

# Automatic Emotion Recognition

Since the 1970s, automatic systems have been developed to recognize individuals' emotions. With the advancement of machine learning techniques, multiple companies now provide solutions for automatic emotion recognition in diverse applications such as marketing, automotive systems, and activity and health-care monitoring.

Automatic emotion recognition systems can use:

- physiological sensors and brain activity measurements
- textual expression for sentiment analysis
- visual capture of facial expressions and postures
- audio capture of vocal expressions

---

# Automatic Voice Recognition Systems

Voice is one of the most important channels for communicating emotions, not only through the words used but also through the tonality with which these words are pronounced.

* First academic paper presenting an automatic system in 1996 (Dellaert, Polzin & Waibel, 1996)
* To date, 18 public repositories on GitHub https://github.com/topics/speech-emotion-recognition

Several companies provide emotion recognition systems for voice:

* AudEERING
* Affectiva
* Neurodata Labs
* Amazon Alexa (in development)
* Cogito
* ...

---

class: inverse, center, middle

# Method

---

# Database

To evaluate the accuracy of recognising emotions from voice tonality, we extracted the audio track from the videos of the GEMEP-CS database (Bänziger & Scherer, 2010; Bänziger, Mortillaro & Scherer, 2012).
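As a hypothetical sketch of this extraction step (assuming `ffmpeg` is installed; the directory layout, file names, and `.avi` extension are illustrative, not the actual GEMEP-CS files):

```python
# Hypothetical sketch: strip the audio track from each video with ffmpeg.
# File names and formats are illustrative, not the real GEMEP-CS layout.
import subprocess
from pathlib import Path

def ffmpeg_extract_cmd(video: str, wav: str) -> list[str]:
    """ffmpeg command extracting the audio track as 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", video,
            "-vn",                       # drop the video stream
            "-ac", "1", "-ar", "16000",  # mono, 16 kHz sample rate
            wav]

def extract_all(video_dir: str, out_dir: str) -> None:
    """Run the extraction for every video in a directory."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.avi")):
        wav = str(Path(out_dir) / (video.stem + ".wav"))
        subprocess.run(ffmpeg_extract_cmd(str(video), wav), check=True)
```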
The Geneva Multimodal Emotion Portrayals Core Set (GEMEP-CS) database is made of video recordings of:

- 10 professional French-speaking theater actors (5 females and 5 males, *M*<sub>age</sub> = 37.1)
- they had to enact up to 18 emotion categories (facial expression, posture and voice)
- for each enactment they had to pronounce a pseudosentence that sounds like an unknown real language, consisting of meaningless words built from phonemes of several languages

<img src="https://www.researchgate.net/profile/Marc_Mehu/publication/228071481/figure/fig1/AS:302181455548416@1449057081082/Example-of-the-GEMEP-FERA-data-set-One-of-the-actors-displaying-an-expression-associated.png" width="30%" style="display: block; margin: auto;" />

---

# Why GEMEP?

* words and sentences of a real language can be emotionally loaded
* the way words are pronounced can be biased by their emotional meaning
* pseudosentences aim to remove this potential bias

---

# Emotion Categories

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Key </th> <th style="text-align:left;"> Emotion </th> <th style="text-align:left;"> Valence </th> <th style="text-align:left;"> Arousal </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Amu </td> <td style="text-align:left;"> Amusement </td> <td style="text-align:left;"> + </td> <td style="text-align:left;"> + </td> </tr> <tr> <td style="text-align:left;"> Pri </td> <td style="text-align:left;"> Pride </td> <td style="text-align:left;"> + </td> <td style="text-align:left;"> + </td> </tr> <tr> <td style="text-align:left;"> Joy </td> <td style="text-align:left;"> Elated Joy </td> <td style="text-align:left;"> + </td> <td style="text-align:left;"> + </td> </tr> <tr> <td style="text-align:left;"> Rel </td> <td style="text-align:left;"> Relief </td> <td style="text-align:left;"> + </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> Int </td> <td 
style="text-align:left;"> Interest </td> <td style="text-align:left;"> + </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> Ple </td> <td style="text-align:left;"> Pleasure </td> <td style="text-align:left;"> + </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> Ang </td> <td style="text-align:left;"> Hot anger (rage) </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> + </td> </tr> <tr> <td style="text-align:left;"> Fea </td> <td style="text-align:left;"> (Panic) Fear </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> + </td> </tr> <tr> <td style="text-align:left;"> Des </td> <td style="text-align:left;"> Despair </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> + </td> </tr> <tr> <td style="text-align:left;"> Irr </td> <td style="text-align:left;"> Irritation (cold anger) </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> Anx </td> <td style="text-align:left;"> Anxiety </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> Sad </td> <td style="text-align:left;"> Sadness </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> Adm </td> <td style="text-align:left;"> Admiration </td> <td style="text-align:left;"> Additional Emotion </td> <td style="text-align:left;"> Additional Emotion </td> </tr> <tr> <td style="text-align:left;"> Amu </td> <td style="text-align:left;"> Amusement </td> <td style="text-align:left;"> Additional Emotion </td> <td style="text-align:left;"> Additional Emotion </td> </tr> <tr> <td style="text-align:left;"> Ten </td> <td style="text-align:left;"> Tenderness </td> <td style="text-align:left;"> Additional Emotion </td> <td style="text-align:left;"> Additional Emotion </td> </tr> <tr> <td 
style="text-align:left;"> Dis </td> <td style="text-align:left;"> Disgust </td> <td style="text-align:left;"> Additional Emotion </td> <td style="text-align:left;"> Additional Emotion </td> </tr> <tr> <td style="text-align:left;"> Con </td> <td style="text-align:left;"> Contempt </td> <td style="text-align:left;"> Additional Emotion </td> <td style="text-align:left;"> Additional Emotion </td> </tr> <tr> <td style="text-align:left;"> Sur </td> <td style="text-align:left;"> Surprise </td> <td style="text-align:left;"> Additional Emotion </td> <td style="text-align:left;"> Additional Emotion </td> </tr> </tbody> </table>

---

# Let's Play a Game!

<html> <audio controls> <source src="./sample/01amu_Gemep.mp3" type="audio/mpeg"/> </audio></html>

--

**Amusement!**

--

<audio controls> <source src="./sample/01ang_Gemep.mp3" type="audio/mpeg"/> </audio>

--

**Anger!**

--

<audio controls> <source src="./sample/01anx_Gemep.mp3" type="audio/mpeg"/> </audio>

--

**Anxiety!**

--

<audio controls> <source src="./sample/01con_Gemep.mp3" type="audio/mpeg"/> </audio>

--

**Contempt!**

---

# System

AudEERING is a German company founded in 2012. They provide openSMILE, an open-source toolkit for real-time audio feature extraction as well as emotion recognition: https://github.com/naxingyu/opensmile

They have also developed SensAI, an API solution to analyse emotions from voice features (Eyben, Huber, Marchi, Schuller, & Schuller, 2015). Emotions are recognized using a long short-term memory (LSTM) recurrent neural network (RNN) based on, in some configurations, more than 1,000 acoustic features.

---

# System

More precisely, SensAI can recognize:

1. Overall emotion
2. Probability of expressing emotion categories: passion, panic, nervousness, **disgust**, **contentment**, affection, **fear**, **irritation**, satisfaction, frustration, enthusiasm, worry, boredom, **interest**, tension, **joy**, depression, stress, **pride**, excitement, **sadness**, **anger**, relaxation, happiness
3.
Probability of expressing emotion dimensions: valence, activation, dominance

**Bold labels** correspond to emotions matching those expressed in the GEMEP.

---

class: inverse, center, middle

# Results

---

# Matching Overall Emotion Recognized

The average proportion of correct overall recognition by SensAI among these emotions is 7.45%.

<img src="ISRE_presentation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

# Matching the Highest Emotion Label

The average proportion of cases in which SensAI's highest label matched the expressed emotion is 0%.

When only the matching labels are considered, the average proportion of correct highest-label recognition is 18.2% (CI95% [11.5%, 26.7%], *p* = 0.9944).

<img src="ISRE_presentation_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Matching Dimensional Emotion: Valence

The average proportion of correct valence matching by SensAI among these emotions is 57.3% (CI95% [48.8%, 65.6%], *p* = 0.9495).

<img src="ISRE_presentation_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

# Matching Dimensional Emotion: Arousal

The average proportion of correct arousal matching by SensAI among these emotions is 74.8% (CI95% [66.9%, 81.7%], *p* = 0.0288).
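As an illustrative sketch of how such a proportion and its 95% CI can be computed (using a Wilson score interval; whether this is the exact interval method used for these slides is an assumption, and the success/trial counts below are hypothetical):

```python
# Sketch: 95% Wilson score interval for a matching proportion.
# Counts are hypothetical; the slides' exact CI method is an assumption.
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for k successes out of n trials."""
    p = k / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# e.g. 82 correct arousal matches out of 110 portrayals (hypothetical)
low, high = wilson_ci(82, 110)
```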
<img src="ISRE_presentation_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Discussion

* The categorical recognition of emotion remains a challenge:
  * diversity of affective states
  * heterogeneity of categories within a language and between languages
  * overlap of categories
* However, the accuracy of a system like SensAI Emotion provides promising results in the recognition of valence and arousal
* The results of this automatic recognition system need to be compared with those of other systems in order to evaluate their relative accuracy
* Different databases need to be investigated as well:
  * Sentences *vs.* pseudosentences
  * Posed vocal expressions *vs.* spontaneous vocal expressions
  * Vocal expressions of emotions or of social messages?

---

class: inverse, center, middle

# Thanks for your attention!

*Accuracy of Automatic Emotion Recognition from Voice*

.pull-left[
Damien Dupré
Dublin City University
damien.dupre@dcu.ie
]

.pull-right[
Gary McKeown
Queen's University Belfast
g.mckeown@qub.ac.uk
]