🍒 Cherry Lip Sync

Nathan Whitehead

Boise Code Camp 2025

<a href="https://github.com/nwhitehead/cherry-lip-sync">
 <svg xmlns="http://www.w3.org/2000/svg" width="48" height="48" viewBox="0 0 24 24"><path fill="currentColor" d="M12 2A10 10 0 0 0 2 12c0 4.42 2.87 8.17 6.84 9.5c.5.08.66-.23.66-.5v-1.69c-2.77.6-3.36-1.34-3.36-1.34c-.46-1.16-1.11-1.47-1.11-1.47c-.91-.62.07-.6.07-.6c1 .07 1.53 1.03 1.53 1.03c.87 1.52 2.34 1.07 2.91.83c.09-.65.35-1.09.63-1.34c-2.22-.25-4.55-1.11-4.55-4.92c0-1.11.38-2 1.03-2.71c-.1-.25-.45-1.29.1-2.64c0 0 .84-.27 2.75 1.02c.79-.22 1.65-.33 2.5-.33s1.71.11 2.5.33c1.91-1.29 2.75-1.02 2.75-1.02c.55 1.35.2 2.39.1 2.64c.65.71 1.03 1.6 1.03 2.71c0 3.82-2.34 4.66-4.57 4.91c.36.31.69.92.69 1.85V21c0 .27.16.59.67.5C19.14 20.16 22 16.42 22 12A10 10 0 0 0 12 2"/></svg>
 nwhitehead/cherry-lip-sync
 </a> 
 <a href="https://bsky.app/profile/shimmermathlabs.com">
 <svg xmlns="http://www.w3.org/2000/svg" width="48" height="48" viewBox="0 0 24 24"><path fill="currentColor" d="M12 11.388c-.906-1.761-3.372-5.044-5.665-6.662c-2.197-1.55-3.034-1.283-3.583-1.033C2.116 3.978 2 4.955 2 5.528c0 .575.315 4.709.52 5.4c.68 2.28 3.094 3.05 5.32 2.803c-3.26.483-6.157 1.67-2.36 5.898c4.178 4.325 5.726-.927 6.52-3.59c.794 2.663 1.708 7.726 6.444 3.59c3.556-3.59.977-5.415-2.283-5.898c2.225.247 4.64-.523 5.319-2.803c.205-.69.52-4.825.52-5.399c0-.575-.116-1.55-.752-1.838c-.549-.248-1.386-.517-3.583 1.033c-2.293 1.621-4.76 4.904-5.665 6.664"/></svg>
 @shimmermathlabs.com
 </a> 
 <a href="mailto:nwhitehe@gmail.com">
 <svg xmlns="http://www.w3.org/2000/svg" width="48" height="48" viewBox="0 0 24 24"><path fill="currentColor" d="M20 4H4c-1.1 0-1.99.9-1.99 2L2 18c0 1.1.9 2 2 2h16c1.1 0 2-.9 2-2V6c0-1.1-.9-2-2-2m0 4l-8 5l-8-5V6l8 5l8-5z"/></svg>
 nwhitehe@gmail.com
 </a> 
 
 Slides at
 <a href="https://nwhitehead.github.io/slides/">
 nwhitehead.github.io/slides/
 </a> 
 <aside class="notes">
Some links. Also the slides I'm presenting are available on github.
 </aside>

<!-- ---
## Agenda

1. Why lip sync?
2. Data where?
3. Model how?
4. Deployment
5. Final thoughts -->

---
## Why lip sync?

<aside class="notes">
<ul>
<li>My original motivation was that I was working on a project with voice recognition and text-to-speech voice synthesis.</li>
<li>I wanted to connect make avatars talk on screen as a kind of "virtual tutor"
for education.</li>
<li>The hardest part was getting the lip sync looking
reasonable. My initial attempts looked terrible.</li>
</ul>
</aside>
---
Cinema 
100 years of talking characters

<img width="180" src="gfx/kentucky.png" /> My Old Kentucky Home (1926) 
 <img width="180" src="gfx/snowwhite.png" /> Snow White (1937) 
 <img width="180" src="gfx/chihiro.png" /> Spirited Away (2001) 
 <img width="180" src="gfx/transformer.png" /> Transformers One (2024)

Note:
* As soon as films had sound, animators wanted talking and singing characters
* Lip sync starts at ad hoc drawing
* to detailed research using photographs and experiments for realism
* to stylized choices which simplify animation but keep emotion
* to digital motion capture and 3D rendering
* Here everything is pre-rendered.

---
Video Games 
30 years of talking characters

 <img width="180" src="gfx/fullthrottle.jpg" /> Full Throttle (1995) 
 <img width="180" src="gfx/alyx.jpg" /> Half Life 2 (2005) 
 <img width="180" src="gfx/mafia2.jpg" /> Mafia 2 (2010) 
 <img width="180" src="gfx/last.png" /> The Last of Us: Part II (2020)

Note:
* Games started with voice acting and pixel graphics
* Moved to low poly count 3D models
* Then high poly count 3D models
* Modern games have arguably better lip sync technology than movies (longer content, high player expectations)
* Games often have recorded animations, rendered in realtime.

---
VTubers 
10 years of digital puppetry

 <img width="130" src="gfx/kizunaai.png" /> Kizuna AI (2016-) 
 <img width="180" src="gfx/opera.png" /> Ironmouse (2017-) 
 <img width="180" src="gfx/gawr.png" /> Gawr Gura (2020-) 
 <img width="150" src="gfx/kuzuha.png" /> Kuzuha (2020-)

Note:
* VTubing term popularized by Kizuna AI, but VTubing goes back further
* Live-streamed performances, realtime animation
* Can involve singing/performance, talking, commentary on games

---
What do kids want to be when they grow up?
(_Japan_)

 Teacher 6.5% 
 Illustrator 5.8% 
 Singer 5.2% 
 VTuber 4.6% 
 Actor 4.3% 
 YouTuber 3.5% 
 Doctor 3.5% 
 Idol 3.5% 
 Musician 3.4% 
 Civil... 3.2% 

<a href="https://kids.nifty.com/research/work_20250102/">https://kids.nifty.com/research/work_20250102/</a>

Note:
* If you are older like me you might think this is a fringe thing
* Actually hugely popular, especially with younger audience
* In Japan more aspirational than YouTuber

---
What do kids want to be when they grow up?
(_USA_)

 V/YouTuber 29% 
 Teacher 26% 
 Athlete 23% 
 Musician 19% 
 Astronaut 11% 

<a href="https://www.prnewswire.com/news-releases/lego-group-kicks-off-global-program-to-inspire-the-next-generation-of-space-explorers-as-nasa-celebrates-50-years-of-moon-landing-300885423.html">https://www.prnewswire.com/news-releases/lego-group-...-300885423.html</a>

Note:
* Data for USA does not distinguish YouTuber and VTuber
* MOST popular occupation for elementary aged kids now

---

> allowing yourself to speak through a character not only makes it more comfortable but also helps you find your voice as well
>
> watching your own character babble to your own words in real time is always fun
>
> — <cite>luna olmewe (creator of Veadotube)</cite>

Note:
* This is the attraction of VTubing compared to face recording
* Iron Mouse says it's like a "superhero costume"

---

### Veadotube

<a href="https://veado.tube/">https://veado.tube/</a>

Note:
* Lets you organize layers of images into states
* State machine logic in graph at bottom
* Audio volume can flip between "mouth closed" and "screaming" state

---
### 2D Lip Sync

Input
 
 <img src="gfx/output.png" />

Note:
* For our model, we want audio input
* Output is which lip shape to show at which time

---
### Output

2D lip sync uses 4 to 16 visemes.

Note:
* Output can be represented as timestamps and viseme
* Example set of visemes (one possible style)
* I picked 12 to match common practice for hand animation

---
## Data where?
* Audio
* Annotations

Note:
* Now 2D lip sync problem is defined, need training data
* We need audio data
* We need output annotations for that audio

---
### Audio

LibriSpeech
* 1000 hours of reading public domain books
* Mostly English
* 2500 unique speakers
* Gender balanced
* Transcribed, segmented, filtered

<a href="https://www.openslr.org/12">https://www.openslr.org/12</a> (CC-BY) 😺

Note:
* Other options also available
* Mozilla has huge public dataset of voice recordings

---
### Rhubarb lip sync

<figure>
<img src="gfx/thimbleweed.png" width="500px"/>
<figcaption>Thimbleweed Park (2014)</figcaption>
</figure>

* audio → phonemes (pre deep learning)
* phonemes → visemes (fixed algorithm)
<a href="https://github.com/DanielSWolf/rhubarb-lip-sync">https://github.com/DanielSWolf/rhubarb-lip-sync</a> (MIT BSD) 😺

Note:
* Thimbleweed Park was part of retro adventure game resurgence
* Rhubarb created to allow smaller dev team to make game with voice acting
* Recognizing phonemes is done with older algorithm, not as good as newer ones
* Fixed algorithm has many heuristics and rules, searches for best answer

---
### Oculus Lip Sync
<img src="gfx/lips.gif" />

* audio → viseme weights
* _Unity_ and _Unreal Engine_ plugins

<a href="https://developers.meta.com/horizon/documentation/native/audio-ovrlipsync-native">https://developers.meta.com/horizon/documentation/native/audio-ovrlipsync-native</a>

(SDK license, closed source model) 😐️

Note:
* SDK for developing virtual reality apps
* Freely available, but deprecated
* Doesn't support all platforms

---
### Adobe Animate
<img src="gfx/adobe_animate.webp" />

* audio → keyframes

<a href="https://arxiv.org/abs/1910.08685">D. Aneja, W. Li. Real-Time Lip Sync for Live 2D Animation (2019)</a>

(Patented proprietary model) 😭

Note:
* Relatively new functionality in Animate

---
### MeloTTS

<div class="container">
 <div class="col">
 <img src="gfx/vits_infer.png" />
 </div>
 <div class="col">
<ul>
<li>text → phonemes → timing → audio</li>
<li>phonemes → visemes (fixed algorithm)</li>
</ul>
 </div>
</div>
<a href="https://github.com/myshell-ai/MeloTTS">https://github.com/myshell-ai/MeloTTS</a> (MIT) 😺

Note:
* Text to phonemes here uses standard libraries
* Picking timing information is learned from annotated audio data
* I hacked the MeloTTS implementation to output phoneme timing information

---
### Phoneme Map code

```python
visemes = { 'X', 'A', 'B', 'C', 'D', 'E', 'F', 'G',
            'H', 'I', 'J', 'K' }

# Map goes from phoneme to list of visemes.
p2v_map = {
    'AH': ['D'],
    'AO': ['D'],
    'AW': ['D', 'F'],
    'AY': ['D', 'I'],
    'B': ['A'],
    ...
}
```

Note:
* Some phonemes like AY can involve lip shape changes
* Some phonemes work with more than 1 viseme (mouth more open or closed etc.),
making choices here

---
```python
for i, (ph, t) in list(enumerate(zip(phone_list, timestamps))):
    if ph in p2v_map:
        v = p2v_map[ph]
    else:
        v = ['sil']
    visemes.extend(v)
    end_time = timestamps[i + 1]
    n = len(v)
    # Evenly space timing if more than 1 viseme
    vtimestamps.extend([
        round(t + i / n * (end_time - t))
        for i in range(n)
    ])
```

Note:
* This code maps from phoneme timing information to viseme timing information

---
### Audio to Vectors

<div class="container">
 <div class="col">
 <img src="gfx/stft_output.png" width="600px"/>
 </div>
 <div class="col">
 
 Resample to 16 kHz 
 Window length 25 ms 
 Hann window 
 Hop length 10 ms 
 Group FFT into 13 bins 
 100 vectors per second 
 Research into voice audio goes back to invention of telephone, all this stuff is just package defaults in <code>torchaudio</code>.
 
 </div>
</div>

 Diagram from <a href="https://www.mathworks.com/help/dsp/ref/dsp.stft.html">https://www.mathworks.com/help/dsp/ref/dsp.stft.html</a>

Note:
* Could give audio samples directly to model, but there is lots of work
on how audio works and on human voice in particular.
* Map audio samples into overlapped windowed chunks that are
converted to frequency domain and grouped into bins.

---
### Building Dataset

| | Data | Creation time |
| -------- | ---- | ------------- |
| Manual annotation | 30 sec | 8 hours |
| MeloTTS tweaks | 10 min | 8 hours |
| Tool output review | 60 min | 8 hours |

Total about 1 hour of training data. (235 MB)

Note:
* I spend a day doing each row.
* Tweaking MeloTTS was choosing good mapping for p2v for each sentence, tweaking various filter
parameters.
* Output review was doing bulk conversion of 5 minutes of audio at a time, then cutting out bad parts.

---
```python
import torchaudio
import pandas as pd
# ...
samples, rate = torchaudio.load('./data/audio-600.mp3')
samples = samples.numpy()[0] # Take left channel samples
# ...
entries = []
visemes = []
# ... Parse XML file output in loop
 visemes.append(viseme)
# ... Fixup visemes to be at 100 Hz
entries.append({ 'audio': samples, 'visemes': visemes })
# ...
data = pd.DataFrame(entries)
data.to_parquet('./data/lipsync.parquet')
```
<a href="https://pandas.pydata.org/">https://pandas.pydata.org/</a>

Note:
* Did some work to take directories full of `.mp3` and `.tsv` files into parquet
* I know almost nothing about `pandas` and `parquet` and got it working in 1 day
* Very nice to have 1 file with all training data

---
## How lip sync?

Let's build up the full model in PyTorch

and speedrun the history of deep learning.
---
### Deep Learning

<a href="https://udlbook.github.io/udlbook/">https://udlbook.github.io/udlbook/</a> (free) 😺

Note:
* Deep Learning is a vast field, lots of stuff changing.
* This book is good, somewhat dense mathematically.
* pytorch.org has lots of docs, examples if you don't care about math.

---
### Supervised Learning

* Model maps input → output
* Universal function approximation

Note:
* "AI" is kind of big nebulous term
* "Machine learning" is a bit more specific, give data to a model and update model
* "Deep learning" is more specific, stack neural layers in the model
* "Supervised" means our training data has outputs
* Big enough model can learn any function

---
### Smallest Model - Fit a Line

* One number input
* One number output

(Gauss, 1800)

Note:
* Start with simplest model, line fitting 2D points

---
<video data-autoplay src="gfx/SGD.mp4" poster="gfx/SGD.png"></video>

Note:
* Model is a line
* Initial guess for model parameters not great
* Loss function measures how bad the model is doing for some set of inputs
* Update rule uses calculus to nudge model parameters in right direction
* Subtraction of derivatives means we try to nudge loss down
* After a few steps the model is doing better
* You don't need to understand the math, just for fun

---
<video data-autoplay src="gfx/Loss.mp4" poster="gfx/Loss.png"></video>

Note:
* Loss starts big, gets smaller quickly
* Starts leveling off after more training
* Never gets to 0, best line does not fit all points

---
### PyTorch Code

```python
class SimpleModel(torch.nn.Module): 
    def __init__(self): 
        super().__init__()
        # Model has single linear layer
        # 1 scalar input, 1 scalar output
        self.layer = torch.nn.Linear(1, 1)

def forward(self, x):
        return self.layer(x)
```
---
```python
# Instantiate our model
model = SimpleModel()
# Define loss function
criterion = torch.nn.MSELoss()       # Mean-squared error
# Create SGD-type optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(3000):
    optimizer.zero_grad()            # Clear gradients
    pred_y = model(data_x)           # Get predictions
    loss = criterion(pred_y, data_y) # Compute loss
    loss.backward()                  # Compute gradient
    optimizer.step()                 # Update model parameters
```
Note:
* Made some choices, such as learning rate, max epochs

---
### What is PyTorch doing for us?

🧱 Providing blocks (e.g. <code>torch.nn.Linear</code>) 
 🤡 Random model initialization 
 👜 Batching, streaming data 
 🪄 Computing gradients automagically 
 ⚡ Updating model 
 😎 No math required 

---
### Artificial Neural Network

* Multiple inputs
* Multiple outputs
* Non-linear activation

(Rosenblatt, 1950)
---
<video data-autoplay src="gfx/LinearScene.mp4" poster="gfx/LinearScene.png"></video>

Note:
* Inputs on left
* Gray lines indicate how much weight each connection has
* Right side does weighted sum, then passes through sigmoid function (called activation)
* Maybe easier to understand as bottom formula with matrix

---
<video data-autoplay src="gfx/LinearScene2.mp4" poster="gfx/LinearScene2.png"></video>

Note:
* Abstract the neural network idea into one box

---
<video data-autoplay src="gfx/MultiInput.mp4" poster="gfx/MultiInput.png"></video>

Note:
* We will also need a version that can do two inputs.
* Use a different matrix for each input.
* Inputs can be different sizes, just outputs have to match.

---
```python
class NeuralNetModel(torch.nn.Module): 
    def __init__(self): 
        super().__init__()
        # Input length 6, output length 5
        self.linear = torch.nn.Linear(6, 5)
        # Sigmoid activation
        self.activation = torch.nn.Sigmoid()

def forward(self, x):
        return self.activation(self.linear(x))

```
---
### Multi-Layer Neural Network

* Stacked layers
* Linear / non-linear activations alternate

(Hinton, 1986)

Note:
* One layer is limited
* Stacking layers makes model more capable

---
<video data-autoplay src="gfx/MLP.mp4" poster="gfx/MLP.png"></video>

Note:
* Stack layers, output of upper layer matches input of next layer
* Input is 783 pixels from 28x28 grid
* Output is 10 values, with 1 on correct digit (called one-hot encoding)
---
```python
class MLP(torch.nn.Module): 
    def __init__(self): 
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(784, 200),
            torch.nn.Sigmoid(),
            torch.nn.Linear(200, 200),
            torch.nn.Sigmoid(),
            torch.nn.Linear(200, 10),
            torch.nn.Sigmoid(),
        )

def forward(self, x):
 return self.net(x)
---
<!-- ### MNIST

* Digit recognition of 28x28 grayscale images
* MLP trains in a couple minutes on CPU
* Ignores spatial structure

<a href="https://en.wikipedia.org/wiki/MNIST_database">(MNIST 1994)</a>
--- -->
### Deep Learning

<div class="container">
 <div class="col">
 <img src="gfx/xkcd_1838_machine_learning.png" />
 <a href="https://xkcd.com/1838/">https://xkcd.com/1838/</a>
 </div>
 <div class="col">
 
 How do we use spatial structure? 
 How do we train deeper networks? 
 
 🕗 How do we use time?
 
 </div>
</div>

Note:
* At this point things start to become "deep learning".
* At heart a pile of linear algebra with data flowing in.

---
### Time Series

Note:
* We have vectors coming in every 0.01 second.
* Might need to keep track of what happened in past somehow.
* Add explicit "current state" H.
* Our model takes in H and X, outputs new state.

---
### GRU - Gated Recurrent Unit
* Explicit internal state
* Old state and current input → candidate state
* Reset gate - How much to ignore old state?
* Update gate - How much to update state?

(Cho et al., 2014)

Note:
* Line of research figuring out what to put in the box.
* GRU is one nice answer.
---
<video data-autoplay src="gfx/GRUScene.mp4" poster="gfx/GRUScene.png"></video>

Note:
* Build up more complicated units out of previously defined units and simple math operations.
* Block diagram on left is same as equations on right.

---
<video data-autoplay src="gfx/GRUBoxScene.mp4" poster="gfx/GRUBoxScene.png"></video>

Note:
* Abstract the complexity into one box again.
---
<video data-autoplay src="gfx/LipSyncGRU.mp4" poster="gfx/LipSyncGRU.png"></video>

Note:
* Our first model is just one GRU and a linear layer
* The larger internal size 32 gives model "space to think"
* Different contexts might require same viseme
---
```python
class SelectItem(nn.Module):
    # ...constructor takes 1 arg 'index'
    def forward(self, inputs):
        return inputs[self.index]

class LipSyncGRU(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # batch_first means input is (Batch, Time, D_in)
            nn.GRU(26, 32, batch_first=True),
            SelectItem(0), # Take first output
            nn.Linear(32, 12))
    def forward(self, x):
        return self.net(x)
```

Note:
* The GRU layer has multiple outputs, so some bookkeeping

---
### Training LipSync (v1) CPU

00:02:38 on AMD Ryzen 5 3600 (4.1 GHz)

Note:
* Demo is real recording, sped up.
* Server is a 5 year old gaming rig, perfectly fine.

---
### Logging and Validation
```python
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
# ... setup
for epoch in range(epochs):
    # ... normal training, use data_x and data_y (90%)
    # Validate, use validate_x and validate_y    (10%)
    with torch.no_grad(): # turn off learning
        pred_y = model(validate_x)
        validate_loss = criterion(pred_y, validate_y)
    # Log how we are doing
    writer.add_scalars('Loss', {
        'Train': training_loss,
        'Validate': validate_loss,
    }, epoch)
```

Note:
* Want to plot loss on training set.
* Split dataset into 90% training and 10% validation.
* Validation lets us see if our training is working.
* Use tensorboard to log these.
* Validation loss is same as training but without grads.

---
### TensorBoard

Note:
* Training seems to be working.
* Validation loss levels out a bit higher than training loss.

---
### Torch: GPU

```python
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available()
                      else 'cpu')
# ...
# OLD
data_x = data['spectrogram']
data_y = data['visemes']

# NEW
data_x = data['spectrogram'].to(device)
data_y = data['visemes'].to(device)
```

Note:
* Lets make it faster by running on GPU.
* In torch this is easy.
* Detect appropriate device, then move data to device.

---
### Training LipSync (v1) GPU

00:01:01 on NVIDIA RTX 3090 FE

Note:
* More than 2x faster, would also be this fast on smaller GPUs.

---
### Tweak it

* Bigger hidden state?
* Stack more layers?
* More bins in spectrogram on input?

Note:
* To improve model, we can try changing things and retraining.
* After a few attempts, updated version uses 2 GRU layers and
state of size 80.
---
<video data-autoplay src="gfx/LipSyncGRU2.mp4" poster="gfx/LipSyncGRU2.png"></video>
---
### LipSync (v2)

```python
class LipSyncGRU(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.GRU(26, 80, batch_first=True),
            SelectItem(0),
            nn.GRU(80, 80, batch_first=True),
            SelectItem(0),
            nn.Linear(80, 12),
        )

def forward(self, x):
        return self.net(x)
```

Note:
* Not too many changes in model code.
* Second GRU is 80, 80 instead of 26, 26.

---
### TensorBoard

Overfitting!

Note:
* This time we see a problem.
* Validation bottoms out at epoch 30-40, then rises.
* Training loss keeps going down.
* This indicates our model thinks it is doing a better and better job,
but actually it is not doing a better job on data it has not seen.
* It is memorizing the training data.

---
### Extra tricks

* Dropout
    * Randomly cut out parts of data.
    * Practice with random players benched.
    * _Force team to not depend on star players._

```python
# Dropout layer with 20% coverage
# Put anywhere, fits any shape data
nn.Dropout(p=0.2)
```

Note:
* Dropout layers only affect training.
* Hopefully prevents memorizing input.

---
### Extra Tricks

* `BatchNorm`
    * Automatically scale input vectors.
    * Treats each channel independently.
    * _Like an audio compressor._

```python
# Input is (Batch, Time, D_in)
Permute((0, 2, 1)),
# BatchNorm1d needs (Batch, D_in, Time)
nn.BatchNorm1d(input_size),
# Go back to normal
Permute((0, 2, 1)),
```

Note:
* Normalization can help the model focus on
what is important, not get distracted by big
numbers.

---
### LipSync (v3)

```python
self.net = nn.Sequential(
    Permute((0, 2, 1)),      # NEW BatchNorm
    nn.BatchNorm1d(26),      # NEW BatchNorm
    Permute((0, 2, 1)),      # NEW BatchNorm
    nn.Dropout(p=0.2),       # NEW Dropout
    nn.GRU(26, 80, batch_first=True),
    SelectItem(0),
    nn.Dropout(p=0.2),       # NEW Dropout
    nn.GRU(80, 80, batch_first=True),
    SelectItem(0),
    nn.Linear(80, 12),
)
```
---
### TensorBoard

Note:
* Validation loss levels out now

---
### Demo

Note:
* Demo is my own hand-drawn frames.
* Audio is a synthetic voice that had trouble in Rhubarb (original motivation for project).
* Created with Python script.

---

## Deployment

---
### Saving and Loading

```python
# After training, save the model
torch.save(model.state_dict(), 'model.pt')
# ... in a different script
# Load the model
model = LipSyncGRU() # at this point it has random weights
modem.load_state_dict(
    torch.load('model.pt', weights_only=True,
        # Switch to using CPU after training on GPU
        map_location=torch.device('cpu'))
)
```

Note:
* First step is to distribute model weights after training.
* This requires user of model to have definition of model in torch.
* Simple, popular on huggingface.

---
### Export to ONNX

```python
# ... Load model from weights
# Turn off training
model.eval()
# Send through one round of random input
x = torch.randn(1, max_time, feature_dims)
_ = model(x)
# Save as ONNX file
torch.onnx.export(model, x, 'model.onnx',
    export_params=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': { 1: 'time' },
        'output': { 1: 'time'}})
```

Note:
* ONNX is model exchange format, standardized
* Best for smaller models (not great for LLMs yet)
* Includes description of architecture of model and trained weights.

---
### ONNX Runtime (ORT)

* Standard model interchange format
* All languages can use it, CPU, GPU
* Works in the browser with HW acceleration (!)

(Getting lip sync working in browser in realtime is work-in-progress.)
---

## Final Thoughts

---

### Model Tips

🧩 Match input/output shapes 
 ⛔ Avoid NANs (but math is OK) 
 🔪 Don't do in-place modification 
 
 <img src="gfx/bench.png" /> Use all the tricks (dropout, BatchNorm) 

---

### Deep Learning Plan

* Start with data
* Clean your data
* Compact your data
* Quick turnaround means progress (<10 min)
* Small models can be awesome

---
### Recommended tech

```python
uv          # Python environment management
python      # For training and model development
torch       # Deep learning framework
torchaudio  # Audio torch stuff
torchvision # Image and video torch stuff
pandas      # Data management in python
tensorboard # Viewing training progress graphs
tqdm        # Pretty progress bars
onnxruntime # Deploying and sharing models
```

---

> Anyone can cook a machine learning model... but only the fearless can be great.
>
> — <cite>Gusteau, famous chef in **Ratatouille**, probably</cite>

---
The End
---

### Slides

<ul>
 <li>Slides made with <a href="https://revealjs.com/">RevealJS</a>.</li>
 <li>Animations made with <a href="https://www.manim.community/">Manim</a>.</li>
 <li>Recordings made with <a href="https://asciinema.org/">asciinema</a>.</li>
</ul>