Nathan Whitehead
Boise Code Camp 2025
nwhitehead/cherry-lip-sync
nathanwhitehead.bsky.social
nwhitehe@gmail.com
Slides at nwhitehead.github.io/slides/
My Old Kentucky Home (1926) Snow White (1937) Spirited Away (2001) Transformers One (2024)
Full Throttle (1995) Half Life 2 (2005) Mafia 2 (2010) The Last of Us: Part II (2020)
Kizuna AI (2016-) Ironmouse (2017-) Gawr Gura (2020-) Kuzuha (2020-)
Teacher 6.5%, Illustrator 5.8%, Singer 5.2%, VTuber 4.6%, Actor 4.3%, YouTuber 3.5%, Doctor 3.5%, Idol 3.5%, Musician 3.4%, Civil... 3.2%
V/YouTuber 29%, Teacher 26%, Athlete 23%, Musician 19%, Astronaut 11%
Input
Resample to 16 kHz
Window length 25 ms
Hann window
Hop length 10 ms
Group FFT into 13 bins
100 vectors per second
Research into voice audio goes back to the invention of the telephone; all of these settings are just package defaults in torchaudio.
Diagram from https://www.mathworks.com/help/dsp/ref/dsp.stft.html
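The framing numbers above can be checked with a few lines of plain Python. This is a minimal sketch, not the talk's actual pipeline: it covers only the windowing and hop arithmetic (no FFT, no grouping into 13 bins), and the names `hann`, `frame_signal`, and the 440 Hz test tone are made up for illustration.

```python
import math

SAMPLE_RATE = 16_000  # Hz, after resampling
WINDOW_MS = 25        # window length from the slide
HOP_MS = 10           # hop length from the slide

win_length = SAMPLE_RATE * WINDOW_MS // 1000  # 400 samples per window
hop_length = SAMPLE_RATE * HOP_MS // 1000     # 160 samples between window starts
frames_per_second = 1000 // HOP_MS            # 100 window starts per second


def hann(n):
    """Periodic Hann window of length n (hypothetical helper)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def frame_signal(samples, win_length, hop_length):
    """Slice samples into overlapping frames and taper each with a Hann window."""
    window = hann(win_length)
    frames = []
    for start in range(0, len(samples) - win_length + 1, hop_length):
        chunk = samples[start:start + win_length]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# One second of a 440 Hz tone as stand-in input (an assumption, not real speech).
one_second = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = frame_signal(one_second, win_length, hop_length)
```

Without padding, one second of audio yields 98 full frames here; libraries that pad the signal at the edges land at roughly 100 per second, which matches the "100 vectors per second" figure.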
🧱 Providing blocks (e.g. torch.nn.Linear)
🤡 Random model initialization
👜 Batching, streaming data
🪄 Computing gradients automagically
⚡ Updating model
😎 No math required
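Each item in the list above maps onto a line or two of a standard PyTorch training loop. The sketch below is a toy, not the lip-sync model: the data (`y = 2x + 1`), the learning rate, and the batch size are all assumptions chosen so it runs in a moment on a CPU.

```python
import torch

torch.manual_seed(0)  # make the random initialization repeatable

# Toy data (hypothetical): learn y = 2x + 1 from 64 points.
x = torch.linspace(-1, 1, 64).unsqueeze(1)  # shape (64, 1)
y = 2 * x + 1

# Block: a single linear layer, randomly initialized by PyTorch.
model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Batching: DataLoader shuffles and groups the samples.
dataset = torch.utils.data.TensorDataset(x, y)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)


def full_loss():
    """Loss over the whole dataset, computed without tracking gradients."""
    with torch.no_grad():
        return loss_fn(model(x), y).item()


initial_loss = full_loss()
for epoch in range(50):
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)  # forward pass
        loss.backward()                # gradients computed automatically
        optimizer.step()               # update the model parameters
final_loss = full_loss()
```

No derivative is written by hand anywhere: `loss.backward()` fills in the gradients and `optimizer.step()` applies them.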
How do we use spatial structure?
How do we train deeper networks?
🕗 How do we use time?
00:02:38 on AMD Ryzen 5 3600 (4.1 GHz)
00:01:01 on NVIDIA RTX 3090 FE
🧩 Match input/output shapes
⛔ Avoid NaNs (but math is OK)
🔪 Don't modify tensors in place
Use all the tricks (dropout, BatchNorm)