Microsoft just dropped VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

All you need is 1 image + 1 audio file.

Beyond lip-audio sync, it also captures a large spectrum of emotions, facial expressions, and head motions for realism and liveliness.

Single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.