Abstract
In this talk we will cover the complex media processing pipelines behind media AI models (translation, lip syncing, text-to-video, etc.).
AI models are very picky about media specs (resolution, frame rate, sample rate, colorspace, etc.). Moreover, the AI tools you see that, for instance, generate a fantastic, sharp video from a text prompt are usually several AI models working in conjunction, each with its own constraints. A rough sketch of what that conforming step can look like is shown below.
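As a hedged illustration of that pickiness, here is a minimal Python sketch that conforms an input clip to one model's assumed specs before inference. The exact targets (1280x720, 24 fps, 16 kHz mono, bt709) and file names are hypothetical, not the real production values.

```python
# Hypothetical sketch: conform a source clip to the specs one downstream
# model is assumed to expect. The concrete numbers here are illustrative.
import subprocess

def normalize_for_model(src: str, dst: str) -> None:
    """Rescale, retime, and re-tag a clip so the model gets what it expects."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            # resolution / frame rate / pixel format for the video model
            "-vf", "scale=1280:720,fps=24,format=yuv420p",
            # tag the output color metadata as bt709
            "-color_primaries", "bt709", "-colorspace", "bt709", "-color_trc", "bt709",
            # sample rate / channel count for the audio model
            "-ar", "16000", "-ac", "1",
            dst,
        ],
        check=True,
    )

normalize_for_model("input.mp4", "conformed_for_model_a.mp4")
```

In a chain of models, a step like this typically sits between every pair of stages, since each model's constraints can differ.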
Our team is responsible for ingesting 1 BILLION media assets DAILY and delivering 1 TRILLION views every 24h; for that we use (highly optimized) typical media processing pipelines. In this talk we will explain how we leveraged all of that experience and those building blocks to add media AI inference as another offering of those pipelines: now you can upload an asset, deliver it with ABR encodings + CDN, and ALSO alter its content via AI (e.g., add a hat to all the dogs in the scene).
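A minimal sketch of that idea, treating AI inference as just another step between ingest and delivery. The `Pipeline`/`Step` classes and step names are hypothetical, not the real production API.

```python
# Hypothetical sketch: AI inference slotted into an ingest -> transform ->
# deliver pipeline alongside the classic ABR + CDN steps.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    params: dict = field(default_factory=dict)

@dataclass
class Pipeline:
    steps: list[Step] = field(default_factory=list)

    def run(self, asset_uri: str) -> None:
        for step in self.steps:
            print(f"running {step.name} on {asset_uri} with {step.params}")

pipeline = Pipeline(steps=[
    Step("ingest"),                                                  # probe + validate the upload
    Step("ai_edit", {"prompt": "add a hat to all the dogs"}),        # GPU inference stage
    Step("abr_ladder", {"renditions": ["1080p", "720p", "480p"]}),   # classic ABR encodes
    Step("publish_cdn"),                                             # package + push to the CDN
])
pipeline.run("s3://bucket/upload.mp4")
```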
And all of that while trying NOT to break the bank (GPU time is really expensive). We think this talk could be useful to reveal the hidden complexities of delivering AI, especially at scale.

This talk was presented at Demuxed 2025 in London, a conference by and for engineers working in video. Every year we host a conference with lots of great new talks like this.