Here's the video they produced:
It's spectacular, no?
Here's their technical report, Video generation models as world simulators:
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
I think the title's a bit much, but business is business.
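To make the report's "spacetime patches" idea a bit more concrete: the transformer doesn't see frames or pixels, it sees small blocks of the video latent that span a few frames and a small spatial region, each flattened into a token. Here's a minimal sketch in PyTorch of that patchification step; the patch sizes, latent shape, and tensor layout are my own illustrative assumptions, since the report doesn't specify them:

```python
import torch

def spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Cut a video latent of shape (T, C, H, W) into spacetime-patch tokens.

    A minimal sketch of the 'spacetime patches' idea; pt/ph/pw (frames and
    spatial extent per patch) are illustrative guesses, not Sora's values.
    """
    T, C, H, W = latent.shape
    # Split time, height, and width into patch-sized chunks.
    x = latent.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Group the patch-grid axes first, the within-patch axes last.
    x = x.permute(0, 3, 5, 1, 2, 4, 6)        # (t, h, w, pt, C, ph, pw)
    # Flatten each patch into one token vector for the transformer.
    return x.reshape(-1, pt * C * ph * pw)

# e.g. a 16-frame latent with 4 channels at 32x32 spatial resolution
tokens = spacetime_patches(torch.randn(16, 4, 32, 32))
print(tokens.shape)  # torch.Size([2048, 32]): 8*16*16 tokens of dim 2*4*2*2
```

The payoff of this representation is that video becomes an ordinary token sequence, so the same transformer machinery that scaled on text can be scaled on video, which is what the abstract's closing claim leans on.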
While admitting that the videos are "spectacular...cinematic," Gary Marcus is predictably skeptical, as one should be. His conclusion:
And importantly, I predict that many will be hard to remedy. Why? Because the glitches don’t stem from the data, they stem from a flaw in how the system reconstructs reality. One of the most fascinating things about Sora’s weird physics glitches is that most of these are NOT things that appear in the data. Rather, these glitches are in some ways akin to LLM “hallucinations”, artifacts from (roughly speaking) decompression from lossy compression. They don’t derive from the world.
More data won’t solve that problem. And like other generative AI systems, there is no way to encode (and guarantee) constraints like “be truthful” or “obey the laws of physics” or “don’t just invent (or eliminate) objects”.
Watch the video and reach your own conclusions.