And The Oscar Goes to Sora
AI is coming to Life.
OpenAI teased its new video creation model - Sora - this week.
In doing so it released a technical report and several examples of prompts and outputs.
Careful not to overstate the end game, the company said:
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
All of the videos are incredible, albeit only a minute or less each. My favorite is the Dogs in Snow video:
Although the 'Closeup Man in Glasses' is also wonderful.
I mention this because the speed at which AI is addressing new fields is - in my opinion - mind-boggling. Skills that take humans decades to perfect are being learned in months and are capable of scaling to infinite outputs using words, code, images, video, and sound.
It will take the advancement of robotics to tie these capabilities to physical work, but that seems assured to happen.
When engineering, farming, transport, or production meet AI, human needs can be addressed directly.
Sora winning an Oscar for Cinematography, or for producing a film from a script or a book, seems far-fetched. But it wasn't so long ago that a tech company doing so would have been laughable, and now we have Netflix, Amazon Prime, and Apple TV Plus regularly being nominated for, or winning, awards.
Production will increasingly be able to leverage AI.
Some will say this is undermining human skills, but I think the opposite. It will release human skills. Take the prompt that produced the Dogs in Snow video:
Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in snow.
I can imagine that idea and write it down. But my skills would not allow me to produce it. Sora opens my imagination and enables me to act on it. I guess that many humans have creative ideas that they are unable to execute... up to now. Sora, DALL·E, and ChatGPT all focus on releasing human potential.
Google released its Gemini 1.5 model this week (less than a month after releasing Gemini Ultra 1.0). Tom's Guide has a summary and analysis by Ryan Morrison.
Gemini 1.5 Pro has a staggering 10-million-token context length. That is the amount of content it can hold in its memory for a single chat or response.
This is enough for hours of video or multiple books within a single conversation, and Google says it can find any piece of information within that window with a high level of accuracy.
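To make "multiple books" concrete, here is a back-of-envelope sketch of how much English text 10 million tokens might hold. The words-per-token ratio and novel length are rough assumptions for illustration, not figures from Google; actual tokenization varies by model and text.

```python
# Back-of-envelope: how much prose fits in a 10-million-token context window?
# Assumes ~0.75 English words per token (a common rule of thumb) and a
# typical novel length of ~90,000 words. Both are illustrative assumptions.
WORDS_PER_TOKEN = 0.75
CONTEXT_TOKENS = 10_000_000
AVG_NOVEL_WORDS = 90_000

words_capacity = CONTEXT_TOKENS * WORDS_PER_TOKEN   # about 7.5 million words
novels = words_capacity / AVG_NOVEL_WORDS           # about 83 novels

print(f"~{words_capacity:,.0f} words, or roughly {novels:.0f} novels")
```

Even with generous error bars on the assumptions, the window comfortably spans dozens of full-length books in a single conversation.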
Jeff Dean, Google DeepMind's Chief Scientist, wrote on X that the model also comes with advanced multimodal capabilities across code, text, image, audio, and video.
He wrote that this means you can "interact in sophisticated ways with entire books, very long document collections, codebases of hundreds of thousands of lines across hundreds of files, full movies, entire podcast series, and more."
In "needle-in-a-haystack" testing, where a specific piece of information is hidden somewhere in the vast amount of data stored in the context window, the model was able to retrieve it with 99.7% accuracy even with 10 million tokens of data.
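The shape of such a test is simple to sketch. The snippet below is an illustrative toy, not Google's actual methodology: it buries one distinctive fact (the "needle") at a random position in a long run of filler text, and the real evaluation would then ask the model to retrieve that fact from the assembled context.

```python
# Toy "needle-in-a-haystack" setup (illustrative only): hide one distinctive
# sentence inside a long stretch of repetitive filler, then verify the
# needle is present in the assembled context. A real test would prompt the
# model to retrieve the fact and score its answer.
import random

random.seed(0)  # reproducible placement of the needle

needle = "The magic number is 42177."
filler = ["The sky was clear and the day was long."] * 50_000

position = random.randrange(len(filler))
lines = filler[:position] + [needle] + filler[position:]
haystack = " ".join(lines)

found = needle in haystack
print(found)  # True
```

Scoring retrieval accuracy across many needle positions and context lengths is what produces a figure like the 99.7% reported above.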
All of this makes it easy to understand why Kate Clark at The Information penned a piece with the title "I Was Wrong. We Haven't Reached Peak AI Frenzy."
I will leave this week's editorial with Ryan Morrison's observation at the end of his article:
What we are seeing with these advanced multimodal models is the interaction of the digital and the real, where AI is gaining a deeper understanding of humanity and how WE see the world.
Essays of the Week