V-JEPA

Video Joint Embedding Predictive Architecture (V-JEPA) learns by processing unlabeled video: parts of each clip are masked out, and the model figures out what probably happened in that part of the screen during the few seconds it was blacked out.

Note that V-JEPA isn’t a generative model: rather than reconstructing the missing pixels, it makes its predictions in an abstract representation space, developing an internal conceptual model of the world. The Meta researchers say that V-JEPA, after pretraining using video masking, “excels at detecting and understanding highly detailed interactions between objects.”
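To make that objective concrete, here is a minimal PyTorch sketch of a JEPA-style masked-prediction loss. This is not Meta’s implementation; every name here (TinyVideoEncoder, jepa_style_loss, the patch and embedding sizes, the masking scheme) is a simplified assumption. The point it illustrates is the one above: the model is trained to predict the embeddings of the masked patches, never their pixels.

```python
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    """Toy encoder: maps flattened video patches to embeddings (hypothetical)."""
    def __init__(self, patch_dim, embed_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(patch_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        return self.proj(patches)

def jepa_style_loss(context_encoder, target_encoder, predictor, patches, mask):
    """Loss for one step: predict embeddings of masked patches.

    mask: bool tensor (batch, num_patches), True where a patch is hidden.
    The target encoder sees the full clip; its outputs at the masked
    positions are the regression targets. No pixels are reconstructed.
    """
    with torch.no_grad():                      # targets come from a frozen copy
        targets = target_encoder(patches)      # (batch, num_patches, embed_dim)

    visible = patches * (~mask).unsqueeze(-1)  # zero out the masked patches
    context = context_encoder(visible)
    predicted = predictor(context)             # guess embeddings at every position

    # L1 distance in embedding space, averaged over the masked positions only
    diff = (predicted - targets).abs().mean(dim=-1)  # (batch, num_patches)
    return (diff * mask).sum() / mask.sum().clamp(min=1)

# Toy usage: 8 clips, 16 space-time patches each, 3072-dim flattened patches.
patch_dim, embed_dim = 3072, 256
context_enc = TinyVideoEncoder(patch_dim, embed_dim)
target_enc = TinyVideoEncoder(patch_dim, embed_dim)
target_enc.load_state_dict(context_enc.state_dict())  # the real recipe keeps an EMA copy
predictor = nn.Linear(embed_dim, embed_dim)

patches = torch.randn(8, 16, patch_dim)
mask = torch.rand(8, 16) > 0.5                # hide roughly half the patches
loss = jepa_style_loss(context_enc, target_enc, predictor, patches, mask)
loss.backward()
```

Because the loss lives in embedding space, the model is free to ignore unpredictable pixel-level detail and spend its capacity on the object interactions the Meta researchers highlight.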

The research could have big implications for both Meta and the broader AI ecosystem.

Meta has talked before about a “world model” in the context of its work on augmented reality glasses. The glasses would use such a model as the brain of an AI assistant that would, among other things, anticipate what digital content to show the user to help them get things done and have more fun. The model would, out of the box, have an audio-visual understanding of the world outside the glasses, but could then learn very quickly about the unique features of a user’s world through the device’s cameras and microphones. 
