
Top 6 Recent Updates that Will Transform the Course of AI

Google, Meta, OpenAI, and Adobe are among the top publishers of the week. 


The past two weeks have brought an exceptional flood of new AI updates. We have curated the top six frameworks and models released recently.

ActAnywhere: Subject-Aware Video Background Generation

Adobe Research and Stanford University have introduced ActAnywhere, a generative model addressing the challenge of aligning video backgrounds with foreground subject motion in filmmaking and visual effects. The model automates this typically labour-intensive process by leveraging large-scale video diffusion models.

It takes a sequence of foreground subject segmentations and a condition frame describing the desired scene as input, and produces a realistic video with coherent foreground-background interactions.

Trained on a large-scale dataset of human-scene interaction videos, ActAnywhere outperforms baselines and generalises to diverse out-of-distribution samples, including non-human subjects.
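In code, the interface might look like the following hypothetical sketch; the function name, argument names, and tensor shapes are all illustrative assumptions, not the released model's API.

```python
import torch

# Hypothetical interface for the inputs and output described above;
# all names and tensor shapes are illustrative assumptions.
def act_anywhere(
    subject_frames: torch.Tensor,   # (T, 3, H, W) foreground subject RGB
    subject_masks: torch.Tensor,    # (T, 1, H, W) per-frame segmentations
    condition_frame: torch.Tensor,  # (3, H, W) frame describing the scene
) -> torch.Tensor:                  # (T, 3, H, W) composited output video
    raise NotImplementedError       # stands in for the video diffusion model
```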

GALA

Meta has long worked to improve its avatars across platforms like Facebook, Instagram and WhatsApp. Now, Meta's Codec Avatars Lab has collaborated with Seoul National University to introduce GALA, a framework that converts a single-layer clothed 3D human mesh into fully-layered 3D assets, allowing the creation of diverse clothed human avatars in various poses.

Unlike existing methods that treat clothed humans as a single-layer geometry, GALA builds on the compositional nature of humans, with separable hairstyles, clothing, and accessories, which benefits downstream applications. Decomposing the mesh into separate layers is challenging due to occlusions, and even when decomposition succeeds, the resulting poses and body shapes are often unrealistic.

To overcome this, the researchers used a pre-trained 2D diffusion model as a prior for geometry and appearance. The process involves three steps: segmenting the input mesh using 3D surface segmentation lifted from multi-view 2D segmentations, synthesising the missing geometry in both posed and canonical spaces with a new pose-guided Score Distillation Sampling (SDS) loss, and applying the same SDS loss to texture for a complete appearance. The result is multiple layers of 3D assets in a shared canonical space, normalised for pose and body shape, which makes it easy to compose novel identities and poses.
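For readers new to SDS, the sketch below shows the core of a plain SDS update under stated assumptions: the pretrained denoiser `eps_pred(x_t, t, prompt)` and the differentiable `render(params)` are hypothetical stand-ins, and GALA's pose guidance is omitted.

```python
import torch

# Minimal sketch of a vanilla SDS step, assuming a pretrained denoiser
# `eps_pred(x_t, t, prompt)` and a differentiable `render(params)`.
# GALA's pose-guided SDS adds pose conditioning; that detail is omitted.
def sds_step(render, eps_pred, params, prompt, alphas_cumprod):
    x = render(params)                            # differentiable render
    t = torch.randint(20, 980, (1,))              # random diffusion timestep
    a = alphas_cumprod[t]
    noise = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * noise   # noise the render
    with torch.no_grad():
        eps = eps_pred(x_t, t, prompt)            # prior's noise estimate
    grad = (1 - a) * (eps - noise)                # SDS gradient direction
    x.backward(gradient=grad)                     # push render toward prior
```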

Lumiere

In an effort to address the challenge of creating realistic, diverse, and coherent motion in synthesised videos, Google, in partnership with the Weizmann Institute, Tel Aviv University and Technion, has come up with Lumiere, a text-to-video model. It is built on a Space-Time U-Net architecture that generates the entire video duration in a single pass, unlike existing models that first generate distant keyframes and then apply temporal super-resolution.

By combining spatial and temporal processing and leveraging a pre-trained text-to-image model, the system directly produces full-frame-rate, low-resolution videos. The model demonstrates state-of-the-art text-to-video results and is versatile across tasks such as image-to-video generation, video inpainting, and stylised generation.

However, it currently cannot handle videos with multiple shots or scene transitions, and further research is needed there. Despite these limitations, the project's focus is on empowering users to generate visual content creatively and flexibly.
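The space-time idea can be pictured with the toy block below, which downsamples a clip jointly in time and space and then upsamples it back, so the entire clip is processed in one pass. This is a minimal illustration of the principle, assuming simple 3D convolutions; it is not Lumiere's actual STUNet.

```python
import torch
import torch.nn as nn

# Toy illustration of joint space-time down/upsampling: 3D convolutions
# pool over (time, height, width) together, so one forward pass sees the
# whole clip. Not Lumiere's actual STUNet, just the underlying principle.
class SpaceTimeBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.down = nn.Conv3d(ch, ch * 2, kernel_size=3, stride=2, padding=1)
        self.up = nn.ConvTranspose3d(ch * 2, ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.up(self.down(x))

clip = torch.randn(1, 8, 16, 32, 32)  # 16 frames at 32x32
out = SpaceTimeBlock(8)(clip)         # same (1, 8, 16, 32, 32) shape back
```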

Meta-Prompting

In yet another interesting research paper, OpenAI and Stanford University teamed up to present meta-prompting, a scaffolding technique that enhances the performance of language models (LMs) in a task-agnostic manner by turning them into versatile conductors that manage multiple independent queries. Because it is task-agnostic, meta-prompting simplifies user interaction without requiring detailed, task-specific instructions.

Experiments with GPT-4 show the superiority of meta-prompting over traditional methods, achieving a 17.1% improvement over standard prompting, 17.3% over dynamic prompting, and 15.2% over multi-persona prompting across tasks like the Game of 24, Checkmate-in-One, and Python Programming Puzzles.

Using clear instructions, meta-prompting guides the LM to break down complex tasks into smaller subtasks, which are then handled by specialised instances of the same LM, each following tailored instructions. The LM acts as a conductor, ensuring smooth communication and effective integration of outputs, and leverages critical thinking and verification processes to refine the results. This collaborative prompting allows a single LM to act as both an orchestrator and a panel of experts, improving performance across various tasks.
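A minimal sketch of this conductor-and-experts loop follows, assuming a hypothetical `chat()` chat-completion helper; the prompt wording and control tokens are illustrative, not the paper's exact templates.

```python
# Minimal sketch of the conductor-and-experts loop, assuming a
# hypothetical chat-completion helper; prompts are illustrative.
def chat(messages: list[dict]) -> str:
    """Stand-in for a chat-completion API call (e.g. to GPT-4)."""
    raise NotImplementedError

def meta_prompt(task: str, max_rounds: int = 5) -> str:
    # The conductor keeps the full history; each expert starts fresh.
    history = [
        {"role": "system", "content": (
            "You are the conductor. Break the task into subtasks, "
            "delegate with 'EXPERT: <instructions>', verify the answers, "
            "and finish with 'FINAL: <answer>'.")},
        {"role": "user", "content": task},
    ]
    for _ in range(max_rounds):
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("EXPERT:"):
            # The expert is the *same* LM, prompted in isolation with
            # only the conductor's tailored instructions.
            answer = chat([{"role": "user",
                            "content": reply.removeprefix("EXPERT:").strip()}])
            history.append({"role": "user", "content": f"Expert: {answer}"})
    return "No final answer within the round budget."
```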

Self-Rewarding Language Models

In a recent research paper, Meta and NYU introduced self-rewarding language models, which do not rely on reward models derived from human preferences; such fixed reward models are bottlenecked by human performance and cannot improve during training. Instead, these models align themselves by evaluating and training on their own outputs, using the language model itself to generate rewards through LLM-as-a-Judge prompting.

The method involves iterative training, where the model generates its own preference-based instruction data by assigning rewards to its own outputs via LLM-as-a-Judge prompting. The results show that this training improves both the model's instruction-following ability and its reward-modelling quality across iterations.
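One iteration of this loop can be sketched as follows, assuming hypothetical callables `generate`, `judge_score` (the LLM-as-a-Judge prompt, returning a score for a response), and `preference_train`; all names are illustrative.

```python
# One self-rewarding iteration, with hypothetical callables passed in:
# `generate(model, prompt)`, `judge_score(model, prompt, response)`, and
# `preference_train(model, pairs)` for preference-based fine-tuning.
def self_reward_iteration(model, prompts, generate, judge_score,
                          preference_train, n_samples: int = 4):
    pairs = []
    for prompt in prompts:
        # The model samples several candidate responses ...
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        # ... then scores them itself via LLM-as-a-Judge prompting.
        scores = [judge_score(model, prompt, c) for c in candidates]
        best = candidates[scores.index(max(scores))]
        worst = candidates[scores.index(min(scores))]
        if max(scores) > min(scores):
            pairs.append((prompt, best, worst))  # (prompt, chosen, rejected)
    # Train the next iteration on self-generated preference pairs.
    return preference_train(model, pairs)
```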

Gaussian Adaptive Attention is All You Need

This study introduces the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM) and the Gaussian Adaptive Transformer (GAT) to improve model performance and contextual representation, especially on highly variable data. GAAM incorporates learnable mean and variance into its attention mechanism, structured within a multi-head framework. This setup allows GAAM to collectively model any probability distribution, enabling the importance of features to be adjusted dynamically.

The study also introduces the Importance Factor (IF) for enhanced model explainability. GAAM, a new probabilistic attention framework, and GAT are proposed to facilitate information aggregation across speech, text, and vision modalities, and they surpass state-of-the-art attention techniques by identifying key elements within the feature space.
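A single-head sketch of the idea appears below, assuming the mechanism as described: a learnable mean offset and variance score features through a Gaussian density, which then reweights them. The multi-head structure and the Importance Factor are omitted, and the exact formulation differs from the paper's.

```python
import torch
import torch.nn as nn

# Single-head sketch of Gaussian adaptive attention as described above:
# a learnable mean offset and variance score each feature with a Gaussian
# density, and the normalised scores reweight the sequence. The paper's
# multi-head structure and Importance Factor are omitted.
class GaussianAdaptiveAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mean_offset = nn.Parameter(torch.zeros(dim))  # learnable mean
        self.log_var = nn.Parameter(torch.zeros(dim))      # learnable variance

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        var = self.log_var.exp()
        centred = x - x.mean(dim=1, keepdim=True) - self.mean_offset
        score = torch.exp(-0.5 * centred.pow(2) / var)     # Gaussian density
        weights = score / (score.sum(dim=1, keepdim=True) + 1e-8)
        return x * weights                                 # reweighted features

out = GaussianAdaptiveAttention(dim=64)(torch.randn(2, 10, 64))
```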

The paper is a collaboration between the James Silberrad Brown Center for Artificial Intelligence, Carnegie Mellon University, Stanford University and Amazon.
