Researchers at Meta recently presented ‘An Introduction to Vision-Language Modeling’ to help people better understand the mechanics behind mapping vision to language. The paper covers how VLMs work, how to train them, and approaches to evaluating them.
VLMs are more effective than traditional methods such as CNN-based image captioning, RNN and LSTM networks, encoder-decoder models, and object detection techniques. These older methods often lack the advanced capabilities of newer VLMs, such as handling complex spatial relationships, integrating diverse data types, and scaling to more sophisticated tasks that require detailed contextual interpretation.
Although the work primarily focuses on mapping images to language, it also discusses extending VLMs to videos.
LLMs process and understand human language. Researchers are now applying similar techniques to images and videos; the resulting systems are called Vision-Language Models (VLMs).
VLMs can help you navigate places you’ve never been before just by processing visual information, or they can create pictures from a simple text description you provide.
However, language is made up of distinct words and phrases, which are easier for a computer to analyze. Vision, on the other hand, involves processing images or videos, which are much more complex because they contain more detailed information and aren’t made up of simple, separate parts like words.
Recently, the capability of LLMs has expanded from just processing text to also handling images. However, connecting language to vision is still a challenging area. For instance, many existing models have trouble understanding where things are located in an image or counting objects without needing a lot of extra work and additional information.
Many VLMs also lack an understanding of attributes and ordering. They often ignore parts of the input prompt, leading to prompt engineering efforts to produce the desired result. Some can also hallucinate and produce content that is not relevant. Therefore, the development of reliable models remains an active field of research.
The researchers discuss the integration of computer vision and natural language processing through advanced transformer-based techniques and describe four main training strategies for VLMs.
The first method, contrastive training, uses both positive and negative examples, training the model to produce similar representations for matching image-text pairs and dissimilar representations for mismatched pairs.
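For intuition, here is a minimal PyTorch-style sketch of a CLIP-like contrastive objective, assuming batched image and text embeddings; the function name and temperature value are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so that the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarity matrix: diagonal entries are the positive pairs,
    # off-diagonal entries act as in-batch negatives.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy pulls matching pairs together and pushes
    # mismatched pairs apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```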
Another method, Masking, conceals parts of an image or words in a text and trains the model to predict the missing pieces. A third strategy utilizes pre-trained components such as existing language models and image encoders, which reduces the computational cost compared to training from scratch.
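A rough sketch of the masking idea, assuming the input has already been turned into token IDs (words or image-patch tokens) and that a hypothetical `model` returns per-position vocabulary logits; `mask_token_id` and the masking probability are placeholder choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(model, tokens, mask_token_id, mask_prob=0.15):
    # Randomly select positions to hide.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = mask_token_id
    # The model predicts a distribution over the vocabulary at every position.
    logits = model(corrupted)              # shape: [batch, seq_len, vocab]
    # The loss is computed only at the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```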
Lastly, generative training allows models to create new images or captions but is often the most expensive approach to train. These strategies are often combined in practice, providing a comprehensive approach to developing VLMs.
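As a simplified illustration of generative training on captions, the sketch below uses teacher forcing with a hypothetical `model(image, prefix)` that returns next-token logits; the interface is assumed for illustration only.

```python
import torch.nn.functional as F

def captioning_loss(model, image, caption_tokens):
    # Teacher forcing: the model sees the image plus the caption prefix
    # and predicts the next token at every position.
    logits = model(image, caption_tokens[:, :-1])   # [batch, len - 1, vocab]
    targets = caption_tokens[:, 1:]                 # targets shifted by one
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```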
Further, the research describes three data-pruning methods for VLMs: heuristic methods that remove low-quality image-text pairs; bootstrapping approaches that use pretrained VLMs to evaluate and discard image-text pairs with poor multimodal alignment; and strategies designed to produce diverse and balanced datasets.
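As a rough illustration of the bootstrapping idea, the snippet below keeps only pairs whose alignment score from a pretrained VLM clears a threshold; `score_fn` and the threshold value are hypothetical stand-ins, not an API from the paper.

```python
def prune_pairs(pairs, score_fn, threshold=0.25):
    # score_fn is a hypothetical callable returning an image-text alignment
    # score from a pretrained VLM (e.g. a CLIP-style cosine similarity).
    return [(image, caption) for image, caption in pairs
            if score_fn(image, caption) >= threshold]
```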
The paper also focuses on different methods to evaluate VLMs. Visual Question Answering (VQA) is a widely used technique, although its reliance on exact string matching for comparing model outputs with ground truth answers might not fully capture the model’s performance.
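The exact-match limitation is easy to see in a toy scorer like the one below, where answers such as “two” and “2” would count as a mismatch even though both are correct; this is an illustrative sketch, not the official VQA metric.

```python
def exact_match_accuracy(predictions, ground_truths):
    # Count a prediction as correct only if it matches the ground-truth
    # string exactly (after trivial normalization).
    def normalize(text):
        return text.strip().lower()
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)
```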
Another approach involves reasoning tasks where VLMs choose the most likely caption from a list. Additionally, recent methods include dense human annotations to determine how accurately a model links captions to the appropriate sections of an image.
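A minimal sketch of such a caption-selection evaluation, assuming a hypothetical scoring function exposed by the VLM that rates how well a caption fits an image:

```python
def select_caption(score_fn, image, candidate_captions):
    # score_fn(image, caption) is a hypothetical image-caption score; the
    # model's choice is simply the highest-scoring candidate.
    scores = [score_fn(image, caption) for caption in candidate_captions]
    return candidate_captions[scores.index(max(scores))]
```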
Lastly, one can use synthetic data to create images in various scenarios to test a VLM’s adaptability to specific changes.
Finally, the researchers conclude that mapping vision to language remains a vibrant field of research, with training methods for VLMs ranging from contrastive to generative approaches. However, the high computational and data costs often pose challenges, leading many researchers to utilize pre-trained language models or image encoders to facilitate learning across different modalities.
Large-scale, high-quality images and captions are key to enhancing model performance. Moreover, improving model grounding and alignment with human preferences are much-needed steps toward more reliable models.
While several benchmarks exist to evaluate the vision-linguistic and reasoning capabilities of VLMs, they often have limitations, such as reliance on language priors. Beyond images, video is another critical modality for developing representations, though many challenges remain in effectively leveraging video data. The ongoing research in VLMs continues to address these gaps to improve model reliability and effectiveness.