In recent years, Microsoft has steadily advanced its image and audio recognition systems. Although this does not affect most of us directly, since the technology is not used in consumer products, the work is impressive. The other day we talked about how its audio recognition had surpassed the human ear, and now its image recognition has surpassed the human eye.
Vision Language (VL) systems allow you to find images relevant to a text query (or vice versa) and describe the content of an image using natural language. Typically, a VL system uses an image encoding module and a vision language fusion module. Microsoft Research recently developed a new object attribute detection model for image encoding called VinVL (Visual Features in Vision-Language).
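To make the retrieval idea concrete, here is a minimal sketch of finding the images most relevant to a text query by comparing embedding vectors. It is a generic illustration and not VinVL's or OSCAR's actual method: the file names and three-dimensional vectors are made-up toy values, and the functions `cosine_similarity` and `rank_images` are hypothetical helpers. In a real system the vectors would come from trained image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings):
    """Sort image names from most to least similar to the query embedding."""
    scored = [
        (name, cosine_similarity(query_embedding, vector))
        for name, vector in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy values (assumptions): stand-ins for the output of an image encoder.
IMAGE_EMBEDDINGS = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.0],
    "city_at_night.jpg": [0.0, 0.2, 0.9],
    "cat_on_sofa.jpg": [0.6, 0.7, 0.1],
}

# Toy value (assumption): stand-in for the encoded text query "a dog".
QUERY_EMBEDDING = [1.0, 0.0, 0.0]

ranking = rank_images(QUERY_EMBEDDING, IMAGE_EMBEDDINGS)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Searching in the other direction (from an image to matching text) would use the same ranking with the roles of the query and the candidates swapped.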
Microsoft’s object recognition surpasses the human eye
The new record was achieved with VinVL, combined with VL fusion models such as OSCAR and VIVO. As a result, Microsoft’s new VL system reached first place on the most competitive VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps). The Microsoft research team also pointed out that the new VL system significantly surpasses human performance on the nocaps leaderboard in terms of CIDEr (92.5 vs. 85.3).
VinVL has shown great potential for improving image encoding for VL understanding, and the new image encoding model can benefit a wide range of VL tasks. Despite the promising results, the model by no means reaches human-level intelligence in VL understanding.
Microsoft is looking at two directions for future work. The first is to further scale up the pre-training for object attribute detection by exploiting massive image classification and tagging data. The second is to extend the methods of cross-modal VL representation learning to build perception-grounded language models that can anchor visual concepts in natural language, and vice versa, as humans do.
Microsoft VinVL is being integrated into Azure Cognitive Services, which powers various Microsoft services. The Microsoft Research team will also release the VinVL model and source code to the public.