

In the VTQA challenge, a model is expected to answer a question according to a given image-text pair. To answer VTQA questions, the proposed model needs to: (1) learn to identify the entities in the image and text that the question refers to, (2) align multimedia representations of the same entity, and (3) conduct multi-step reasoning between text and image and output an open-ended answer. The VTQA dataset consists of 10,124 image-text pairs and 23,781 questions. The images are real images from the MSCOCO dataset and contain a variety of entities. Annotators were required to first annotate text relevant to the image, then ask questions based on the image-text pair, and finally answer each question in open-ended form.

Information diversity, multimedia multi-step reasoning, and open-ended answers make our task more challenging than existing tasks. The aim of this challenge is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation.

Challenge Task

As illustrated in the figure, given an image-text pair and a question, a system is required to answer the question in natural language. Importantly, the system needs to: (1) analyze the question and find the key entities, (2) align the key entities between image and text, and (3) generate the answer according to the question and the aligned entities. For example, in Figure 1, the key entity of Q1 is “Elena”. From the text “gold hair”, we can determine that the second person from the right in the image is “Elena”. Finally, we answer “suit” based on the image information. Q2 is a more complex question, and the previous steps need to be repeated several times to answer it.
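The three steps above can be illustrated with a toy sketch. This is not the challenge's demo code or a real model: the `Region` class, the function names, the attribute dictionaries, the wording of the question, and the `asked_attribute` argument are all hypothetical stand-ins for what a vision model and a text encoder would produce. Only the "Elena", "gold hair", "second from the right", and "suit" details come from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical image region with attributes a detector might output."""
    position: str
    attributes: dict


def find_key_entity(question, entity_names):
    """Step 1: return the first known entity name mentioned in the question."""
    lowered = question.lower()
    for name in entity_names:
        if name.lower() in lowered:
            return name
    return None


def align_entity(entity, text_facts, regions):
    """Step 2: find the image region matching the text's description of the entity."""
    described = text_facts.get(entity)
    if not described:
        return None
    for region in regions:
        if all(region.attributes.get(k) == v for k, v in described.items()):
            return region
    return None


def answer_question(question, text_facts, regions, asked_attribute):
    """Step 3: read the asked attribute off the aligned region.

    asked_attribute is passed in by hand here (an assumption); a real
    system would have to infer it from the question.
    """
    entity = find_key_entity(question, text_facts.keys())
    if entity is None:
        return None
    region = align_entity(entity, text_facts, regions)
    if region is None:
        return None
    return region.attributes.get(asked_attribute)


# Toy inputs modeled on the Figure 1 example (attribute values are made up
# except for "gold" hair and "suit").
TEXT_FACTS = {"Elena": {"hair": "gold"}}
REGIONS = [
    Region("first from the right", {"hair": "black", "clothing": "dress"}),
    Region("second from the right", {"hair": "gold", "clothing": "suit"}),
]

if __name__ == "__main__":
    # Hypothetical wording for Q1; the original question text is not given.
    print(answer_question("What is Elena wearing?", TEXT_FACTS, REGIONS, "clothing"))
```

For a multi-step question such as Q2, the same find-align-read loop would be applied repeatedly, with the answer of one step supplying the entity for the next.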


26 June 2023

The announcement of the challenge results has been postponed to next Monday (3 July). During this period, model evaluation services will no longer be provided. New submissions and revisions of existing submissions will still be accepted until this Friday (30 June). The ranking will be based on the latest model submitted by each participant.

14 June 2023

The top-3 teams will be invited to submit their papers to ACM MM.

6 June 2023

To facilitate test set submissions, we will provide specific error information when an evaluation run fails and reset the number of submissions to 1.

(Note: we will not retroactively restore submission counts for weeks that have already passed, and this service no longer applies to a participant after their first successful submission.)

3 April 2023

Released the English version of the dataset and corrected some Chinese annotations. The demo code has also been updated.







