Multi-modal foundation model (image-text)

  • Time: 2021–2022
  • Developed state-of-the-art multilingual ALIGN models trained with a triple contrastive loss.
  • Reproduced large-scale state-of-the-art vision-language models, including OpenAI’s CLIP and Google’s ALIGN; the reproduced ALIGN is published as a Hugging Face model card.
  • Reproduced PixelBERT and ViLT for visual question answering research.
  • Participated in a competition on extracting molecular structures from images embedded in documents.
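The triple contrastive loss mentioned above pairs each modality with the others using a CLIP-style symmetric InfoNCE objective. The sketch below is a minimal, hypothetical formulation (the exact loss used in the project is not specified here): it assumes three embedding batches — image, English text, and multilingual text — and averages the pairwise InfoNCE terms. Function names and the temperature value are illustrative, not taken from the original work.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matched pairs share a row index, so the positives lie on the
    diagonal of the similarity matrix.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) scaled similarity matrix
    idx = np.arange(len(a))

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average both retrieval directions (a->b and b->a), as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def triple_contrastive_loss(img, txt_en, txt_ml):
    """Hypothetical triple contrastive objective: mean of the three
    pairwise InfoNCE losses over image / English text / multilingual
    text embeddings."""
    return (info_nce(img, txt_en)
            + info_nce(img, txt_ml)
            + info_nce(txt_en, txt_ml)) / 3.0
```

A well-trained encoder pair drives the diagonal similarities up, so perfectly aligned embeddings yield a loss near zero, while unrelated embeddings give a loss near log N.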