Multi-modal foundation model (image-text)

  • Time: 2021–2022
  • Developed state-of-the-art multilingual ALIGN models trained with a triple contrastive loss.
  • Reproduced large-scale state-of-the-art vision-language models, including OpenAI’s CLIP and Google’s ALIGN; the reproduced ALIGN is published as a Hugging Face model card.
  • Reproduced PixelBERT and ViLT for visual question answering research.
  • Participated in a competition on extracting molecular structures from images embedded in documents.
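The triple contrastive loss mentioned above pairs each modality with the others using a CLIP-style symmetric InfoNCE objective. The sketch below is a minimal, hypothetical formulation (the exact loss used in the project is not specified here): it assumes three embedding batches — image, English text, and multilingual text — and averages the pairwise InfoNCE terms. Function names and the temperature value are illustrative, not taken from the original work.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matched pairs share a row index, so the positives lie on the
    diagonal of the similarity matrix.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) scaled similarity matrix
    idx = np.arange(len(a))

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average both retrieval directions (a->b and b->a), as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def triple_contrastive_loss(img, txt_en, txt_ml):
    """Hypothetical triple contrastive objective: mean of the three
    pairwise InfoNCE losses over image / English text / multilingual
    text embeddings."""
    return (info_nce(img, txt_en)
            + info_nce(img, txt_ml)
            + info_nce(txt_en, txt_ml)) / 3.0
```

A well-trained encoder pair drives the diagonal similarities up, so perfectly aligned embeddings yield a loss near zero, while unrelated embeddings give a loss near log N.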