Hi! I'm a PhD student in the Computer Science Department at the University of Wisconsin-Madison, advised by Professor Junjie Hu. My research focuses on the intersection of vision and language. In particular, I am interested in learning high-quality multimodal representations and in multimodal retrieval.
We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. By testing both agent-agent and agent-human collaborations using open-source and closed-source models, our findings reveal surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. These models struggle to outperform even a simple random agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.
Authors: Tim Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu
We propose a lightweight object encoder that can be connected to existing LLMs to enable controllable object-level multimodal reasoning with free-form input annotations. Our model omits image patch features and summarizes each object's features into a single vector, significantly reducing context length for more efficient training and inference and allowing in-context examples from multiple images. We conduct extensive experiments on region retrieval of object-level features and showcase rapid adaptation to unseen visual concepts.
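To make the idea concrete, here is a minimal sketch of how an object's region features could be pooled into a single vector in the LLM's embedding space; the module, layer choices, and dimensions are illustrative assumptions rather than the exact implementation.

```python
# Illustrative sketch (assumed names and dimensions): summarize an object's region
# features into one vector in the LLM's embedding space, so each object costs a
# single token of context.
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)   # collapse region features to one vector
        self.proj = nn.Sequential(            # map into the LLM token-embedding space
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_objects, num_region_tokens, vis_dim)
        pooled = self.pool(region_feats.transpose(1, 2)).squeeze(-1)  # (num_objects, vis_dim)
        return self.proj(pooled)                                      # (num_objects, llm_dim)

# Each row of the output can be interleaved with text embeddings, keeping context
# length at one token per annotated object.
obj_tokens = ObjectEncoder()(torch.randn(3, 16, 1024))  # 3 objects -> 3 object tokens
```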
In this work, we investigate the combination of multi-task learning (MTL) with in-context learning (ICL) to build models that efficiently learn tasks while being robust to out-of-distribution examples. Our findings suggest the existence of retrospective heads, within which each input token has a high attention score to the previous input token. Masking these heads results in dramatically decreased in-context learning capability, whereas masking other heads has little to no effect (shown above). We also propose several effective curriculum learning strategies that allow ICL models to achieve higher data efficiency and more stable convergence.
Authors: Harmon Bhasin, Tim Ossowski, Yiqiao Zhong, Junjie Hu
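As a rough illustration of how such heads can be identified, the sketch below scores each attention head by how much mass it places on the immediately preceding token; the shapes and threshold are assumptions for illustration, not the exact procedure.

```python
# Illustrative sketch (assumed shapes and threshold): flag "retrospective" heads,
# i.e., heads whose tokens attend heavily to the immediately preceding token.
import torch

def retrospective_scores(attn: torch.Tensor) -> torch.Tensor:
    # attn: (n_heads, seq_len, seq_len) attention probabilities for one layer
    prev_token_mass = attn.diagonal(offset=-1, dim1=-2, dim2=-1)  # attention of token i to token i-1
    return prev_token_mass.mean(dim=-1)                           # mean score per head

def heads_to_mask(attn: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # indices of heads whose previous-token attention exceeds the threshold
    return (retrospective_scores(attn) > threshold).nonzero(as_tuple=True)[0]
```

Masking would then zero out these heads' outputs (e.g., with a forward hook) before re-evaluating in-context learning accuracy.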
We propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical Visual Question Answering (VQA) tasks show that MPR outperforms its non-retrieval counterpart by up to 30% accuracy points in a few-shot domain adaptation setting.
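The retrieval step itself is simple; below is a minimal sketch of retrieving the most similar support examples by cosine similarity and prepending them as prompts. The function names and prompt format are placeholders, not the exact pipeline.

```python
# Illustrative sketch (placeholder names): retrieve the k most similar support
# examples and prepend them as prompts before generating a free-text answer.
import numpy as np

def retrieve_prompts(query_emb: np.ndarray, support_embs: np.ndarray,
                     support_texts: list[str], k: int = 3) -> list[str]:
    # cosine similarity between the query embedding and every support example
    sims = support_embs @ query_emb / (
        np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [support_texts[i] for i in top]

# The retrieved question-answer strings are concatenated in front of the current
# question (alongside the image features) before the decoder generates the answer.
```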
We develop a novel Unsupervised Word Translation (UWT) method dubbed Word Alignment using Language-Image Pretraining (WALIP), leveraging visual observations via the shared image-text embedding space of CLIP models. WALIP has a two-step procedure. First, we retrieve word pairs with high confidence of similarity, computed using our proposed image-based fingerprints, which define the initial pivot for the alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, iteratively correcting and refining the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for several language pairs across different word embeddings and displays great robustness to dissimilarity between language pairs or between the training corpora of the two word embeddings.
Authors: Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Tim Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee
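The second step builds on the classical orthogonal Procrustes solution; the sketch below shows that standard solution together with an illustrative retrieve-and-refit loop, not our robust variant.

```python
# Illustrative sketch: standard orthogonal Procrustes alignment of two embedding
# spaces, refined by alternating nearest-neighbor retrieval and refitting.
# (This is the textbook solution, not the robust variant described above.)
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # X, Y: (n_pairs, dim) embeddings of currently aligned word pairs;
    # returns the orthogonal W minimizing ||X W - Y||_F.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def refine(X_all: np.ndarray, Y_all: np.ndarray, pairs: np.ndarray, n_iters: int = 5):
    # pairs: (n_pairs, 2) indices of the initial pivot word pairs
    W = procrustes(X_all[pairs[:, 0]], Y_all[pairs[:, 1]])
    for _ in range(n_iters):
        # re-retrieve nearest neighbors under the current mapping, then refit
        sims = (X_all @ W) @ Y_all.T
        pairs = np.stack([np.arange(len(X_all)), sims.argmax(axis=1)], axis=1)
        W = procrustes(X_all[pairs[:, 0]], Y_all[pairs[:, 1]])
    return W
```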
A program that renders any Cubic Chunks Minecraft world as an interactive isometric map. The picture above was generated from the Half Dome region in California, using the Terra 1-to-1 mod. I am working on a version that converts any 3D model into a map.
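For reference, a standard isometric projection maps a block's world coordinates to screen coordinates roughly as in the sketch below; the tile and block pixel sizes here are assumptions, not the values the renderer actually uses.

```python
# Illustrative sketch (assumed tile/block pixel sizes): project a block's world
# coordinates (x, y, z) to 2D screen coordinates for an isometric map.
TILE_W, TILE_H, BLOCK_H = 16, 8, 8

def block_to_screen(x: int, y: int, z: int) -> tuple[int, int]:
    screen_x = (x - z) * (TILE_W // 2)
    screen_y = (x + z) * (TILE_H // 2) - y * BLOCK_H  # higher blocks draw further up
    return screen_x, screen_y

# Drawing blocks back-to-front (increasing x + z, then increasing y) keeps nearer
# blocks painted over farther ones.
```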