Tim Ossowski

Profile Pic

Hi! I'm a PhD student in the Computer Science Department at the University of Wisconsin-Madison, advised by Professor Junjie Hu. My research focuses on the intersection of vision and language. In particular, I am interested in learning high-quality multimodal representations and in multimodal retrieval.

I also like making things that look cool 😊

Research Publications

COMMA: A Communicative Multimodal Multi-Agent Benchmark (Under Review)

Publication Image

We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. By testing both agent-agent and agent-human collaborations using open-source and closed-source models, our findings reveal surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. These models struggle to outperform even a simple random agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.

Authors: Tim Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu
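
For intuition, here is a toy sketch of the kind of communicative loop the benchmark evaluates (not the benchmark's actual code): one agent can observe and act in the environment while the other holds the instructions, and the two may only exchange natural-language messages. All names and interfaces below are illustrative assumptions.

```python
def agent_agent_episode(solver, expert, env, max_turns=10):
    """Toy agent-agent collaboration loop. `solver` observes and acts in the
    environment; `expert` only reads messages and replies with guidance.
    All interfaces here are assumptions for illustration."""
    message = "Task started. What should I do first?"
    for _ in range(max_turns):
        reply, action = solver.respond(env.observe(), message)  # grounded agent acts
        env.step(action)
        if env.done():
            break
        message = expert.respond(reply)  # instruction-holding agent answers in natural language
    return env.score()
```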

Object Level In-Context Visual Embeddings (OLIVE) (ACL 2024)

We propose a lightweight object encoder that can be connected to existing LLMs to enable controllable object-level multimodal reasoning with free-form input annotations. Our model omits image patch features and summarizes object features into a single vector, significantly reducing context length for more efficient training and inference and allowing in-context examples from multiple images. We conduct extensive experiments on region retrieval of object-level features and showcase rapid adaptation to unseen visual concepts.

Authors: Tim Ossowski, Junjie Hu
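
A minimal sketch of the core idea (not the OLIVE code itself): each annotated region is summarized into a single vector and projected into the LLM's token embedding space, so it occupies one slot of context instead of many patch tokens. The dimensions and module names below are assumptions.

```python
import torch
import torch.nn as nn

class ToyObjectEncoder(nn.Module):
    """Summarize N region features into one vector and map it into the LLM
    embedding space. Dimensions here are illustrative assumptions."""

    def __init__(self, region_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(region_dim, llm_dim)

    def forward(self, region_feats):          # region_feats: (batch, N, region_dim)
        pooled = region_feats.mean(dim=1)     # one vector per object instead of N patch tokens
        return self.proj(pooled)              # (batch, llm_dim), spliced into the prompt as one token

obj_token = ToyObjectEncoder()(torch.randn(1, 9, 1024))   # -> shape (1, 4096)
```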

How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes (NAACL 2024)

Publication Image

In this work, we investigate the combination of multi-task learning (MTL) with in-context learning (ICL) to build models that efficiently learn tasks while being robust to out-of-distribution examples. Our findings suggest the existence of retrospective heads, within which each input token has a high attention score to the previous input token. Masking these heads results in dramatically decreased in-context capability, whereas masking other heads has little to no effect (shown above). We also propose several effective curriculum learning strategies that allow ICL models to achieve higher data efficiency and more stable convergence.

Authors: Harmon Bhasin, Tim Ossowski, Yiqiao Zhong, Junjie Hu
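
Below is a minimal sketch of how one might score and mask such heads, assuming access to per-head attention maps and per-head outputs; the threshold and tensor shapes are assumptions, not the paper's exact procedure.

```python
import torch

def previous_token_score(attn):
    """attn: (heads, seq, seq) attention weights from one layer.
    Returns each head's average attention from a token to its immediate
    predecessor, a crude score for 'retrospective' behavior."""
    _, seq, _ = attn.shape
    idx = torch.arange(1, seq)
    return attn[:, idx, idx - 1].mean(dim=-1)           # (heads,)

def mask_retrospective_heads(head_outputs, scores, threshold=0.5):
    """Zero the outputs of heads whose previous-token score exceeds the
    threshold. head_outputs: (heads, seq, head_dim)."""
    keep = (scores < threshold).to(head_outputs.dtype)[:, None, None]
    return head_outputs * keep
```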

Multimodal Prompt Retrieval for Generative Visual Question Answering (ACL Findings 2023)

Publication Image

We propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical Visual Question Answering (VQA) tasks show that MPR outperforms its non-retrieval counterpart by up to 30% accuracy points in a few-shot domain adaptation setting.

Authors: Tim Ossowski, Junjie Hu
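
A toy sketch of the retrieval step (not MPR itself): fuse the image and question embeddings, then score a bank of stored prompts by cosine similarity. The concatenation-based fusion and the shapes here are assumptions.

```python
import numpy as np

def retrieve_prompts(query_img, query_txt, bank_img, bank_txt, bank_prompts, k=3):
    """Return the k prompts whose fused (image, text) embeddings are most
    similar to the fused query embedding. Fusion by concatenation is an
    illustrative assumption."""
    query = np.concatenate([query_img, query_txt])             # (d_img + d_txt,)
    bank = np.concatenate([bank_img, bank_txt], axis=1)        # (n, d_img + d_txt)
    query = query / np.linalg.norm(query)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    top = np.argsort(-(bank @ query))[:k]                      # highest cosine similarity first
    return [bank_prompts[i] for i in top]
```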

Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment (EMNLP Findings 2022)

Publication Image

We develop a novel Unsupervised Word Translation (UWT) method dubbed Word Alignment using Language-Image Pretraining (WALIP), leveraging visual observations via the shared image-text embedding space of CLIP models. WALIP has a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for the alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, iteratively correcting and refining the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for a few language pairs across different word embeddings, and it remains robust when the two languages, or the corpora used to train their embeddings, are dissimilar.

Authors: Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Tim Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee
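
The second step builds on orthogonal Procrustes alignment; for intuition, here is a plain (non-robust) Procrustes sketch rather than WALIP's iterative refinement, with toy dimensions as assumptions.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: the rotation W minimizing ||XW - Y||_F,
    where rows of X and Y are embeddings of the pivot word pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check with 50 seed pairs of 8-d embeddings (dimensions are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
W_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = X @ W_true
W = procrustes(X, Y)
assert np.allclose(X @ W, Y, atol=1e-8)
```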

Personal Projects

Cubic Chunks Map Viewer

Half Dome Image

A program that renders any Cubic Chunks Minecraft world as an interactive isometric map. The picture above was generated from the Half Dome region of California using the Terra 1-to-1 mod. I am working on a version that can convert any 3D model into a map.
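
The core of such a renderer is projecting each block's (x, y, z) position to 2D screen coordinates; below is a minimal sketch of that isometric projection, with tile sizes as illustrative assumptions rather than the viewer's actual values.

```python
def voxel_to_screen(x, y, z, tile_w=24, tile_h=12, block_h=12):
    """Project Minecraft-style voxel coordinates (y is up) to isometric
    2D screen coordinates. Tile sizes are assumptions for illustration."""
    screen_x = (x - z) * (tile_w // 2)
    screen_y = (x + z) * (tile_h // 2) - y * block_h
    return screen_x, screen_y

# Drawing blocks back to front lets nearer blocks paint over farther ones
# (painter's algorithm), which is how the final isometric image is assembled.
```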

VoxelDex

Bulbasaur Image

Using the Cubic Chunks Map Viewer, I rendered a collection of all Generation 1 Pokémon. More details are on the project page.

Boids Simulation

Using three.js, I wrote a boids simulation of a school of fish with free camera movement.
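
For reference, one boids update step combines separation, alignment, and cohesion; here is a toy NumPy version of that rule (the simulation itself is in three.js, and the weights and radius below are assumptions).

```python
import numpy as np

def boids_step(pos, vel, dt=0.1, radius=2.0, w_sep=1.5, w_ali=1.0, w_coh=1.0):
    """One toy boids update over positions/velocities of shape (n, 3).
    Weights and neighbor radius are illustrative assumptions."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        offsets = pos - pos[i]
        dists = np.linalg.norm(offsets, axis=1)
        near = (dists > 0) & (dists < radius)
        if not near.any():
            continue
        sep = -(offsets[near] / dists[near, None] ** 2).sum(axis=0)  # push away from close neighbors
        ali = vel[near].mean(axis=0) - vel[i]                        # match neighbors' heading
        coh = pos[near].mean(axis=0) - pos[i]                        # drift toward neighbors' center
        new_vel[i] += dt * (w_sep * sep + w_ali * ali + w_coh * coh)
    return pos + dt * new_vel, new_vel
```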

Spirograph Generator

Spirograph Images (x = cos 3t, y = sin t)

Using the Python turtle library, I rendered animations by rotating parametric curves.
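
A minimal turtle sketch of the idea: trace one parametric curve (the x = cos 3t, y = sin t curve shown above), then redraw it rotated about the origin several times. The rotation count, step count, and scale are illustrative assumptions.

```python
import math
import turtle

def draw_rotated_curve(rotations=12, steps=400, scale=150):
    """Trace x = cos(3t), y = sin(t) and redraw it rotated about the origin.
    The rotation count, step count, and scale are illustrative assumptions."""
    pen = turtle.Turtle()
    pen.speed(0)
    for k in range(rotations):
        angle = 2 * math.pi * k / rotations
        pen.penup()
        for i in range(steps + 1):
            t = 2 * math.pi * i / steps
            x, y = math.cos(3 * t), math.sin(t)
            # rotate the point by `angle` around the origin
            rx = x * math.cos(angle) - y * math.sin(angle)
            ry = x * math.sin(angle) + y * math.cos(angle)
            pen.goto(scale * rx, scale * ry)
            pen.pendown()
    turtle.done()

draw_rotated_curve()
```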

Hobbies

Contact Me

If you have any questions or would like to connect, feel free to reach me at:

Email: ossowski@wisc.edu

LinkedIn: Tim Ossowski