publications
* denotes equal contribution
An up-to-date list is available on Google Scholar and Semantic Scholar.
2026
-
BRIDGE: Predicting Human Task Completion Time From Model Performance
Fengyuan Liu*, Jay Gala*, Nilaksh, Dzmitry Bahdanau, Siva Reddy, and Hugo Larochelle
arXiv preprint, 2026
Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR’s exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
2025
-
LLMs Can Compensate for Deficiencies in Visual Representations
Sho Takishita*, Jay Gala*, Abdelrahman Mohamed, Kentaro Inui, and Yova Kementchedjhieva
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.
-
MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Veysel Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Suppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal A Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Mariya Hendriksen, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri K, Maksimova Anna, Silvan Wehrli, Maria Tikhonova, Henil Shalin Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Validad Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, and Niklas Muennighoff
In The Thirteenth International Conference on Learning Representations, 2025
Text embeddings are typically evaluated on a narrow set of tasks, limited in terms of languages, domains, and task types. To circumvent this limitation and to provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) – a large-scale community-driven initiative expanding MTEB to over 500 quality-controlled evaluation tasks across 1,000+ languages. MMTEB includes a wide range of challenging novel tasks such as instruction following, long-document retrieval, and code retrieval, and represents the largest multilingual collection of evaluation tasks for embedding models to date. We use this collection to construct multiple highly multilingual benchmarks. We evaluate a representative set of models on these benchmarks. Our findings indicate that, while LLM-based models can achieve state-of-the-art performance on a subset of languages, the best-performing publicly available model across languages is the notably smaller multilingual-e5-large-instruct. Massive benchmarks often impose high computational demands, limiting accessibility, particularly for low-resource communities. To address this, we downsample tasks based on inter-task correlation (i.e., selecting only a diverse set of tasks) while preserving relative rankings. We further optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks at a significantly lower computational cost. For instance, we introduce a new zero-shot English benchmark that maintains a similar ordering at a fraction of the cost.
-
SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models
Margaret Mitchell, Giuseppe Attanasio, Ioana Baldini, Miruna Clinciu, Jordan Clive, Pieter Delobelle, Manan Dey, Sil Hamilton, Timm Dill, Jad Doughman, Ritam Dutt, Avijit Ghosh, Jessica Zosa Forde, Carolin Holtermann, Lucie-Aimée Kaffee, Tanmay Laud, Anne Lauscher, Roberto L Lopez-Davila, Maraim Masoud, Nikita Nangia, Anaelia Ovalle, Giada Pistilli, Dragomir Radev, Beatrice Savoldi, Vipul Raheja, Jeremy Qin, Esther Ploeger, Arjun Subramonian, Kaustubh Dhole, Kaiser Sun, Amirbek Djanibekov, Jonibek Mansurov, Kayo Yin, Emilio Villa Cueva, Sagnik Mukherjee, Jerry Huang, Xudong Shen, Jay Gala, Hamdan Al-Ali, Tair Djanibekov, Nurdaulet Mukhituly, Shangrui Nie, Shanya Sharma, Karolina Stanczak, Eliza Szczechla, Tiago Timponi Torrent, Deepak Tunuguntla, Marcelo Viridiano, Oskar Wal, Adina Yakefu, Aurélie Névéol, Mike Zhang, Sydney Zink, and Zeerak Talat
In Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025
Large Language Models (LLMs) reproduce and exacerbate the social biases present in their training data, and resources to quantify this issue are limited. While research has attempted to identify and mitigate such biases, most efforts have been concentrated around English, lagging the rapid advancement of LLMs in multilingual settings. In this paper, we introduce a new multilingual parallel dataset SHADES to help address this issue, designed for examining culturally-specific stereotypes that may be learned by LLMs. The dataset includes stereotypes from 20 regions around the world and 16 languages, spanning multiple identity categories subject to discrimination worldwide. We demonstrate its utility in a series of exploratory evaluations for both “base” and “instruction-tuned” language models. Our results suggest that stereotypes are consistently reflected across models and languages, with some languages and models indicating much stronger stereotype biases than others.
2024
-
Leverage Class-Specific Accuracy to Guide Data Generation for Improving Image Classification
Jay Gala, and Pengtao Xie
In Forty-first International Conference on Machine Learning, 2024
In many image classification applications, the number of labeled training images is limited, which leads to model overfitting. To mitigate the lack of training data, deep generative models have been leveraged to generate synthetic training data. However, existing methods generate data for individual classes based on how much training data they have without considering their actual data needs. To address this limitation, we propose needs-aware image generation, which automatically identifies the different data needs of individual classes based on their classification performance and divides a limited data generation budget into these classes according to their needs. We propose a multi-level optimization based framework which performs four learning stages in an end-to-end manner. Experiments on both imbalanced and balanced classification datasets demonstrate the effectiveness of our proposed method.
-
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning in Machine Translation
Everlyn Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce Bassett, and Sara Hooker
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
Neural Machine Translation models are extremely data- and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove low-value data points has the benefit of drastically reducing the compute budget without a significant drop in model performance. In this paper, we propose a new data pruning technique, Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques, including COMET-QE, LASER, and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
-
An Empirical Study of In-context Learning in LLMs for Machine Translation
Pranjal A. Chitale*, Jay Gala*, and Raj Dabre
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies primarily focus on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence said quality. To this end, we perform the first exhaustive study of its kind on in-context learning for machine translation. We first establish that ICL is primarily example-driven and not instruction-driven. Following this, we conduct an extensive exploration of various aspects of the examples to understand their influence on downstream performance. Our analysis includes factors such as quality and quantity of demonstrations, spatial proximity, and source versus target originality. Further, we also investigate challenging scenarios involving indirectness and misalignment of examples to understand the limits of ICL. While we establish the significance of the quality of the target distribution over the source distribution of demonstrations, we further observe that perturbations sometimes act as regularizers, resulting in performance improvements. Surprisingly, ICL does not necessitate examples from the same task, and a related task with the same target distribution proves sufficient. We hope that our study acts as a guiding resource for considerations in utilizing ICL for MT.
-
RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization
Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, and Anoop Kunchukuttan
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
🏆 Senior Area Chair Award
This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches, if not outperforms, native script representation across various NLU, NLG, and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP research.
-
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D’Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, and Alham Fikri Aji
In Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason about knowledge present in both visual and textual data. However, most current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered by VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or other approaches, they usually keep the images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
-
Airavata: Introducing Hindi Instruction-tuned LLM
Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar M, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, and Anoop Kunchukuttan
arXiv preprint, 2024
We announce the initial release of Airavata, an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi on diverse Hindi instruction-tuning datasets to make it better suited for assistive tasks. Along with the model, we also share the IndicInstruct dataset, which is a collection of diverse instruction-tuning datasets to enable further research for Indic LLMs. Additionally, we present evaluation benchmarks and a framework for assessing LLM performance across tasks in Hindi. Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages. You can access all artifacts at https://ai4bharat.github.io/airavata.
-
On the low-shot transferability of [V]-Mamba
Diganta Misra*, Jay Gala*, and Antonio Orvieto
arXiv preprint, 2024
The strength of modern large-scale neural networks lies in their ability to efficiently adapt to new tasks with few examples. Although extensive research has investigated the transferability of Vision Transformers (ViTs) to various downstream tasks under diverse constraints, this study shifts focus to explore the transfer learning potential of [V]-Mamba. We compare its performance with ViTs across different few-shot data budgets and efficient transfer methods. Our analysis yields three key insights into [V]-Mamba’s few-shot transfer performance: (a) [V]-Mamba demonstrates superior or equivalent few-shot learning capabilities compared to ViTs when utilizing linear probing (LP) for transfer, (b) Conversely, [V]-Mamba exhibits weaker or similar few-shot learning performance compared to ViTs when employing visual prompting (VP) as the transfer method, and (c) We observe a weak positive correlation between the performance gap in transfer via LP and VP and the scale of the [V]-Mamba model. This preliminary analysis lays the foundation for more comprehensive studies aimed at furthering our understanding of the capabilities of [V]-Mamba variants and their distinctions from ViTs.
2023
-
NICT-AI4B’s Submission to the Indic MT Shared Task in WMT 2023
Raj Dabre, Jay Gala, and Pranjal Chitale
In Proceedings of the Eighth Conference on Machine Translation, 2023
In this paper, we (Team NICT-AI4B) describe the MT systems that we submitted to the Indic MT task in WMT 2023. Our primary system consists of 3 stages: joint denoising and MT training using officially approved monolingual and parallel corpora, backtranslation, and MT training on original and backtranslated parallel corpora. We observe that backtranslation leads to substantial improvements in translation quality of up to 4 BLEU points. We also develop 2 contrastive systems in unconstrained settings, where the first system involves fine-tuning of IndicTrans2 DA models on official parallel corpora and seed data used in AI4Bharat et al. (2023), and the second system involves a system combination of the primary and the aforementioned system. Overall, we manage to obtain high-quality translation systems for the 4 low-resource North-East Indian languages of focus.
-
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
Jay Gala*, Pranjal A. Chitale*, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, and Anoop Kunchukuttan
Transactions on Machine Learning Research, 2023
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/indictrans2.
-
A Federated Approach for Hate Speech Detection
Jay Gala*, Deep Gandhi*, Jash Mehta*, and Zeerak Talat
In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
Hate speech detection has been the subject of high research attention, due to the scale of content created on social media. In spite of the attention and the sensitive nature of the task, privacy preservation in hate speech detection has remained under-studied. The majority of research has focused on centralised machine learning infrastructures which risk leaking data. In this paper, we show that using federated machine learning can help address the privacy concerns that are inherent to hate speech detection while obtaining up to 6.81% improvement in terms of F1-score.
2022
-
Expanding Access to ML Research through Student-led Collaboratives
Deep Gandhi, Raghav Jain, Jay Gala, Jhagrut Lalwani, and Swapneel S Mehta
In Workshop on Broadening Research Collaborations (NeurIPS), 2022
We present a model of a student-led community of researchers to highlight the impact of pursuing collaborative machine learning research on the group’s members individually as well as towards achieving shared goals. We provide concrete examples of the guiding principles that led to the evolution of the collaborative from a reading group into a research group and eventually launching a non-profit software product to help non-technical stakeholders leverage artificial intelligence (AI), improving access to advanced technologies, and promoting open science. Our goal is to lay out a template to launch similar small-scale collaborative organisations at different institutes around the world.
-
Combating COVID-19 using object detection techniques for next-generation autonomous systems
Hrishikesh Shenai*, Jay Gala*, Kaustubh Kekre*, Pranjal Chitale*, and Ruhina Karani
In Cyber-Physical Systems: AI and COVID-19 (Chapter 4), 2022
COVID-19 has become a global crisis. During such a time of adversity, it has become difficult to create safe working conditions for people, resulting in a lack of workforce for performing a multitude of tasks. As a result, there is a requirement for “next-gen” autonomous systems to perform various tasks. One of the reasons why humans are efficient at many tasks is the ability to detect and distinguish between the various objects around them and then proceed with the intended task. Object detection methods are designed to replicate this human behavior and can be used in various applications that serve to aid in the COVID-19 crisis. Object detection deals with identifying objects belonging to a predefined class in an image. This chapter explains the most commonly used object detection methods, such as region-based convolutional neural networks and You Only Look Once (YOLO), along with a few of their applications.
2021
-
Improving Image-Based Dialog by Reducing Modality Biases
Jay Gala, Hrishikesh Shenai, Pranjal Chitale, Kaustubh Kekre, and Pratik Kanani
In 5th International Conference on Advances in Computing and Data Sciences, 2021
Machines cannot outperform human intelligence yet; however, an image-based dialog can enable machines to perceive cues from different modalities and process information in a more human-like manner. The proposed solution is an AI-based agent that can have engaging conversations with humans by considering an image and answering questions about its visual content, taking into account both visual and textual context. Deep learning-based techniques like Recurrent Neural Networks (RNNs) with self-attention mechanisms have been employed. Responses generated by such models, in some cases, are more biased towards dialog history and are not very relevant to the actual question asked. The proposed work focuses on reducing the modality biases without compromising dialog history and improving the visual context through the use of dense captions to describe various entities in the image and generate relevant answers.
2020
-
Pothole Detection and Dimension Estimation System using Deep Learning (YOLO) and Image Processing
Pranjal Chitale, Kaustubh Kekre, Hrishikesh Shenai, Ruhina Karani, and Jay Gala
In 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), 2020
The world is advancing towards an autonomous environment at a great pace, and it has become the need of the hour, especially during the current pandemic situation. The pandemic has hindered the functioning of many sectors, one of them being road development and maintenance. Creating a safe working environment for workers is a major concern of road maintenance during such difficult times. This can be achieved to some extent with the help of an autonomous system that aims at reducing human dependency. In this paper, one such system, for pothole detection and dimension estimation, is proposed. The proposed system uses a deep learning based algorithm, YOLO (You Only Look Once), for pothole detection. Further, an image processing based triangular similarity measure is used for pothole dimension estimation. The proposed system provides reasonably accurate results for both pothole detection and dimension estimation. The proposed system also helps in reducing the time required for road maintenance. The system uses a custom-made dataset consisting of images of water-logged and dry potholes of various shapes and sizes.
-
IoT and ML based Smart System for Efficient Garbage Monitoring: Real Time AQI monitoring and Fire Detection for dump yards and Garbage Management System
Dev Savla, Amogh Parab, Kaustubh Kekre, Jay Gala, and Meera Narvekar
In 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020
There are always significant challenges associated with waste and its disposal, which can be substantially mitigated by the use of technology. As the urban population increases, the amount of waste disposed is also increasing at an unprecedented rate. The inappropriate disposal of this waste leads to many hazards, including the risk of fires in dump yards, which release poisonous smoke into the atmosphere and adversely affect the safety of nearby residential areas. Monitoring the occurrence of fire in huge dumping grounds manually is a tough task, and thus developing an automatic fire extinguishing system is essential. Advanced technologies can be leveraged to ensure the protection and safety of people by eliminating such hazardous risks. The air quality index (AQI) is an indicator in the daily air quality report that shows how air quality affects a person’s life within a very short time. AQI plays a key role in ensuring the safety of residential areas. The proposed system aims to mitigate the possible hazards associated with dump yards and waste management.
-
Virtual Farmer: Real Time Crop Prediction and Automatic Irrigation System
Dev Savla, Amogh Parab, Kaustubh Kekre, Jay Gala, S Ramchandra, and Pankaj Sonawane
In 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020
There is a plethora of problems associated with agriculture and farming that require considerable improvement but remain untouched by technology. Farming is considered one of the strong pillars of any economy. Despite this, very few technologies exist to aid farmers in selecting the right crops depending on the environmental factors. Moreover, most irrigation systems around the world require at least some form of human intervention. Considering this, the proposed smart farming solution aims to aid farmers by means of technology so as to increase their yield, by suggesting the crops that will be most profitable for them as well as automating irrigation. This in turn will be a major help to the agricultural community as a whole and also free farmers from certain rudimentary tasks.