TOWARD UNIFIED EXPERTISE: ONE MODEL FOR ALL TASKS

Abstract
Understanding the real visual world involves processing diverse forms of perception and learning the intrinsic connections among them. Humans adapt and respond appropriately to many kinds of visual stimuli, whether glimpsing the 3D world, viewing a 2D black-and-white image, or watching a blurry video clip. Visual recognition systems, in contrast, often struggle to learn from multiple sources. One challenge is gradient conflict, where gradients from different tasks contradict one another and break down the system's ability to learn several tasks simultaneously. Another is catastrophic forgetting, where a neural network trained sequentially on different tasks overwrites what it previously learned.

This dissertation aims to endow visual recognition systems with multi-task learning (MTL) ability, enabling them to transfer knowledge inductively between tasks. Deep learning naturally clusters similar concepts and separates unrelated ones in the data and feature spaces; the objective here is to replicate this effect in the parameter and task spaces. Both gradient conflict and catastrophic forgetting can be mitigated by carefully assigning parameters to the most suitable set of tasks, allowing better performance across tasks without one task interfering with another. With this motivation, the dissertation also seeks the most effective neural network architectures for MTL: architectures that follow a scaling law, where increasing the model size, the amount of data, and the number of tasks improves performance across a broad range of tasks, though with diminishing returns as the scale continues to grow.

We begin with fundamental visual tasks such as object localization and object categorization. As a first step, we designed a unified framework that incorporates these basic perceptual capabilities and enables knowledge transfer between tasks by parameterizing and directly modeling the transformative dynamics between localization and categorization. This approach relies on human understanding to design the architecture and allocate model parameters by hand.

Beyond manual architecture design, we explore methods for automatically allocating model parameters to specific tasks, creating a framework in which different parts of the model specialize in learning distinct tasks. To achieve this, we introduce the concept of the mixture of experts (MoE), where each expert is a fundamental building block of the model. An expert can be shared across a set of tasks or dedicated to a single task, depending on what the system needs. Structured this way, the model avoids the limitations of sharing the entire backbone for every task while still enabling knowledge transfer between tasks.
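To make the MoE idea concrete, the sketch below implements a minimal task-conditioned mixture-of-experts layer in PyTorch. The class name TaskMoELayer, the per-task gating embedding, and all sizes are hypothetical choices for exposition rather than the dissertation's implementation; the point is simply that a learned per-task gate lets an expert end up shared by several tasks or dedicated to one.

    import torch
    import torch.nn as nn

    class TaskMoELayer(nn.Module):
        """Minimal task-conditioned mixture-of-experts layer (illustrative only)."""

        def __init__(self, dim, num_experts, num_tasks, top_k=2):
            super().__init__()
            # Each expert is a small MLP: a building block that tasks
            # may share or claim exclusively.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )
            # One gating vector per task: the task identity, not just
            # the input, decides which experts are used.
            self.task_gate = nn.Embedding(num_tasks, num_experts)
            self.top_k = top_k

        def forward(self, x, task_id):
            logits = self.task_gate.weight[task_id]               # (num_experts,)
            weights, idx = logits.softmax(dim=-1).topk(self.top_k)
            weights = weights / weights.sum()                     # renormalize over top-k
            # Only the selected experts run, so per-task compute stays
            # bounded even as experts are added for new tasks.
            return sum(w * self.experts[i](x) for w, i in zip(weights, idx.tolist()))

    # Usage: route a batch of 256-d features for task 1 through its experts.
    layer = TaskMoELayer(dim=256, num_experts=8, num_tasks=4)
    y = layer(torch.randn(32, 256), task_id=1)

Because routing activates only the top-k experts per task, two tasks with similar gates naturally share experts while dissimilar tasks are steered to disjoint parameters, which is the intuition behind assigning parameters to the best set of tasks.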
We further extend this approach to manage a large number of tasks efficiently. Our strategy dynamically allocates resources so that, as the system scales to more tasks, it maintains high performance and efficiency, allowing the model to grow gracefully without overwhelming computational resources. The same strategy of optimizing the parameter and task spaces extends beyond efficient upstream pre-training to the diverse needs of downstream applications. In line with this approach, we delve into Dynamic Structured Optimization techniques for adaptable and efficient downstream learning, exploring how the adaptive nature of MoE layers can enable fine-tuning, support continual learning, and provide effective control over model capacity and computational cost (a short sketch follows the abstract).

Just as humans have specialized body parts, such as hands and a brain, each suited to specific functions, neural networks are composed of parameters that serve as the AI's specialized components for different tasks. Our goal is to teach AI to coordinate these diverse elements, much as the human body seamlessly orchestrates its parts, allowing it to manage a wide range of tasks. By optimizing how these components are allocated and adapted to different tasks, we aim to build AI systems that handle complex and varied applications efficiently while scaling gracefully.
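Continuing the hypothetical TaskMoELayer sketch above (its names and structure are illustrative, not the dissertation's code), the snippet below shows one way such a layer could support the downstream behaviors described: freezing shared experts so a new task cannot overwrite earlier knowledge, fine-tuning only a dedicated expert and the task gate, and using the routing width top_k as a dial for capacity and compute.

    import torch

    # Assumes the TaskMoELayer sketch above is in scope (hypothetical names).
    layer = TaskMoELayer(dim=256, num_experts=8, num_tasks=5)

    # Continual learning: freeze every expert so a new task cannot
    # overwrite previously learned, shared knowledge.
    for expert in layer.experts:
        expert.requires_grad_(False)

    # Adapt by training only task-specific parameters: dedicate the
    # last expert to the new task and keep the per-task gate trainable.
    layer.experts[-1].requires_grad_(True)
    layer.task_gate.requires_grad_(True)

    trainable = [p for p in layer.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)

    # Capacity and cost control: the number of experts activated per
    # forward pass is just top_k, so the same weights can be deployed
    # cheaply or at higher capacity.
    layer.top_k = 1   # low-compute inference
    layer.top_k = 4   # higher-capacity inference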
Type
Dissertation (Open Access)
Date
2025-02
License
Attribution-NonCommercial-ShareAlike 4.0 International
http://creativecommons.org/licenses/by-nc-sa/4.0/