Managing Data for Machine Learning

Data-centric machine learning calls for intelligently obtaining the best possible data for training a model. Data-centric practices can significantly reduce the financial, labor, and time costs of designing, training, and deploying AI systems in the wild. This research proposes operations-based approaches to data-centric modeling that optimize which data to collect, synthesize, and label when building ML models.

Relevant publications

  1. Translating labels to solve annotation mismatches across object detection datasets. Andrew Liao, David Acuna, Rafid Mahmood, James Lucas, Viraj Prabhu, and Sanja Fidler. International Conference on Learning Representations (ICLR), 2024.
  2. Bridging the Sim2Real gap with CARE: Supervised detection adaptation with conditional alignment and reweighting. Viraj Prabhu, David Acuna, Andrew Liao, Rafid Mahmood, Marc T Law, Judy Hoffman, Sanja Fidler, and James Lucas. Transactions on Machine Learning Research (TMLR), 2023.
  3. Optimizing data collection for machine learning. Rafid Mahmood, James Lucas, Jose M Alvarez, Sanja Fidler, and Marc T Law. Advances in Neural Information Processing Systems (NeurIPS), 2022.
  4. How much more data do I need? Estimating requirements for downstream tasks. Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M Alvarez, Zhiding Yu, Sanja Fidler, and Marc T Law. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  5. Low budget active learning via Wasserstein distance: An integer programming approach. Rafid Mahmood, Sanja Fidler, and Marc T Law. International Conference on Learning Representations (ICLR), 2022.
  6. Sampling from the complement of a polyhedron: An MCMC algorithm for data augmentation. Timothy CY Chan, Adam Diamant, and Rafid Mahmood. Operations Research Letters, 2020.