Data-Centric AI

Data-centric AI calls for intelligently obtaining the best possible data for training a model. Data-centric practices can significantly reduce the financial, labor, and time costs of developing AI systems in the wild. We study data-centric methods for modern AI systems.

Relevant publications

  1. AutoScale: Scale-aware data mixing for pre-training LLMs. Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, and Ruoxi Jia. Conference on Language Modeling, 2025.
  2. Optimizing data collection for machine learning. Rafid Mahmood, James Lucas, Jose M Alvarez, Sanja Fidler, and Marc T Law. Journal of Machine Learning Research, 2025.
    • Preliminary version appeared in NeurIPS 2022.

  3. Translating labels to solve annotation mismatches across object detection datasets. Yuan-Hong Liao, David Acuna, Rafid Mahmood, James Lucas, Viraj Prabhu, and Sanja Fidler. International Conference on Learning Representations (ICLR), 2024.
  4. Bridging the Sim2Real gap with CARE: Supervised detection adaptation with conditional alignment and reweighting. Viraj Prabhu, David Acuna, Andrew Liao, Rafid Mahmood, Marc T Law, Judy Hoffman, Sanja Fidler, and James Lucas. Transactions on Machine Learning Research (TMLR), 2023.
  5. How much more data do I need? Estimating requirements for downstream tasks. Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M Alvarez, Zhiding Yu, Sanja Fidler, and Marc T Law. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  6. Low budget active learning via Wasserstein distance: An integer programming approach. Rafid Mahmood, Sanja Fidler, and Marc T Law. International Conference on Learning Representations (ICLR), 2022.