Data-Centric AI
Data-centric AI calls for intelligently obtaining the best possible data for training a model. Data-centric practices can significantly reduce the financial, labor, and time costs of developing AI systems in the wild. We explore the problems this raises for modern AI systems, such as how much data to collect, how to mix data sources, and how to reconcile labels across datasets.
Relevant publications
- AutoScale: Scale-aware data mixing for pre-training LLMs. Conference on Language Modeling, 2025. (arXiv, Project Page)
- Optimizing data collection for machine learning. Journal of Machine Learning Research, 2025. (arXiv, Project Page)
Preliminary version appeared in NeurIPS 2022.
- Translating labels to solve annotation mismatches across object detection datasets. International Conference on Learning Representations (ICLR), 2024. (Project Page)
- Bridging the Sim2Real gap with CARE: Supervised detection adaptation with conditional alignment and reweighting. Transactions on Machine Learning Research (TMLR), 2023. (arXiv, Project Page)
- How much more data do I need? Estimating requirements for downstream tasks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Low budget active learning via Wasserstein distance: An integer programming approach. International Conference on Learning Representations (ICLR), 2022. (arXiv)