Managing Data for Machine Learning
Data-centric machine learning calls for intelligently obtaining the best possible data for training a model. Data-centric practices can significantly reduce the financial, labor, and time costs of designing, training, and deploying AI systems in the wild. This research proposes operations-based approaches to data-centric modeling by optimizing what data to collect, synthesize, and label for building ML models.
Relevant publications
- Pricing and competition for generative AI. Advances in Neural Information Processing Systems (NeurIPS), 2024.
- Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. Empirical Methods in Natural Language Processing (EMNLP), 2024.
- Translating labels to solve annotation mismatches across object detection datasets. International Conference on Learning Representations (ICLR), 2024.
- Bridging the Sim2Real gap with CARE: Supervised detection adaptation with conditional alignment and reweighting. Transactions on Machine Learning Research (TMLR), 2023.
- Optimizing data collection for machine learning. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- How much more data do I need? Estimating requirements for downstream tasks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Low budget active learning via Wasserstein distance: An integer programming approach. International Conference on Learning Representations (ICLR), 2022.