
A Comparative Study of Data-centric vs Model-centric Approaches in Machine Learning

V. Rethash Dev Reddy, Vinay, Rikansh Thakur, Shivam Jalan

Abstract


Recent trends in machine learning emphasize not only how models are built, but also what data they are trained on. In the traditional model-centric paradigm, researchers focus on refining algorithms and architectures (e.g., deeper networks or new optimizers) while holding the dataset fixed. In contrast, data-centric AI emphasizes improving the dataset itself (through label cleaning, augmentation, or curating examples) while keeping the model constant.
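To make the contrast concrete, the sketch below sets up both loops on a synthetic task: a model-centric loop that tunes hyperparameters on fixed data, and a data-centric loop that keeps the model frozen and revises the labels instead. It is a minimal illustration assuming scikit-learn; the `relabel_suspects` helper is a hypothetical stand-in for real label cleaning (e.g., human re-annotation), not a method from the paper.

```python
# Minimal sketch (illustrative only) of the two improvement loops.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic dataset with deliberate label noise (flip_y).
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)

# --- Model-centric loop: data fixed, iterate on the model ---
best_score, best_depth = 0.0, None
for depth in (4, 8, 16):
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_depth = score, depth

# --- Data-centric loop: model fixed, iterate on the dataset ---
def relabel_suspects(X, y, model):
    """Hypothetical label-cleaning step: flip labels the model disagrees
    with most confidently. A crude, self-referential stand-in for the
    human re-annotation a real data-centric workflow would use."""
    model.fit(X, y)
    proba = model.predict_proba(X)
    confident_wrong = (proba.max(axis=1) > 0.9) & (proba.argmax(axis=1) != y)
    y_clean = y.copy()
    y_clean[confident_wrong] = proba.argmax(axis=1)[confident_wrong]
    return y_clean

frozen_model = RandomForestClassifier(max_depth=8, random_state=0)  # architecture stays fixed
y_clean = relabel_suspects(X, y, RandomForestClassifier(max_depth=8, random_state=0))
data_score = cross_val_score(frozen_model, X, y_clean, cv=5).mean()

print(f"model-centric best: {best_score:.3f} (depth={best_depth})")
print(f"data-centric:       {data_score:.3f}")
```

The point of the sketch is only the shape of the two workflows: in the first loop the dataset never changes, while in the second the model never changes.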

We review these paradigms using two illustrative cases: image classification on CIFAR-10 and sentiment analysis on the IMDB movie-review dataset. For example, Bhatt et al. find that a data-centric pipeline (cleaning labels and augmenting images) yields about 3% higher CIFAR-10 accuracy than purely tuning the model. Similarly, studies in NLP show that improving the text data (balancing classes, removing noise) and using advanced embeddings (word2vec, BERT) both raise performance. Overall, the two approaches complement each other: strong results require both well-engineered data and well-designed models. Our objectives are to clarify the two paradigms, compare their effects on real tasks, and highlight key techniques and findings.
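For the image case, a data-centric pipeline of the kind described above typically applies label-preserving augmentations to the training split only, leaving the test split untouched so gains are attributable to the data. The torchvision sketch below is an illustrative assumption about what such a pipeline might look like, not the exact recipe of Bhatt et al.

```python
# Illustrative CIFAR-10 data-centric pipeline (torchvision); the specific
# augmentations are assumptions, not the authors' exact configuration.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Standard CIFAR-10 per-channel statistics, used for normalization.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

# Data-centric knob: enrich the training distribution with label-preserving
# augmentations while the model architecture stays fixed.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

# Evaluation data is left unaugmented so any improvement reflects the
# training data, not a changed test distribution.
test_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

train_set = datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR10("data", train=False, download=True, transform=test_tf)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False, num_workers=2)
```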

 



References


Andrew Ng's Data-Centric AI Framework: Foundational work establishing the principles of systematic data engineering over model refinement, showing 3% average performance improvements.

Budach et al. (2022) - Effects of Data Quality on ML Performance: Empirical study across 6 data quality dimensions and 19 ML algorithms, demonstrating non-linear performance degradation with data quality issues.

Bhatt et al. (2024) - Data-Centric vs Model-Centric Deep Learning: Direct comparison showing data-centric approaches outperform model-centric ones by at least 3% on standard benchmarks through augmentation and noise correction.

Gontijo-Lopes et al. (2021) - Data Augmentation Tradeoffs: Introduces Affinity and Diversity metrics for understanding augmentation effectiveness, providing a theoretical foundation for data-centric improvements.

Iwana & Uchida (2021) - Time Series Data Augmentation Survey: Comprehensive evaluation showing that dataset properties significantly influence augmentation effectiveness, with correlation analysis across multiple characteristics.

Chen et al. (2023) - NLP Data Augmentation Survey: Large-scale empirical comparison across 11 NLP tasks showing task-dependent effectiveness of different augmentation strategies.

Vakili et al. (2020) - ML Algorithm Performance Comparison: Systematic evaluation of 11 ML algorithms across 6 datasets using standardized metrics, establishing baseline comparison methodologies.

Orzechowski et al. (2022) - Reproducible ML Benchmarks: Introduces a systematic approach to reproducible benchmarking with synthetic datasets, addressing reproducibility challenges in ML evaluation.

Mumuni & Mumuni (2022) - Data Augmentation Survey: Extensive review of modern augmentation techniques across computer vision domains, categorizing approaches by information utilization.

