The Age of 'Bigger is Better'
Not long ago, the path to smarter AI seemed straightforward. The prevailing wisdom, backed by what researchers call "scaling laws," held that performance improved predictably as you threw more at the problem: more data, more parameters (the adjustable weights that connect the network's 'neurons'), and more computing power. [4, 8] This led to a boom in massive models with hundreds of billions of parameters, some reportedly exceeding a trillion, trained on vast swaths of the internet. The logic was intoxicating. If a 100-billion parameter model was good, a one-trillion parameter model must be better. This model-centric approach, where the algorithm and its scale are the primary focus, dominated AI research and investment, with labs competing to announce the next record-breaking model size. [7, 16] It was an era defined by brute force, where computational might was seen as the main ingredient for progress.
Cracks in the Scaling Foundation
But this obsession with scale is hitting a wall of practical and ethical limitations. First, the costs are astronomical. Training a massive model can require millions of dollars in computing power and consume enough energy to power a small city. [1, 2, 8] The environmental footprint is also staggering, with some training runs using hundreds of thousands of liters of water to cool data centers. [9, 20] Second, returns are diminishing; simply making a model bigger doesn't guarantee it will be proportionally smarter or more reliable. [8] These mega-models still "hallucinate" (make things up), amplify biases found in their training data, and struggle with complex reasoning. [1, 4, 21] Finally, the internet, the primary source of training data, is a finite resource. Researchers are beginning to worry about exhausting the supply of high-quality human-generated text and images. [8] These challenges suggest that the model-centric arms race is becoming unsustainable.
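To make the diminishing-returns point concrete, here is a minimal sketch that assumes loss follows a simple power law in parameter count. The functional form, the function names, and the constants `n_c` and `alpha` are illustrative assumptions, not figures taken from this article or its sources.

```python
# Illustrative sketch only: assumes loss = (n_c / n_params) ** alpha.
# n_c and alpha are placeholder constants chosen for illustration.

def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Hypothetical loss as a power-law function of parameter count."""
    return (n_c / n_params) ** alpha


def relative_improvement(n_small: float, n_large: float) -> float:
    """Fraction by which the loss drops when growing from n_small to n_large."""
    small = power_law_loss(n_small)
    large = power_law_loss(n_large)
    return (small - large) / small


if __name__ == "__main__":
    # Each step multiplies the parameter count (and roughly the cost) by 10.
    for n in (1e9, 1e10, 1e11):
        drop = power_law_loss(n) - power_law_loss(10 * n)
        print(f"{n:.0e} -> {10 * n:.0e} params: absolute loss drop = {drop:.4f}")
```

Under this assumed curve, every tenfold increase in size buys the same relative reduction in loss, so the absolute gain shrinks at each step while the cost of that step keeps multiplying.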
The 'Better Data' Revolution
Enter the data-centric AI movement, a philosophy championed by pioneers like Andrew Ng. [10, 16] The core idea is simple but profound: for many applications, the fastest path to improvement isn't tweaking the model but systematically engineering the data. [3, 18] Instead of holding the data fixed and iterating on the model code, the data-centric approach holds the model fixed and improves the quality, consistency, and relevance of the data it learns from. [11, 12] In practice, that means meticulous data curation, cleaning, and labeling, and even generating high-quality synthetic data to fill gaps. [23, 24] A smaller model trained on pristine, well-curated data can often outperform a massive model trained on a messy, unfiltered dataset full of "garbage." [22] This shift democratizes AI, allowing organizations without access to supercomputers to build effective systems by focusing on what they can control: their data. [10, 12]
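As a rough illustration of what "engineering the data" can look like, the sketch below applies three simple cleaning rules to a toy labeled dataset: drop near-empty texts, drop texts whose copies carry conflicting labels, and remove duplicates. The function names, the `min_chars` threshold, and the sample records are all hypothetical; real curation pipelines are considerably more involved.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def clean_dataset(examples, min_chars=5):
    """Return (text, label) pairs after basic quality filters.

    Hypothetical rules, for illustration only:
      1. drop texts shorter than min_chars after normalization,
      2. drop texts that appear with more than one label (inconsistent labeling),
      3. keep only the first copy of each duplicated text.
    """
    # First pass: collect every label seen for each normalized text.
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[normalize(text)].add(label)

    # Second pass: apply the filters in order.
    cleaned, seen = [], set()
    for text, label in examples:
        key = normalize(text)
        if len(key) < min_chars:
            continue  # too short to be informative
        if len(labels_by_text[key]) > 1:
            continue  # annotators disagreed; label is unreliable
        if key in seen:
            continue  # duplicate of an earlier example
        seen.add(key)
        cleaned.append((key, label))
    return cleaned


# Made-up sample records for demonstration.
raw = [
    ("Great product, works well", "positive"),
    ("great product,  works well", "positive"),  # duplicate after normalization
    ("Arrived broken", "negative"),
    ("arrived broken", "positive"),              # conflicting label
    ("ok", "neutral"),                           # too short
    ("Would not buy again", "negative"),
]
cleaned = clean_dataset(raw)
```

The model never changes in this workflow; only the training set does, which is the sense in which the approach "holds the model fixed."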
What to Watch for at ICML 2026
As the machine learning community convenes for the International Conference on Machine Learning (ICML), these trends will be front and center. While there will always be a place for large-scale research, expect to see a surge in papers focused on data-centric themes. Topics like dataset selection, data quality metrics, efficient data labeling, and bias mitigation through data curation will likely dominate many sessions. [17, 26, 29] Discussions that were once about who has the biggest model are shifting toward who has the smartest data strategy. You might see new techniques for improving data organization to reduce model hallucinations or frameworks for generating synthetic data to train models in data-scarce environments. [28, 30] The conversations at ICML 2026 will offer a clear signal: the industry is moving from an era of brute-force scale to one of intelligent, data-driven refinement. It's a sign of a maturing field realizing that what you feed the machine is just as important as the machine itself. [5, 6]













