Is “Small Data” The Next Big Thing In Data Science? | by Wouter van Heeswijk, PhD | Aug, 2022

The next decade will revolve around Data-Centric Artificial Intelligence, predicts AI pioneer Andrew Ng. We may not need millions of noisy samples if we have 50 well-crafted ones.

We may be at the brink of a small data era [Photo by Daniel K Cheung on Unsplash]

For the past two decades or so, we have lived in an era of ‘Big Data’. With storage capacity and computational power becoming ever cheaper, we could store and process massive amounts of data to generate new insights. Fueled by the successes of Google, Amazon and Facebook, the field has made substantial breakthroughs in large-scale data analysis, and data-driven decision-making has become a top priority for many enterprises.

We have witnessed gigantic neural networks, with millions of parameters to tune. Vast streams of social media feeds, processed in real-time. Petabytes of fine-grained information extracted from high-frequency sensor and user logs, stored in enormous server farms. The breakthroughs have been plentiful and exhilarating.

Such big data trends will persist, no doubt. As long as there is more data to collect, we will find new ways to utilize it. For instance, natural language processing has matured, yet video analysis remains pretty much a green field, still awaiting technological advances to propel developments.

Nonetheless, there is a world outside of Silicon Valley that tends to be overlooked. Millions of SMEs (and other organizations) out there deal with problems that beg for comprehensive data solutions. These organizations simply want to extract valuable insights from their small data sets, leveraging the state of the art in machine learning, without relying on bizarrely large datasets. For them, that time may have arrived.

To make the potential applications a bit more concrete, just consider the following few examples:

  • Cost accounting: Predicting costs for custom-built machines
  • Health care: Identifying tumors on X-ray images
  • Manufacturing: Automatically detecting defects on a production line

The relevance of such examples is evident, as is the pivotal role that data can play. However, these are not necessarily tasks for which billions of data points are readily available, especially when considering rare defects or diseases. To make the most of modern machine learning, a different angle is needed.

Manufacturing is one of the fields that may benefit from data-driven defect detection, yet the number of relevant defect examples is often too small for effective machine learning [Photo by Mulyadi on Unsplash]

Enter Andrew Ng. Having founded Google Brain, taught at Stanford, co-founded the online learning platform Coursera (including the extremely popular ‘Machine Learning’ course), and pioneered the use of GPUs for machine learning, it is safe to say he has some credibility. When he identifies an emerging trend in data science, it pays to listen.

Andrew argues that — in order to unlock the full potential of artificial intelligence — it is time to start focusing on the quality of data, dubbing the movement data-centric AI. In the past years, the community’s focus has been model-centric, with an emphasis on designing, fine-tuning and improving algorithms suitable for various tasks (text mining, image recognition, etc.).

Model-centric research has been very fruitful, culminating in many high-quality architectures. However, to maintain momentum, designing and improving algorithms alone is not enough. For true progress, the quality of model input should match the quality of the transformation.

We will revisit data-centric AI in more depth, but first we must address the model-centric AI that currently dominates the field.

In model-centric AI, data is assumed to be a given. The focus is on improving the model, trying to get the best possible performance out of the fixed data set [image by author]

Traditionally, data is considered a given input for algorithms. The main question is which machine learning algorithm squeezes the most out of the data. Do we need gradient-boosted trees or neural networks? How many layers, which activation functions, which gradient descent algorithm? The plethora of options posed many challenges in identifying suitable architectures. Large data sets made it possible to overcome noisy and missing data.
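
To make this model-centric mindset concrete, here is a minimal sketch (using scikit-learn and a synthetic dataset, purely for illustration): the data is taken as a given, and all the effort goes into comparing model families and hyperparameters.

```python
# Illustrative model-centric loop: the data set is fixed, and we search
# over model families and hyperparameters for the best performer.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for "the data we happen to have"
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

candidates = {
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    ),
    "neural_network": (
        MLPClassifier(max_iter=1_000, random_state=0),
        {"hidden_layer_sizes": [(32,), (64, 32)], "activation": ["relu", "tanh"]},
    ),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5).fit(X, y)
    print(f"{name}: best CV accuracy {search.best_score_:.3f} with {search.best_params_}")
```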

Andrew Ng postulates that model-centric AI has now reached a point of saturation. Many open questions have been resolved by extensively evaluating architectures for various tasks. For instance, Google’s natural language processing model BERT has been trained on the English language. For a different language, we might use the BERT architecture as a starting point, tweaking and tailoring along the way, rather than starting from scratch.
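
As a rough illustration of such re-use, the sketch below loads the publicly available multilingual BERT checkpoint via the Hugging Face transformers library and attaches a fresh classification head, rather than designing a new architecture from scratch. The checkpoint name and the two-class setup are merely assumptions for the example.

```python
# Minimal sketch of re-using a pre-trained BERT architecture for a new
# language/task instead of training from scratch (Hugging Face transformers).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-multilingual-cased"  # pre-trained starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Only the small task-specific head is new; the pre-trained layers are
# fine-tuned ("tweaked and tailored") on the downstream data set.
inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2): one score per class
```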

Model-centric AI, through brilliance and experience, has brought us a lot. For many common problems, we now have suitable algorithms that are empirically proven to work well. The implication is that we can use existing models for certain problem classes, rather than re-inventing the wheel for every problem instance we encounter. Combined with available tooling, one no longer needs to be an algorithmic expert to deploy industry-ready AI.

Obviously, model-centric AI is not a dead-end street — algorithmic advances will always continue. However, open-source libraries and example architectures go a long way in solving AI problems. In terms of improvement potential, there is now more to gain in data than in models.

In data-centric AI, models are kept more or less fixed. Instead, the focus is on improving data quality, aiming for an in-depth understanding of small data sets [image by author]

Despite the dazzling amounts of data being produced every day, the quality of that data may be rather poor (as any data scientist knows). Missing data, entry or measurement errors, duplicates, irrelevant predictors: they all make it hard to train a model. Sufficiently large datasets may overcome such obstacles, yet a dataset that is both small and of poor quality is a recipe for disaster.
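
A first pass at spotting such issues can be as simple as the pandas audit below; the file name machine_costs.csv and the cost column are hypothetical stand-ins for a small real-world dataset.

```python
# Quick data-quality audit of a small tabular dataset with pandas:
# missing values, duplicates, and implausible (out-of-range) entries.
import pandas as pd

df = pd.read_csv("machine_costs.csv")  # hypothetical small dataset

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_per_column": df.isna().sum().to_dict(),
    # Example of a domain rule: costs should be strictly positive.
    "non_positive_costs": int((df["cost"] <= 0).sum()) if "cost" in df else None,
}
print(report)
```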

Additionally, we are often interested only in specific subsets of data. Ten million images of healthy lungs or a pile of non-fraudulent transactions are of little help for the use cases at hand. Even when data sets seem sufficiently large at first glance, we often deal with severe class imbalance, having only a few meaningful examples to learn from.
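
A quick sanity check along these lines (with a hypothetical transactions.csv file and is_fraud column) makes the imbalance explicit before any modeling starts.

```python
# Checking how few meaningful (minority-class) examples are actually available.
import pandas as pd

labels = pd.read_csv("transactions.csv")["is_fraud"]  # hypothetical labels column
counts = labels.value_counts()
print(counts)
print(f"minority share: {counts.min() / counts.sum():.4%}")
```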

Acknowledging the importance of data quality is far from novel; the adage ‘garbage in, garbage out’ is well known. Yet data cleaning typically occurs on an ad-hoc basis, relying on the ingenuity of individual data scientists. Worse, it is not clear up front which data properties (outliers, missing values, transformations, etc.) impact model performance the most when fixed, leading to a frustrating cycle of trial and error.

In contrast, data-centric AI advocates a systematic and methodical approach to improving data quality, aimed at the data segments that have the largest effect on performance. By identifying salient features, eliminating noise, analyzing errors and labeling consistently, training effectiveness may be drastically improved. The key is to generalize and automate such procedures.
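
One simple way to find those segments is slice-based error analysis, sketched below on a toy evaluation table: rather than tuning the model further, we rank data segments by error rate and improve the worst ones first.

```python
# Sketch of slice-based error analysis: find the data segments where the
# model fails most, then improve those examples first.
import pandas as pd

# Hypothetical evaluation frame: one row per example, with segment metadata
# and model predictions already attached.
eval_df = pd.DataFrame({
    "segment": ["line_A", "line_A", "line_B", "line_B", "line_C", "line_C"],
    "label":   [1, 0, 1, 1, 0, 0],
    "pred":    [1, 0, 0, 0, 0, 1],
})

eval_df["error"] = (eval_df["label"] != eval_df["pred"]).astype(int)
per_segment = eval_df.groupby("segment")["error"].agg(["mean", "count"])
print(per_segment.sort_values("mean", ascending=False))
# Segments with the highest error rates are prime candidates for re-labeling,
# cleaning, or collecting additional examples.
```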

Until now, the center of gravity has been on improving models rather than improving data itself. Data-centric AI aims to change this.

The notion of systematically improving data quality makes sense, but what developments can we concretely expect? A shift towards ‘small and smart data’ is key, focusing on high-quality data and explainable examples.

Proven architectures can be re-used with just some modifications, more or less fixing the model. In that case, the data itself becomes the object of interest. By fixing one variable (the model), analysis of the other (the data) is a lot easier. However, it is hard to make sense out of large data sets. In-depth human analysis requires small, comprehensible sets.

Data-centric AI will require a substantial shift in culture. Rather than fidgeting with layers and hyperparameters, we will spend substantially more time labeling and slicing data sets. As these are tasks most of us do not necessarily enjoy, this culture shift is not something to take lightly, even if the long-term vision involves automation of tedious cleaning tasks.

In the end, data-centric AI advocates a systematic approach to improving data quality. Two main directions may be distinguished:

  • Developing tools to detect inconsistencies. To truly scale and generalize data improvements, it is necessary to automate detection and cleaning, moving away from time-consuming and error-prone manual data cleaning routines. Crucially, cleaning operations should be consistent and explainable (see the sketch after this list).
  • Leveraging domain knowledge. To accurately interpret what information is conveyed, experts are needed to scrutinize datasets. Precise and accurate features and thresholds are needed to get the most out of the data, as is identifying potentially missing examples.
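
As a minimal sketch of the first direction, the check below flags examples whose features are identical but whose labels disagree, a common symptom of inconsistent labeling; the file labeled_defects.csv and the label column are hypothetical.

```python
# Automated consistency check: flag feature combinations that appear with
# conflicting labels, so they can be routed to experts for re-labeling.
import pandas as pd

df = pd.read_csv("labeled_defects.csv")  # hypothetical labeled dataset
feature_cols = [c for c in df.columns if c != "label"]

label_spread = df.groupby(feature_cols)["label"].nunique()
conflicts = label_spread[label_spread > 1]
print(f"{len(conflicts)} feature combinations carry conflicting labels")
```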

Human interpretability is central in these trends, prompting a move towards small, explainable data sets. Rather than millions of noisy examples, Andrew Ng claims 50 excellent examples could train an ML algorithm equally well. Obviously, the design effort of these examples will be substantial, with every single example yielding a meaningful contribution.

In practice, the desired data might not always be readily available. To augment existing data (feature analysis helps identify the quick wins), a promising direction is that of Generative AI, which enables us to construct synthetic data that is indistinguishable from reality. Based on examples and domain knowledge, we may construct artificial examples with precisely the properties we need, e.g., an image of a rare kind of defect or a particular stock market jump.
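
As a lightweight stand-in for such generative approaches, the sketch below uses SMOTE (from the imbalanced-learn library) to synthesize extra minority-class examples by interpolating between existing ones; a full generative model such as a GAN would be the heavier-weight analogue.

```python
# Simple synthetic augmentation: SMOTE creates new minority-class examples
# by interpolating between existing ones in feature space.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, heavily imbalanced dataset for illustration
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
X_aug, y_aug = SMOTE(random_state=0).fit_resample(X, y)
print(f"before: {sum(y == 1)} rare examples, after: {sum(y_aug == 1)}")
```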

A shift towards small data would have considerable impact on data science. It opens doors for many problems that don’t have massive associated data sets. It allows generating high-quality artificial data sets. It aligns with the Explainable AI movement that has been gaining traction. Indeed, it would be a fundamental break in the field.

With many breakthroughs in machine learning algorithms being realized over the past decades, it seems a point of saturation has been hit. We have plenty of (open source) libraries and proven architectures to tackle a variety of tasks, such that models can be kept largely fixed. With that in mind, data-centric AI might be the next breakthrough, with a focus on systematic approaches to improve data quality where it matters most.

Current training approaches often rely on sufficiently large sets to overcome noise and missing data. However, many real-world problems generate only small data sets. If we carefully craft representative input by scrutinizing examples, small sets may suffice to train high-quality models. To achieve this, both human expertise and systematic improvement methods are vital to realize generalizable advances.

Concretely, the near future of data science may entail a revitalized attention for activities such as (i) expert analysis, (ii) consistent labeling, (iii) noise removal, (iv) error correction, (v) feature engineering, and (vi) artificial data generation. Domain expertise and human interpretability will get more prominence for in-depth assessment of small data sets, yet the long-term aim is to provide systematic and automated solutions to investigate and enhance data quality.

Data-centric AI may considerably alter the daily tasks of many data scientists in the near future. Quick wins accumulate into a competitive edge, and right now, it appears most wins can be achieved by improving data rather than algorithms.
