I found myself discussing this twice in the past week and decided to write it down.
Not long ago, generating and storing data was significantly more expensive than it is today. As a result, domain experts would carefully deliberate over which features or variables to measure before designing experiments and applying feature transformations. This thoughtful approach often led to well-structured datasets with a limited number of highly relevant features.
In contrast, modern data science has shifted toward a more comprehensive, end-to-end integration approach. With advancements making data generation and storage faster, cheaper, and more accessible, there’s a growing tendency to measure as much as possible and apply increasingly complex feature transformations. Consequently, datasets today are often high-dimensional, containing a large number of features—though not all may be relevant for meaningful analysis.
Many developers assume that during neural network training, the model will naturally learn to minimize the impact of features that lack predictive value by assigning them weights close to or equal to zero. That assumption is largely correct, but it does not lead to an efficient model.
During inference, much of a model may effectively be "shut off" as it ignores unused components, but those inactive parts still exist. They occupy memory and consume compute every time you run the model to get predictions.
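To make that concrete, here is a minimal sketch using an L1-regularized linear model (scikit-learn's Lasso) as a simplified stand-in for a network, since it makes the effect easy to see; the feature counts and the alpha value are arbitrary choices for illustration.

```python
# Minimal sketch: an L1-regularized linear model drives the weights of
# uninformative features toward zero. Dataset sizes and alpha are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

n_samples = 500
X_useful = rng.normal(size=(n_samples, 3))   # 3 informative features
X_noise = rng.normal(size=(n_samples, 20))   # 20 irrelevant features
y = (2.0 * X_useful[:, 0] - 1.5 * X_useful[:, 1] + 0.5 * X_useful[:, 2]
     + rng.normal(scale=0.1, size=n_samples))

X = np.hstack([X_useful, X_noise])

model = Lasso(alpha=0.05).fit(X, y)
print("informative coefficients:", np.round(model.coef_[:3], 2))
print("irrelevant coefficients: ", np.round(model.coef_[3:], 2))
# The irrelevant coefficients land at (or near) zero, yet every one of the
# 23 inputs is still collected, stored, multiplied, and summed at prediction time.
```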
Additionally, irrelevant features can introduce noise into the data, often degrading model performance. High-dimensional datasets can also lead to overfitting. Beyond the model itself, each additional feature requires systems to collect, store, and manage that data—adding cost and complexity to the overall infrastructure. This includes monitoring for data issues and addressing them when they arise, incurring ongoing expenses throughout the product or service's lifecycle, which could span years.
While techniques exist to optimize models by pruning weights near zero, it’s generally unwise to include every possible feature and depend solely on the training process to identify what’s truly useful.
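For completeness, here is roughly what magnitude pruning looks like with PyTorch's torch.nn.utils.prune; the layer size and pruning fraction are placeholders. Note that the zeroed weights still sit in a dense tensor, so real memory and compute savings require sparse storage or structured pruning plus a runtime that exploits it.

```python
# Sketch of magnitude-based pruning with torch.nn.utils.prune.
# Layer sizes and the pruning fraction are arbitrary placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(in_features=128, out_features=64)

# Zero out the 40% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.4)

# Make the pruning permanent: the mask is folded into the weight tensor.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")
```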
Curse of Dimensionality
Many common machine learning tasks, such as segmentation and clustering, rely heavily on calculating distances between observations. For instance, in supervised classification, distances between data points are often used to assign classes, as seen in algorithms like K-nearest neighbors. Similarly, support vector machines (SVMs) use kernels to project observations into a new space, where the distances between points after projection play a critical role. Recommendation systems also leverage distance-based similarity measures to compare user and item attribute vectors.
Various distance metrics can be applied, but one of the most widely used is Euclidean distance. This metric represents the straight-line distance between two points in a multidimensional space. For two-dimensional vectors with Cartesian coordinates, the Euclidean distance is calculated using the well-known formula d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)²), and the same pattern extends to any number of dimensions as the square root of the sum of squared coordinate differences.
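In code, the same idea generalizes to any number of dimensions, for example with NumPy:

```python
# Euclidean distance between two points, for any number of dimensions.
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

d = np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b)
print(d)  # 5.0
```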
Why is distance so important? To understand, let’s examine some challenges of measuring distance in high-dimensional spaces.
At first glance, it might not seem obvious why high-dimensional data can be problematic. However, in extreme cases where the number of features (dimensions) exceeds the number of observations, the model risks severe overfitting. Even in less extreme scenarios, having too many features complicates the clustering of observations: in high-dimensional spaces, data points tend to appear nearly equidistant from one another, making it difficult for clustering algorithms that rely on distance metrics to differentiate between them. The volume of the space grows exponentially with each added dimension, so data points spread out and meaningful differences between them are drowned out by high-dimensional noise. Consequently, distance measures lose their effectiveness, necessitating alternative approaches such as dimensionality reduction or distance metrics designed specifically for high-dimensional data.
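A quick simulation makes this concrete. With points drawn uniformly at random, the relative gap between the nearest and the farthest point from a query shrinks as the dimensionality grows; the point counts below are arbitrary.

```python
# Distance concentration: as dimensionality grows, the gap between the
# nearest and farthest neighbor shrinks relative to the distances themselves.
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    ratio = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative spread (max-min)/min = {ratio:.3f}")
# The relative spread drops as dim increases, so "nearest" vs "farthest"
# becomes less and less meaningful for distance-based methods.
```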
The term curse of dimensionality describes the counterintuitive behavior of data in high-dimensional spaces. This phenomenon primarily affects our ability to interpret and utilize distances and volumes effectively. It has two key implications:
ML excels at high-dimensional analysis: Machine learning algorithms are better equipped than humans to analyze high-dimensional data. They can identify patterns and relationships in datasets with numerous features that would be impractical to spot by manual inspection.
Increased dimensionality demands more resources: As the number of dimensions grows, so do the computational requirements and the need for larger amounts of training data to build effective models.
While it's tempting to add as many features as possible to improve our models, doing so often leads to significant challenges. Redundant or irrelevant features can introduce noise, reducing the predictive power of the model. High dimensionality also complicates data interpretation and visualization. Additionally, more features require increased storage and processing power, raising the cost and complexity of managing the system. Ultimately, adding more dimensions often results in less efficient models.
When models underperform, there is a natural inclination to include more features in an attempt to improve results. However, beyond a certain point, adding features can degrade performance, highlighting the importance of careful feature selection and dimensionality reduction.
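As a rough illustration of both effects, the sketch below appends purely random features to a synthetic classification dataset and watches a distance-based classifier degrade, then recovers much of the loss with a simple univariate filter (scikit-learn's SelectKBest). All dataset and parameter choices here are arbitrary.

```python
# Sketch: adding random features tends to hurt a distance-based classifier,
# while a simple univariate filter can recover much of the loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, random_state=1)
X_noisy = np.hstack([X, rng.normal(size=(600, 200))])  # add 200 noise features

knn = KNeighborsClassifier(n_neighbors=5)
print("original:      ", cross_val_score(knn, X, y, cv=5).mean())
print("with noise:    ", cross_val_score(knn, X_noisy, y, cv=5).mean())

filtered = make_pipeline(SelectKBest(f_classif, k=10), knn)
print("noise + filter:", cross_val_score(filtered, X_noisy, y, cv=5).mean())
```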
(photo by Maria Kostyleva, Instagram)