Dealing with Sparse Datasets in Machine Learning

by datatabloid_difmmk

This article was published as a part of the Data Science Blogathon.

Introduction

Missing data in machine learning is data that contains null values, while sparse data is data in which most entries do not hold an actual feature value: the dataset consists largely of zeros.

Sparse datasets with a high proportion of zero values can cause overfitting and several other problems in machine learning models. Dealing with sparse data is therefore one of the most demanding tasks in machine learning.

In most cases, sparsity makes a dataset a poor fit for machine learning problems, so it needs to be handled properly. Still, sparsity is useful in some cases: it reduces the memory footprint of regular networks so that they fit on mobile devices, and it shortens training time for ever-growing networks in deep learning.

[Image: a sparse dataset]

In the image above you can see a dataset with lots of zeros, which means that the dataset is sparse. This kind of sparsity is most often seen after one-hot encoding, because each category becomes its own column and every row holds a 1 in only one of those columns.
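To make the link between one-hot encoding and sparsity concrete, here is a minimal sketch. The `cities` DataFrame and its values are made up for illustration; sparsity is measured simply as the fraction of cells equal to zero.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with four distinct values.
cities = pd.DataFrame({"city": ["Paris", "Tokyo", "Lima", "Cairo", "Tokyo", "Paris"]})

# One-hot encode: each distinct value becomes its own 0/1 column.
encoded = pd.get_dummies(cities, columns=["city"], dtype=int)

# Sparsity = fraction of cells that are zero.
sparsity = (encoded == 0).to_numpy().mean()

print(encoded.shape)
print(f"sparsity: {sparsity:.2f}")
```

With k distinct categories, each row holds a single 1 and k - 1 zeros, so the sparsity of the encoded column block is (k - 1) / k. Here that is 3/4, and it grows as the number of categories grows.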

Need for sparse data handling

Sparse datasets cause several problems while training machine learning models, which is why they should be handled properly.

Common problems with sparse data are:

1. Overfitting:

If the training data contains too many features, the model tends to follow every small variation in the training data. The result is high accuracy on the training data and poor performance on the test dataset.

[Image: a model overfitting the training data]

In the image above, you can see that the model is overfitting the training data and trying to follow or imitate every trend in it. This makes the model perform poorly on test or unseen data.

2. Ignoring important data:

Some machine learning algorithms disregard sparse features and tend to train on and fit only the dense parts of a dataset.

The sparse data that gets ignored may still carry predictive power and useful information. Relying on such algorithms is therefore not always the better way to deal with sparse datasets.

3. Space complexity

A dataset with sparse features has many more columns, most of them filled with zeros, so it requires more space to store than an equivalent dense dataset. The space complexity increases, and more computational power is needed to process this type of data.
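The storage cost comes from keeping every zero explicitly. As a rough illustration, the sketch below builds a hypothetical matrix with about 1% non-zero entries and compares a plain NumPy array with SciPy's compressed sparse row (CSR) format. The matrix size and density are assumptions chosen for the example, not figures from this article.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical matrix: 1,000 rows x 500 columns, roughly 1% non-zero entries.
dense = np.zeros((1000, 500), dtype=np.float64)
mask = rng.random(dense.shape) < 0.01
dense[mask] = 1.0

# CSR stores only the non-zero values plus their column indices and row pointers.
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense: {dense_bytes} bytes, CSR: {csr_bytes} bytes")
```

The dense array always costs rows x columns x 8 bytes here, however many zeros it holds, while the CSR size scales with the number of non-zero entries. Whether a sparse format helps in practice depends on whether the downstream model accepts it.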

4. Time complexity

If the dataset is sparse, the model will take longer to train than it would on a dense dataset, because the sparse dataset is also larger.

5. Change in the behavior of algorithms

Not every algorithm copes equally well with sparse data, and some perform poorly when trained on it. Logistic regression is one example: its best-fit line can show flawed behavior when it is trained on a sparse dataset.

How to handle sparse datasets

As explained above, sparse datasets can be unsuitable for training machine learning models and should be handled appropriately. There are several ways to do so.

1. Convert features from sparse to dense

While training a machine learning model, it is generally good to have dense features in your dataset. If the dataset contains sparse features, converting them to dense features is often the better option.

There are several ways to densify features.

1. Use principal component analysis (PCA).

PCA is a dimensionality reduction method that reduces the number of dimensions in a dataset and keeps only the most significant components in the output.

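Example:

Implement PCA on a dataset

import pandas as pd
from sklearn.decomposition import PCA

# df is assumed to be an existing DataFrame of numeric features plus a 'label' column
features = df.drop(columns=['label'])  # keep the label out of the projection
pca = PCA(n_components=2)
principal_components = pca.fit_transform(features)
pca_df = pd.DataFrame(data=principal_components,
                      columns=['principal component 1', 'principal component 2'],
                      index=df.index)
df = pd.concat([pca_df, df[['label']]], axis=1)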
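When the data is already held in a SciPy sparse matrix, a closely related option is scikit-learn's `TruncatedSVD`, which accepts sparse input directly. The sketch below swaps in `TruncatedSVD` for PCA; it is not the method used above, and unlike PCA it does not center the data first. The matrix `X` is randomly generated for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical sparse input: 100 samples x 50 features, about 5% non-zero.
dense = rng.random((100, 50))
dense[dense < 0.95] = 0.0
X = sparse.csr_matrix(dense)

# Reduce to 2 dense components without converting X to a dense array first.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
```

The output is an ordinary dense NumPy array with one row per sample and one column per component, so it can be passed to models that expect dense features.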

2. Use feature hashing.

Feature hashing is a technique used with sparse datasets in which features are hashed into a fixed number of output columns, and that number can be set as desired.

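from sklearn.feature_extraction import FeatureHasher

# Hash every feature name into one of 10 output columns
h = FeatureHasher(n_features=10)
# Each dict is one sample, mapping feature name to value
p = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
f = h.transform(p)  # returns a SciPy sparse matrix
f.toarray()         # convert to a dense array for display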

Output:

array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

3. Perform feature selection and feature extraction

4. Use t-Distributed Stochastic Neighbor Embedding (t-SNE)

5. Use a low variance filter
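As one illustration of the last item, scikit-learn's `VarianceThreshold` removes features whose variance does not exceed a chosen cutoff. The data and the cutoff of 0.2 below are assumptions picked for the example; a suitable cutoff depends on the scale of your own features.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: column 0 is all zeros, column 1 is almost all zeros,
# column 2 varies from row to row.
X = np.array([
    [0, 0, 1],
    [0, 0, 4],
    [0, 1, 2],
    [0, 0, 7],
    [0, 0, 3],
])

# Keep only the features whose variance is greater than the threshold.
selector = VarianceThreshold(threshold=0.2)
X_filtered = selector.fit_transform(X)

# Boolean mask over the original columns: True means the column was kept.
kept = selector.get_support()

print(kept)
print(X_filtered.shape)
```

In this toy matrix the two mostly-zero columns have variances of 0 and 0.16, so both fall below the cutoff and only the third column survives.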

2. Remove features from the model

This is one of the easiest and fastest ways to handle sparse datasets. It involves removing features that are not very important for model training.

Take particular care, however: sparse features may contain useful and important information, and removing them can lead to poor performance or accuracy.

Drop entire columns with sparse data:

import pandas as pd
# df is assumed to be an existing DataFrame; drop() is a DataFrame method
df = df.drop(['SparseColumnName'], axis=1)
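If you do not know in advance which columns are sparse, one possible approach is to measure the fraction of zeros per column and drop the columns above a cutoff. This is a sketch only: the DataFrame, its column names, and the 70% cutoff are all hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "rare_flag": [0, 0, 0, 1, 0],
    "unused": [0, 0, 0, 0, 0],
})

# Fraction of zero values in each column.
zero_fraction = (df == 0).mean()

# Assumed cutoff: drop columns where more than 70% of the values are zero.
threshold = 0.7
sparse_columns = zero_fraction[zero_fraction > threshold].index.tolist()
df_reduced = df.drop(columns=sparse_columns)

print(sparse_columns)
print(df_reduced.columns.tolist())
```

Before dropping anything, it is worth checking whether a mostly-zero column such as `rare_flag` still carries signal for the target, for the reason given above.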

Converting columns with a sparse data type to dense:

import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1, 0])})

# to_dense() returns a new DataFrame, so assign the result
df = df.sparse.to_dense()
print(df)

3. Use methods that are not affected by sparse datasets

Some machine learning models are robust to sparse datasets, and their behavior is not affected by sparsity. This approach can be used if there are no restrictions on which algorithms you may use.

For example, the standard K-means algorithm is affected by sparse datasets and performs poorly on them, resulting in low accuracy. The entropy-weighted k-means algorithm, in contrast, is not affected by sparse data and gives reliable results, so it can be used when working with sparse datasets.

Conclusion

Sparse data is a common problem in machine learning, especially when one-hot encoding is used. Because of the problems it causes (overfitting, poor model performance, and so on), handling this type of data is recommended for better model building and higher model performance.

A few key takeaways from this blog:

1. Sparse data is very different from missing data. It is a form of data that contains a large number of zero values.

2. Sparse data should be handled properly to avoid issues such as high time and space complexity, poor model performance, and overfitting.

3. Reducing dimensionality, converting sparse features into dense features, and using algorithms such as entropy-weighted k-means can all be solutions when dealing with sparse datasets.

Media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.
