In this blog, I will show how to deal with missing values (NaN) in data science before passing the data through machine learning models.
“Machine Intelligence is the last invention that humanity will ever need to make.” - Nick Bostrom
In real-world data science problems, we rarely get data in a clean, well-structured form. Features often contain missing values, sometimes only a few, sometimes a great many, and we cannot pass data with these missing values through machine learning models.
In feature engineering, handling missing values is a very important step: training on such data may raise errors, and even if the model trains successfully, the missing values can hurt its accuracy. So we have to handle missing values first.
None : Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object that is often used to mark missing data in Python code. Because None is a Python object, it cannot be stored in an arbitrary NumPy/Pandas array, but only in arrays with data type 'object'.
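A quick sketch of what this means in practice: as soon as an array contains None, NumPy has to fall back to the generic 'object' dtype, and numeric operations on it can fail.

```python
import numpy as np

# An array containing None cannot use a native numeric dtype
arr = np.array([1, None, 3, 4])
print(arr.dtype)  # object

# Aggregations hit the None and raise a TypeError
try:
    arr.sum()
except TypeError:
    print("sum() fails on an object array containing None")
```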
NaN : Missing numerical data
The other missing-data representation, NaN (acronym for Not a Number), is different: it is a special floating-point value recognized by all systems that use the standard IEEE 754 floating-point representation:
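Unlike None, NaN lives happily inside fast numeric arrays, at the cost of "infecting" ordinary aggregations; NumPy ships NaN-aware alternatives for exactly this reason.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])
print(vals.dtype)       # float64 — NaN forces a floating-point dtype

# NaN propagates through arithmetic...
print(1 + np.nan)       # nan
print(vals.sum())       # nan

# ...so NumPy provides NaN-ignoring aggregates
print(np.nansum(vals))  # 8.0
```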
NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:
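A minimal example of that conversion: when a numeric Series receives a None, pandas upcasts to float64 and stores it as NaN.

```python
import numpy as np
import pandas as pd

# None is silently converted to NaN in a numeric Series
s = pd.Series([1, None, 3])
print(s.dtype)           # float64
print(s.isnull().sum())  # 1 missing value detected
```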
How to handle Missing data
Pandas provides several methods for detecting and treating null values:
isnull() : generates a Boolean mask indicating missing values
notnull() : opposite of isnull()
dropna() : returns a filtered version of the data
fillna() : returns a copy of the data with missing values filled or imputed
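The four methods above in one small, illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': ['x', 'y', None]})

print(df.isnull())    # Boolean mask of missing entries
print(df.notnull())   # inverse mask
print(df.dropna())    # drops every row containing a NaN/None
print(df.fillna(0))   # replaces missing values with 0
```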
1. Dropping Rows
We can drop rows when a particular feature contains only a few NaN values. Make sure the dataset has enough observations that dropping some rows does not shrink it too much for training the model. Removing data leads to loss of information, which may keep the model from giving the expected results when predicting the output.
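A quick sketch on a hypothetical toy dataset (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age':  [22.0, np.nan, 35.0, 41.0],
                   'fare': [7.25, 71.28, np.nan, 13.0]})

# Keep only the fully observed rows
cleaned = df.dropna()
print(len(df), '->', len(cleaned))  # 4 -> 2
```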
2. Dropping Features
We can drop a feature from the dataset when it contains a very large number of NaN values, say about 70-75% of the rows. Before dropping a feature, make sure it has very little correlation with the target. As with dropping rows, removing data leads to loss of information, which may hurt the model's predictions.
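A small sketch of that rule, using an assumed 70% threshold on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'deck': [np.nan, 'C', np.nan, np.nan],  # 75% missing
                   'age':  [22, 38, 26, 35]})

# Drop every column where the fraction of missing values exceeds the threshold
threshold = 0.7
mostly_missing = df.columns[df.isnull().mean() > threshold]
df = df.drop(columns=mostly_missing)
print(df.columns.tolist())  # ['age']
```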
3. Replacing with mean/mode/median
This statistical method can be applied to numeric features in the dataset. We can calculate the mean, mode, or median of the observed values in a feature and use it to replace the missing values. This method prevents the loss of data, which gives better results than dropping rows or columns.
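All three variants fit naturally with fillna(); here is a sketch on a toy Series (note that mode() can return several values when there is a tie, so we take the first):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 5.0])

filled_mean = s.fillna(s.mean())       # mean of observed values = 3.0
filled_median = s.fillna(s.median())   # median of observed values = 3.0
filled_mode = s.fillna(s.mode()[0])    # first of the most frequent values
```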
4. Assigning unique category
This method is for categorical features that have a definite number of possible classes. Since the classes are finite, we can assign an additional class to the missing values.
In the example below, the Cabin feature holds categorical data with missing values, which can be replaced with a new category, say 'Missing'.
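A minimal sketch of the Cabin example (the values shown are illustrative):

```python
import numpy as np
import pandas as pd

cabin = pd.Series(['C85', np.nan, 'E46', np.nan], name='Cabin')

# Treat missingness itself as its own category
cabin = cabin.fillna('Missing')
print(cabin.tolist())  # ['C85', 'Missing', 'E46', 'Missing']
```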
5. Predicting the missing values
Using the features that do not have null values, we can predict the missing values with the help of machine learning algorithms. This method can give good accuracy. We can experiment with different machine learning algorithms and check which gives the best results.
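One way to sketch this idea, assuming a linear regression as the predictor and hypothetical 'fare'/'age' columns: train on the rows where the target feature is observed, then fill the gaps with the model's predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data: predict missing 'age' from 'fare'
df = pd.DataFrame({'fare': [7.0, 30.0, 15.0, 80.0, 12.0],
                   'age':  [22.0, 38.0, np.nan, 54.0, np.nan]})

known = df[df['age'].notnull()]      # rows usable as training data
unknown = df[df['age'].isnull()]     # rows whose 'age' we will predict

model = LinearRegression()
model.fit(known[['fare']], known['age'])

# Fill the gaps with the model's predictions
df.loc[df['age'].isnull(), 'age'] = model.predict(unknown[['fare']])
print(df['age'].isnull().sum())  # 0
```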
6. Using Imputation
A basic strategy for using an incomplete dataset is to discard entire rows or columns containing missing values, but a better one is often to impute them. scikit-learn provides this through its sklearn.impute module. For more details, visit the scikit-learn documentation:
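A minimal sketch with SimpleImputer, which replaces each missing entry with a per-column statistic ('mean' here; 'median', 'most_frequent', and 'constant' are also available):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
print(X_filled)
# Column means of the observed values: (1+7)/2 = 4.0 and (2+3)/2 = 2.5
```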
7. Using Algorithms Which Support Missing Values
There are some machine learning algorithms that can be used in the presence of null values. k-NN is one such approach: it works on the principle of distance measures.
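The same distance-based idea powers scikit-learn's KNNImputer, which fills each gap from the values of its nearest complete neighbours; a minimal sketch:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Each missing entry is filled with the mean of that feature
# across the n_neighbors nearest rows (by distance on the
# features both rows have observed)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```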