Updated: May 14
Data cleaning is an important step in processing data. Finding missing values is part of data cleaning, and it is one of the most common problems beginners run into, so this post may help you overcome it.
Missing values are denoted in a data frame as NaN, which stands for Not a Number. There are various ways to handle missing values to improve modeling. Cleaning data with missing values doesn't mean dropping all of them; sometimes dropping NaN values distorts the data and leads to a bad model. Simply skipping the missing data may also cause biased statistical results, and many ML algorithms do not support data with missing values.
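Before choosing a technique, it helps to see how many values are missing in each column. The snippet below is a minimal sketch on a made-up data frame; the column names and values are assumptions for illustration, not the dataset used in this post.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "Street": ["Pave", None, "Pave", "Grvl"],
})

# isna() marks both np.nan and None as missing; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# NaN never compares equal to itself, so test with isna() instead of == np.nan.
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```

The per-column counts are what the later steps rely on when deciding whether to drop a feature, drop rows, or fill values.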
There are three types of missing data:
Missing Completely at Random: There is no pattern in the missing data on any variables.
Missing at Random: There is a pattern in the missing data, but not on your primary dependent variables.
Missing Not at Random: There is a pattern in the missing data that affects your primary dependent variables. This is the worst-case scenario.
There are various techniques for dealing with missing data. Let's understand them using an example. Below is a home-price dataset:
1} Dropping features:
When more than 50% of a feature's values are missing, you can drop that feature. In the above image, the data frame has shape (1460, 81), which means there are 1460 rows and 81 columns. When we check for null values, the feature PoolQC has 1453 of them, which is almost equal to the number of rows in the data frame. We treat this as the first category mentioned above (Missing Completely at Random), so we drop this feature.
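The 50% rule can be sketched in a few lines of pandas. The data frame here is a tiny invented stand-in (only the PoolQC name comes from the example above), and the threshold variable is an assumption you can tune.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the housing data; the values are invented.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})

# Fraction of missing values in each column (mean of a boolean mask).
missing_fraction = df.isna().mean()

# Collect the columns whose missing fraction exceeds the threshold.
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

# drop() returns a new data frame; the original df is left untouched.
df_reduced = df.drop(columns=to_drop)
print(to_drop)
print(df_reduced.shape)
```

Working with fractions rather than raw counts keeps the same rule usable on data frames of any size.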
2} Replacing with mean/mode/median:
Numeric data can be handled with statistical methods, such as replacing missing values with the mean, median, or mode of that particular feature.
In the above image, you can see that the selected feature has dtype float and that its missing data has been replaced with the mean, computed using the mean function.
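A minimal sketch of mean replacement follows. The column name LotFrontage and its values are assumptions for illustration; the post does not name the feature it used.

```python
import numpy as np
import pandas as pd

# Hypothetical float feature with one missing value.
df = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})

# mean() skips NaN by default, so this averages the three known values.
mean_value = df["LotFrontage"].mean()

# Assign the filled column back instead of relying on inplace modification.
df["LotFrontage"] = df["LotFrontage"].fillna(mean_value)
print(df)
```

To use the median instead, swap in `df["LotFrontage"].median()`; for the mode, `df["LotFrontage"].mode()[0]` picks the first most frequent value.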
3} Dropping rows:
You can drop rows when only a small share of the data is missing (say, below 10%).
Make sure you have first handled the other features whose missing data is not below 10% of the total samples, since this step drops every row that contains a missing value.
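As a rough sketch, row dropping can be restricted to the columns that are only sparsely missing. The column names and values below are invented, and the 10% cutoff is the rule of thumb from the text rather than a fixed standard.

```python
import pandas as pd

# Hypothetical data: 20 rows, one of which lacks the "Electrical" value.
df = pd.DataFrame({
    "Electrical": ["SBrkr"] * 19 + [None],
    "SalePrice": list(range(100000, 120000, 1000)),
})

# Share of missing values in the column we want to clean.
missing_share = df["Electrical"].isna().mean()

# Drop rows only if the share is below the 10% rule of thumb.
if missing_share < 0.10:
    df_clean = df.dropna(subset=["Electrical"])
else:
    df_clean = df.copy()

print(missing_share)
print(len(df), len(df_clean))
```

Passing `subset` keeps the drop limited to the chosen column; a bare `dropna()` would remove every row with a missing value in any column.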
4} Assigning a unique category:
This method fills null values with a new value of their own. Suppose you have features with categorical values, which you have to convert to a one-hot encoding before processing the data. Replacing NaN values with a new letter or label makes them count as one more category of that feature.
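Here is a small sketch of the idea, assuming a hypothetical GarageType column and using the label "Missing" as the new category (any label not already used by the feature would do).

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with two missing entries.
df = pd.DataFrame({"GarageType": ["Attchd", np.nan, "Detchd", np.nan]})

# Give missing entries their own category label.
df["GarageType"] = df["GarageType"].fillna("Missing")

# One-hot encode; the new label becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["GarageType"])
print(encoded.columns.tolist())
```

After encoding, the model sees "was missing" as an ordinary category, which can be informative when the absence itself carries meaning.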
5} Imputation:
Imputation is the process of substituting missing data using statistical methods. It is useful because it preserves all cases by replacing missing data with an estimated value based on other available information. But imputation methods should be used carefully, as most of them introduce a large amount of bias and reduce variance in the dataset. You will get more information about imputation on :
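One simple way to estimate a value "based on other available information" is to fill each gap with the median of similar rows. The sketch below uses a group-wise median as the stand-in technique; the Neighborhood and LotFrontage columns and their values are assumptions for illustration, and dedicated imputation libraries offer more elaborate estimators.

```python
import numpy as np
import pandas as pd

# Hypothetical data: frontage is missing for one house in each neighborhood.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 70.0, 100.0, 110.0, np.nan],
})

# For every row, compute the median of its own neighborhood (NaN is skipped).
group_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")

# Fill each gap with the median of the matching group, aligned by index.
df["LotFrontage"] = df["LotFrontage"].fillna(group_median)
print(df)
```

Compared with a single global mean, a group-wise estimate uses one extra piece of information per row, but it still shrinks the variance of the filled column, which is the caution raised above.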
6} Propagatory methods:
You can fill null values with backward propagation (bfill or backfill) or forward propagation (ffill or pad). These methods are used on a reindexed Series.
backfill: Uses the next valid observation to fill the gap.
ffill: Propagates the last valid observation forward to the next valid one.
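Both directions can be sketched on a short Series. The values are invented for illustration; the last part shows the reindexing case, where new index labels are filled from the previous known value.

```python
import numpy as np
import pandas as pd

# Series with a two-value gap in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: carry the last valid value (1.0) into the gap.
forward = s.ffill()

# Backward fill: pull the next valid value (4.0) back into the gap.
backward = s.bfill()

# Reindexing introduces new labels (1 and 3); method="ffill" fills them
# from the nearest earlier label in the original index.
sparse = pd.Series([10.0, 20.0], index=[0, 2])
reindexed = sparse.reindex(range(4), method="ffill")

print(forward.tolist())
print(backward.tolist())
print(reindexed.tolist())
```

These fills assume the row order is meaningful, as in time series; on unordered data, neighboring rows say little about each other.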