In this blog post, I will introduce categorical variables in data science and statistics, and show how to handle categorical features during feature engineering.
Categorical data is data that takes a limited number of possible values. It represents characteristics such as a person's gender, marital status, hometown, or the types of movies they like. The values need not be numerical; they can be textual. Most machine learning models are mathematical models that work only with numbers, which is why we need to preprocess categorical data before passing it to a model.
Let's consider the following dataset:
import pandas as pd
data = pd.read_csv('Data.csv')
data.head()
You can see that Country and Purchased are the categorical variables, since each of them takes only a limited set of values.
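One quick way to spot categorical columns is to look at the column types and the number of distinct values in each column. The sketch below uses a small made-up table; the Age and Salary columns and all of the values are assumptions for illustration, not the actual contents of Data.csv.

```python
import pandas as pd

# Hypothetical sample rows; the real Data.csv may differ.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France"],
    "Age": [44, 27, 30, 38, 35],
    "Salary": [72000, 48000, 54000, 61000, 58000],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Non-numeric columns are candidates for categorical variables.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per column; a small count hints at a category.
unique_counts = sample.nunique()

print(categorical_cols)
print(unique_counts)
```

A numeric column can also be categorical (for example, a rating from 1 to 5), so the type check is only a first pass.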
In this post we will see how to use the Python scikit-learn library to handle categorical data. Scikit-learn is a machine learning toolkit that provides tools for many applications of machine learning, e.g. classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
How to handle categorical data
Categorical data is handled differently for the dependent and the independent variables. We will learn more about this later in the post. First, we need to split our dataset into the independent features matrix (X) and the dependent vector (y).
We will create the independent matrix (X) from the dataset:
X = data.iloc[:, :-1].values
print(X)
Then we will extract the dependent vector (y) from the dataset:
y = data.iloc[:, -1].values
print(y)
Encoding the Categorical Data for Independent Features Matrix X
Now we are going to encode the categorical data in Country so that it is converted to numbers, which can then be passed to the machine learning models.
Consider the following code:
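import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('Data.csv')
# Split into the features matrix X and the target vector y
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# Replace the Country column (index 0) with integer codes
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X)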
In the above code we have used the LabelEncoder class from sklearn.preprocessing to transform the labels in Country to numbers. The labels are numbered in sorted order, so in Country, 'France' is assigned 0, 'Germany' is assigned 1, and 'Spain' is assigned 2.
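To check which number each label received, the fitted encoder can be inspected through its classes_ attribute. The sketch below assumes only the three country names mentioned above; the sample array itself is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values using the three countries from the text.
countries = np.array(["France", "Spain", "Germany", "Spain", "France"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes)
```

Printing the mapping this way avoids guessing which integer stands for which country.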
Label encoding introduces a problem, however: the numbers suggest an order (Spain > Germany > France) that does not exist, and a model may treat that order as meaningful. The solution is achieved by introducing dummy variables. For each value of a category, a new column is added. So if the Country value of a row is France, that row gets 1 in the France column and 0 in the Spain and Germany columns.
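For example, with one column each for France, Germany, and Spain, a France row is converted to 1, 0, 0; a Germany row to 0, 1, 0; and a Spain row to 0, 0, 1.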
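The dummy-variable idea can be tried out quickly with pandas.get_dummies before moving to scikit-learn. This is only a minimal sketch on a made-up three-row column, not the method used in the rest of this post.

```python
import pandas as pd

# Hypothetical single-column table with the three countries from the text.
countries = pd.DataFrame({"Country": ["France", "Spain", "Germany"]})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(countries["Country"], dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column that matches its country.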
OneHotEncoder is the class in scikit-learn's preprocessing module that helps us achieve this with ease. Consider the following code block:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
df = pd.read_csv('Data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Categories not seen during fit are encoded as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# One new 0/1 column per distinct country
enc_df = pd.DataFrame(enc.fit_transform(X[['Country']]).toarray())
X = X.join(enc_df)
print(X)
You may notice that the number of columns in the dataset has increased. The 'Country' column is expanded into three new indicator columns, which the code above adds next to the original column. Thus, the number of columns in X increases from four to seven. Also, notice that after applying OneHotEncoder, the values in the pandas DataFrame are displayed in scientific notation.
Encoding the Dependent Vector Y
Encoding the dependent vector is much simpler than encoding the independent variables. For the dependent variable we don't have to apply one-hot encoding; the only encoding needed is label encoding. In the code below we apply label encoding to the dependent variable, which is 'Purchased' in our case.
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)
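The encoded target can be turned back into its original labels with inverse_transform, which is handy when a person has to read the predictions. The sketch below assumes that Purchased holds 'Yes' and 'No' values; the actual labels in Data.csv may differ.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; the real Purchased column may differ.
y = np.array(["No", "Yes", "No", "Yes", "Yes"])

labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y)
print(y_encoded)

# Map the integer codes back to the original labels.
y_labels = labelencoder_y.inverse_transform(y_encoded)
print(y_labels)
```

The same fitted encoder has to be kept around, since it holds the mapping between codes and labels.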
Understanding categorical data is one of the most important aspects of working in data science. People find data easy to understand when it is presented in categorical form. Computers, on the other hand, do not work easily with this kind of data, because mathematical equations cannot take input in this form. A firm understanding of the concepts needed to handle categorical data is therefore a requirement when you start designing machine learning solutions.
It is worth mentioning that not just the input but also the final output of your model is important. If the output of your model is an input to some other data engine, then it is best to leave it in numeric form. However, if the final user of the solution is a human, then you will probably want to convert the numeric data back to categories so that it is easy to make sense of.