Different types of Encoding Techniques in Machine Learning

Abdullah Zunorain
4 min read · Jun 16, 2024


In the ever-evolving world of machine learning, data is the cornerstone of every model’s success. However, raw data often comes in various forms that are not immediately suitable for algorithmic processing. This is where encoding steps in, transforming data into a format that machine learning models can understand and leverage effectively. In this article, we’ll explore the diverse types of encoding techniques and their applications in machine learning.

The types of encoding in machine learning can be broadly categorized into two main groups: categorical data encoding and numerical data encoding. Here are the common types of encoding used in machine learning:

1. Categorical Data Encoding

Categorical data encoding is used to convert categorical (qualitative) data into numerical values so that machine learning algorithms can process them.

Categorical data represents qualitative attributes, such as names, labels, or other identifiers that need to be converted into numerical values. Here are some popular categorical encoding methods:

Label Encoding:

  • Converts each unique category value to an integer.
  • Example: ['cat', 'dog', 'mouse'] becomes [0, 1, 2].
  • Suitable for ordinal data where there is a meaningful order.
from sklearn.preprocessing import LabelEncoder

# Sample data (standing in for a dataset's "Gender" column)
data = ['Male', 'Female', 'Female', 'Male']
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print("Encoded Data:", encoded_data)
print("Length of Encoded Data:", len(encoded_data))

One-Hot Encoding:

  • Converts each category value into a new binary column.
  • Example: ['cat', 'dog', 'mouse'] becomes [[1, 0, 0], [0, 1, 0], [0, 0, 1]].
  • Suitable for nominal data where there is no ordinal relationship.
import pandas as pd

# Sample data (standing in for a dataset's "CALC" column)
data = ['no', 'Sometimes', 'Frequently', 'no']
df = pd.DataFrame(data, columns=['CALC'])
one_hot_encoded = pd.get_dummies(df['CALC'])
print(one_hot_encoded)

Binary Encoding:

  • Converts categories into binary numbers and splits these binary digits into separate columns.
  • Example: ['A', 'B', 'C'] might become ['001', '010', '011'], which is then split into separate columns.
  • Useful when dealing with high cardinality categorical data.
import pandas as pd
import category_encoders as ce

# Sample data (standing in for a dataset's "CAEC" column)
data = ['no', 'Sometimes', 'Frequently', 'Always']
df = pd.DataFrame(data, columns=['CAEC'])
binary_encoder = ce.BinaryEncoder(cols=['CAEC'])
binary_encoded = binary_encoder.fit_transform(df)
print(binary_encoded)

Target Encoding (Mean Encoding):

  • Replaces each category with the mean of the target variable for that category.
  • Example: For a target variable, categories like ['A', 'B', 'C'] might be encoded based on their mean target values [0.5, 0.3, 0.8].
import pandas as pd
import category_encoders as ce

# Sample data
data = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A', 'B', 'C'],
    'target': [1, 2, 3, 4, 5, 6]
})
target_encoder = ce.TargetEncoder(cols=['category'])
target_encoded = target_encoder.fit_transform(data['category'], data['target'])
print(target_encoded)

Frequency Encoding:

  • Replaces each category with its frequency or count.
  • Example: ['cat', 'cat', 'dog'] might become [2, 2, 1].
import pandas as pd

# Sample data (standing in for a dataset's "MTRANS" column)
data = pd.Series(['Walking', 'Automobile', 'Automobile', 'Public_Transportation'])
frequency_encoded = data.map(data.value_counts())
print(pd.DataFrame(frequency_encoded))

Ordinal Encoding:

  • Assigns an integer value to each category, preserving the order.
  • Example: ['low', 'medium', 'high'] becomes [0, 1, 2] (scikit-learn starts counting at 0).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data
data = [['low'], ['medium'], ['high']]
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordinal_encoded = ordinal_encoder.fit_transform(data)
print(pd.DataFrame(ordinal_encoded))

2. Numerical Data Encoding

Numerical data encoding involves transforming numerical features to make them more suitable for certain machine learning algorithms.

Normalization (Min-Max Scaling):

  • Scales the data to a fixed range, usually [0, 1].
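As a minimal sketch, scikit-learn's MinMaxScaler does exactly this (the sample values below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical feature values
data = np.array([[10.0], [20.0], [30.0], [40.0]])
scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = scaler.fit_transform(data)
print(scaled.ravel())  # smallest value maps to 0, largest to 1
```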

Standardization (Z-Score Normalization):

  • Scales the data to have a mean of 0 and a standard deviation of 1.
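A quick sketch with scikit-learn's StandardScaler (sample values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical feature values
data = np.array([[10.0], [20.0], [30.0], [40.0]])
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
# Result has mean 0 and (population) standard deviation 1
print(standardized.ravel())
```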

Binning (Discretization):

  • Divides continuous values into discrete bins or intervals.
  • Example: Age can be binned into categories like ['0-10', '11-20', '21-30'].
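The age example above can be sketched with pandas' `cut` (the ages are hypothetical):

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([4, 15, 25, 8, 19])
# Bin edges (0, 10], (10, 20], (20, 30] with readable labels
binned = pd.cut(ages, bins=[0, 10, 20, 30], labels=['0-10', '11-20', '21-30'])
print(list(binned))
```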

Log Transformation:

  • Applies the logarithm to data, often used to reduce skewness.
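A minimal sketch using NumPy; `log1p` (log of 1 + x) is a common choice because it is safe when the data contains zeros:

```python
import numpy as np

# Hypothetical right-skewed values (e.g. incomes)
skewed = np.array([1.0, 10.0, 100.0, 1000.0])
log_transformed = np.log1p(skewed)  # log(1 + x)
print(log_transformed)
```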

Polynomial Features:

  • Generates new features by taking powers of existing features.
  • Example: For a feature x, polynomial features could include x^2, x^3, etc.
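The x, x^2, x^3 example can be sketched with scikit-learn's PolynomialFeatures (input values are made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single hypothetical feature x
X = np.array([[2.0], [3.0]])
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x, x^2, x^3
print(X_poly)
```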

Power Transformation:

  • Stabilizes variance and makes the data more Gaussian-like.
  • Includes techniques like Box-Cox and Yeo-Johnson transformations.
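A sketch using scikit-learn's PowerTransformer, which supports both methods mentioned above (Box-Cox requires strictly positive data; the sample values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed, strictly positive values
data = np.array([[1.0], [2.0], [3.0], [10.0], [50.0]])
pt = PowerTransformer(method='yeo-johnson')  # or method='box-cox' for positive data
transformed = pt.fit_transform(data)  # standardized to zero mean by default
print(transformed.ravel())
```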

Conclusion

Understanding and choosing the right encoding technique is crucial for the success of any machine learning project. Each encoding method has its strengths and is suited to specific types of data and machine learning tasks. By leveraging the appropriate encoding strategies, you can ensure that your data is in the best possible format for model training, leading to more accurate and robust predictions. Happy encoding!
