Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, making them more effective. It involves selecting and transforming variables, creating new features from existing ones, and preparing the data for machine learning algorithms.
Feature engineering is like sculpting a masterpiece. You begin with a block of raw material (raw data), and through careful shaping and refining (feature engineering), you transform it into a beautiful sculpture that captures the essence of your vision (a powerful model). When executed skillfully, each crafted detail enhances the final work, creating a captivating result that resonates with its audience.
Enables Gradient Descent To Run Much Faster
Suppose we are trying to predict house prices in a particular area based on 2 parameters – bedroom size (x1) and number of bedrooms (x2).
Price = w1x1 + w2x2 + b
Assume the x1 (bedroom size) values in the dataset range from 300 sq ft to 2,000 sq ft, while the x2 (number of bedrooms) values range from 0 to 5.
With ranges that different, the cost surface becomes badly stretched, and gradient descent struggles to converge quickly or accurately.
We can engineer the features using the steps discussed below to make training more accurate and efficient.
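To make the mismatch concrete, here is a minimal sketch of what such a dataset might look like (the column names and values are made up purely for illustration):

```python
import pandas as pd

# Tiny made-up dataset with two features on very different scales:
# bedroom size is in the hundreds-to-thousands of sq ft, bedroom count only 0-5.
houses = pd.DataFrame({
    "bedroom_sqft": [300, 850, 1200, 1600, 2000],
    "bedrooms": [1, 2, 3, 4, 5],
    "price": [50_000, 120_000, 180_000, 230_000, 300_000],
})

print(houses.describe())  # note how different the two feature ranges are
```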
Let us go through some terminology and methods that will help us do what we need to do.
Feature engineering encompasses tasks such as:
Feature Transformation
Feature transformation is a step in feature engineering that involves utilizing mathematical approaches to change the values of features in order to improve the performance of machine learning models.
This includes
– Handling missing values
– Converting features to numerical values, as computers do not understand human language
– Detecting outliers
– Scaling features to a specific range.
Handling Missing Values
Two ways are discussed below –
Imputation is a technique in feature engineering used to fill in missing values in a dataset, typically by substituting with the mean, median, mode, or a specific value. This helps maintain the integrity of the data and ensures that the model can process the entire dataset without errors due to missing values.
Deletion is a technique in feature engineering where rows or columns with missing values are removed from the dataset. This approach ensures that only complete data is used, but it can lead to loss of valuable information, especially if the missing data is extensive.
Which technique to use when – when there are a lot of missing values, use imputation to avoid losing data. If there are only a few missing values, deleting those rows is simpler and usually more efficient.
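As a rough sketch of both approaches (using pandas and scikit-learn on a small made-up column), it might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with a couple of missing bedroom sizes
df = pd.DataFrame({"bedroom_sqft": [300, np.nan, 1200, 1600, np.nan],
                   "bedrooms": [1, 2, 3, 4, 5]})

# Imputation: fill the missing values with the column mean
imputer = SimpleImputer(strategy="mean")
df_imputed = df.copy()
df_imputed[["bedroom_sqft"]] = imputer.fit_transform(df[["bedroom_sqft"]])

# Deletion: drop any row that still contains a missing value
df_deleted = df.dropna()
```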
Handling Categorical Values
As seen earlier, machine learning models only understand numbers, so categorical values should be converted to numerical values. For example, yes or no -> 1 or 0, or cycle, car, bike -> 0, 1, 2, and so on. This process is called encoding.
Nominal data is categorical data without any order, such as states or branch of study (like your domain). It can be converted to numerical data using one-hot encoding, for which scikit-learn provides the OneHotEncoder() class.
Ordinal data is data that has an order, such as grade levels (A+, A, B+, B, C). It can be encoded using ordinal encoding, for which scikit-learn provides the OrdinalEncoder() class.
The above two encoders are meant for explanatory variables (x). For the target variable (y), we should use LabelEncoder(), which is designed specifically for output labels.
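Here is a minimal sketch of all three encoders from scikit-learn; the column names and category values are just placeholders:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

df = pd.DataFrame({
    "vehicle": ["cycle", "car", "bike", "car"],   # nominal: no natural order
    "grade":   ["A+", "B", "A", "B+"],            # ordinal: has an order
    "bought":  ["yes", "no", "yes", "yes"],       # target variable (y)
})

# One-hot encode the nominal column (each category becomes its own 0/1 column)
ohe = OneHotEncoder()
vehicle_encoded = ohe.fit_transform(df[["vehicle"]]).toarray()

# Ordinal encode the ordered column, passing the order explicitly
oe = OrdinalEncoder(categories=[["C", "B", "B+", "A", "A+"]])
grade_encoded = oe.fit_transform(df[["grade"]])

# Label encode the target variable
le = LabelEncoder()
y = le.fit_transform(df["bought"])   # "no" -> 0, "yes" -> 1
```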
Handling Outliers
Outliers are anomalies. Weird kid in your class or a rose in a sunflower field. You get the idea.
Outliers are data points that are significantly different from the rest of the data set. They can affect the accuracy of our model.
There are many ways to treat outliers, but let us look at two major ones –
Trimming
Removing the outliers from the dataset entirely.
Capping
Adjusting the values of outliers so they are closer to the rest of the values in your dataset (for example, clipping them at an upper and lower bound).
If only a few outliers are present in your dataset, consider trimming. If there are many outliers, you should consider capping.
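Sketched with the common IQR rule (the numbers below are illustrative, and the 1.5 multiplier is just the usual convention):

```python
import pandas as pd

prices = pd.Series([120, 130, 125, 128, 132, 127, 950])   # 950 is an obvious outlier

# IQR-based bounds
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: drop the outliers entirely
trimmed = prices[(prices >= lower) & (prices <= upper)]

# Capping: pull the outliers back to the nearest bound instead of dropping them
capped = prices.clip(lower=lower, upper=upper)
```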
Feature Scaling
You can think of this as a reservation system, which most of us hate. Some people have an unfair advantage, and we want everyone to get an equal opportunity. Similarly, some features may have a much larger range than others, which can give them an unfair advantage. "Feature scaling is the process of transforming the features in a dataset so that they have a common scale."
This helps to prevent some features from dominating the model like me in League of Legends / Valorant.
2 Common Types of Feature Scaling
Standardization – Standardization subtracts the mean from each feature and then divides by the standard deviation, so that each feature has a mean of 0 and a standard deviation of 1. It is often used when the data roughly follows a Gaussian distribution, and with models such as linear regression and logistic regression.
Normalization – Normalization rescales the features so that they lie within a fixed range, such as 0 to 1. It is often used when the data does not follow a Gaussian distribution, and with distance-based models such as k-nearest neighbours and support vector machines.
Feature scaling improves the performance of the model and makes sure that all features contribute, not just a few dominant ones.
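Both can be done in a couple of lines with scikit-learn; here is a rough sketch on the made-up house data from earlier:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = pd.DataFrame({"bedroom_sqft": [300, 850, 1200, 1600, 2000],
                  "bedrooms": [1, 2, 3, 4, 5]})

# Standardization: each feature ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): each feature is squeezed into [0, 1]
X_norm = MinMaxScaler().fit_transform(X)
```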
Feature Construction
Imagine you’re sculpting a statue of a lion. Initially, you have a block of marble, which is like having raw data in its unprocessed form. To create features, you begin chiseling away at the marble to reveal specific characteristics: the mane, muscular structure, facial expression, and pose of the lion.
But do not go overboard, as this can detract from the overall coherence and quality of the final sculpture (the model). Always aim to create the right number of features to make the data more informative; overdoing it makes the data less useful and the model perform worse.
Again, there are many ways to achieve this, and I won't discuss them in detail. Just know it is up to you to decide how many and which features will do the work for your model. You can use feature selection algorithms and libraries to help, or you can go all the way yourself. Keep tweaking and make that model work for you. A small example is sketched below.
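As one small, made-up example of feature construction using the house data from earlier, combining the two raw columns can carry information that neither column does on its own:

```python
import pandas as pd

houses = pd.DataFrame({"bedroom_sqft": [300, 850, 1200, 1600, 2000],
                       "bedrooms": [1, 2, 3, 4, 5],
                       "price": [50_000, 120_000, 180_000, 230_000, 300_000]})

# Construct a new feature from the existing ones:
# total bedroom area = per-bedroom size multiplied by the number of bedrooms
houses["total_bedroom_area"] = houses["bedroom_sqft"] * houses["bedrooms"]
```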
Conclusion
In conclusion, feature engineering is a crucial step in the data science process that involves transforming, constructing, selecting, and extracting meaningful features from raw data.
Remember, the key to successful feature engineering is finding the right balance. Adding too many features can make the data messy, while a few well-chosen features can greatly improve model performance.
By mastering feature engineering techniques, data scientists can uncover hidden patterns, improve model performance, and derive deeper insights from their data.
Until Next Time ^^