Feature engineering takes somewhere around 30 percent of the entire project time, so it is a very big chunk. Many, many people have asked me questions like: Krish, what is the exact order? After I get the raw data, what should I do first? If you remember, in the life cycle of a data science project, the first module that comes is feature engineering, then feature selection, then model creation, then hyperparameter tuning, and then model deployment; you also have incremental learning, and there are many more steps as such. But the most important part, the crux, the backbone of the entire data science project, is feature engineering, because this is where you will be cleaning the data and doing a lot of steps. So let me talk about every step that you may be performing in feature engineering, step by step. The first step I want to discuss, let me write it down: step one is EDA, which is nothing but exploratory data analysis. This is a very, very important thing, and remember, guys, on my YouTube channel I have created dedicated playlists on feature engineering and on EDA, and I will be giving those links at the end. So step one is EDA, that is, exploratory data analysis.
Now you may be thinking: okay, fine, exploratory data analysis, is it only about analysing the data? No, there are many, many steps that we actually perform here, so let me write them down one by one. The entire feature engineering is done on the raw data itself, so as soon as we get the raw data, we start doing the analysis. Now, what kind of analysis? Let me give you one example of what I actually follow. As soon as I get the data, I first see how many numerical features there are. Then I see how many categorical (discrete) features there are. For the numerical features, I will try to draw different diagrams like histograms and PDF (density) plots, and obviously, for all these things, you can use libraries like Seaborn; I hope everybody is familiar with Seaborn. You can also use Matplotlib to draw these kinds of diagrams. Then, for the categorical features, I will try to analyze them: how many categorical features there are, and how many unique categories each of those features has. All these observations are really necessary.
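To make this first pass concrete, here is a minimal sketch (the column names and data are made up purely for illustration) of separating numerical and categorical features with pandas; the Seaborn plotting calls are left as comments since they only draw figures:

```python
import pandas as pd

# Hypothetical raw dataset, invented for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "salary": [30000.0, 45000.0, 52000.0, 61000.0],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "purchased": ["yes", "no", "yes", "yes"],
})

# Split features into numerical and categorical based on dtype
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include="object").columns.tolist()
print(numerical)    # ['age', 'salary']
print(categorical)  # ['city', 'purchased']

# How many unique categories each categorical feature has
for col in categorical:
    print(col, df[col].nunique())

# Distributions of the numerical features could then be visualized, e.g.:
# import seaborn as sns
# sns.histplot(df["salary"])   # histogram
# sns.kdeplot(df["salary"])    # PDF-style density curve
```

This is just one way to do the split; features stored as strings that are really numbers would need converting first.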
Now, coming to the third step that I will definitely follow: I will try to see whether there are any missing values, and I will try to visualize them clearly with graphs. If there are missing values, I note them down, and then I may go to my fourth step: I will try to see whether there are outliers. And how do you spot an outlier? Simple: a box plot. I will go with a box plot and see whether there are any outliers. These observations are very much necessary, because all the diagrams you draw will need to be sent to your analytics manager; that is the deliverable of your EDA. And this is just the first step of the entire feature engineering; trust me, there are many more steps, which I will be telling you about in just a while. Outliers, missing values, categorical features, numerical features... also note that there are three to four different ways of handling missing values; values can be missing for different reasons, and based on that, you have to act accordingly. You will also try to see whether the raw data needs cleaning or not.
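A small sketch of that missing-value check, on a made-up dataset; the visualization calls are shown only as comments since they just draw plots:

```python
import numpy as np
import pandas as pd

# Made-up raw data with gaps, just for illustration
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, np.nan],
    "salary": [30000.0, 45000.0, np.nan, 61000.0, 52000.0],
})

# Missing-value count and percentage per feature
print(df.isnull().sum())
print((df.isnull().mean() * 100).round(1))

# These observations are usually visualized before deciding how to act, e.g.:
# import seaborn as sns
# sns.heatmap(df.isnull(), cbar=False)   # map of missing cells
# sns.boxplot(x=df["salary"])            # box plot to spot outliers
```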
This is also a very important step. The raw data may have a lot of information packed into just one feature, and you have to decide whether you require all of that information or not. But again, understand the main goal here: we are trying to convert the raw data into useful data, so that our ML algorithms will be able to ingest it properly and give good predictions. So in the EDA part, we look at all these things. Now let's come to the second step, which is very simple. In the second step, what I always do is start handling the missing values. This is very important, and there are various ways of handling them. You may be saying: okay, Krish, we may use mean, median, and mode, right? Yes, but not only mean, median, and mode. I will also analyze those features and try to see whether there is an outlier in each particular feature, because that affects which technique I choose. These are just some of the steps; we have a lot of other ways to handle missing values, and mean, median, and mode are just a few of them. I may also replace some values by considering different techniques, and the entire details are in my feature engineering playlist.
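As one hedged sketch of the mean/median/mode idea (the data here is invented): median for a numerical feature, mode for a categorical one:

```python
import numpy as np
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 37.0],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Median for a numerical feature (robust to outliers, unlike the mean)
df["age"] = df["age"].fillna(df["age"].median())

# Mode (most frequent category) for a categorical feature
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df["age"].tolist())   # the NaN becomes 42.0, the median of 25, 47, 51, 37
print(df["city"].tolist())  # the None becomes "Delhi"
```

Which statistic to use depends on the feature's distribution and why the values are missing, which is exactly what the earlier analysis is for.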
For example, I may create a lot of box plots and use the IQR rule for removing outliers; if you remember, there is a formula with respect to the IQR to remove the outliers. After handling the outliers, I can handle the missing values with the median. In short, if you don't want the impact of the outliers, you can directly use the median or the mode. So that is basically the second step. In the third step, what I actually do is handle the imbalanced dataset. This is also a very important step, because not all machine learning algorithms work well with an imbalanced dataset. You may be thinking that you have got amazing accuracy, but because of the imbalanced dataset, the model may actually be a very bad one. The fourth step is treating the outliers. This is also a very important step; there are two to three ways to handle outliers, which you should definitely explore. I am just telling you, step by step, whatever I do. One more step is scaling the data, bringing everything onto the same scale; here we use different processes like standardization and normalization. All these techniques are actually used in feature engineering.
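The IQR rule and the scaling step just mentioned can be sketched like this (made-up numbers; scikit-learn is assumed to be available):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature with one extreme value
s = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 250.0])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outlier candidates
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
cleaned = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]
print(cleaned.tolist())  # 250.0 is dropped

# Scaling the cleaned feature onto a common scale
X = cleaned.to_numpy().reshape(-1, 1)
standardized = StandardScaler().fit_transform(X)  # standardization: mean 0, std 1
normalized = MinMaxScaler().fit_transform(X)      # normalization: range [0, 1]
print(standardized.ravel().round(2))
print(normalized.ravel().round(2))
```

Whether you remove, cap, or keep outliers depends on the domain; this only shows the mechanics of the IQR formula and the two scalers.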
Coming to the sixth step, this is very important: converting the categorical features into numerical features. Let me give you one example. Suppose you have a feature like pin code; in pin code you have many different values, so many unique categories, so you have to think about which technique you will use in order to convert this categorical feature into numerical features. Now, let's step back and see what we are actually doing with all of this. If I go from step one, EDA; step two, handling the missing values; step three, handling the imbalanced dataset; step four, treating the outliers; step five, scaling the data; and step six, converting the categorical features into numerical features; once I perform all these steps, I would say feature engineering is about 90 percent completed. And don't think that you will be able to do all this in one or two days. If you have a small dataset, obviously I'll say you will be able to do it in three to four hours, but understand, I have worked with datasets with one million records, and doing all these things takes time. Always make sure that you follow this process.
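To make the pin-code example concrete, here is a small sketch (invented data): one-hot encoding for a low-cardinality feature, and frequency encoding as one common option when a feature has too many unique categories:

```python
import pandas as pd

# Invented example rows
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "pincode": ["110001", "400001", "110001", "411001"],
})

# One-hot encoding suits features with only a few categories
one_hot = pd.get_dummies(df["city"], prefix="city")
print(one_hot.columns.tolist())  # ['city_Delhi', 'city_Mumbai', 'city_Pune']

# For a high-cardinality feature like pin code, one-hot would explode the
# number of columns; frequency (count) encoding is one common alternative
df["pincode_count"] = df["pincode"].map(df["pincode"].value_counts())
print(df["pincode_count"].tolist())  # [2, 1, 2, 1]
```

Target encoding and hashing are other options for high-cardinality features; the right choice depends on the model and the data.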
Always remember these steps; this is very, very important. Let me just check whether I have missed anything: scaling, categorical features, outlier treatment... everything is mentioned here very clearly. Okay, so these are most of the steps that we do in feature engineering. So what has happened now? Let me talk about why feature engineering is important. The raw data will have so many problems: it may be in JSON format, it may not have proper features, it may not be in the proper format; there may be many such issues. After this entire process of feature engineering, you will have clean data, and this clean data will now be given to your ML models for the training purpose. When you have clean data and you give it to your model for training, obviously your model is going to give you better results. There is one more step after feature engineering, which is called feature selection. Feature selection is pretty simple, guys: we select only those features that are important. Let me tell you, if there are a thousand features in your dataset, it is not necessary that all thousand features are required.
If you have that many features, there is also a term called the curse of dimensionality, which usually happens when you have many, many features; it is literally a curse, so we should take only those features that are very important. Now, in feature selection, what steps do we actually perform? Let me write them down. We use various techniques: you have correlation; you can use k-nearest neighbors for the feature selection purpose; you have the chi-square test; you have genetic algorithms for doing this; and you have something called feature importance, which internally uses an Extra Trees classifier, so here, specifically, you use something called the ExtraTreesClassifier. All these techniques are used for selecting the best features, and I have uploaded videos on these as well. Now, this is a very important part, and again, if you have any confusion with respect to anything, just open the YouTube channel and go to these two playlists. One is the Krish EDA playlist; here is your exploratory data analysis playlist.
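Before the playlist pointers, here is a brief sketch of three of the selection techniques listed above, using scikit-learn on its built-in iris data (scikit-learn is assumed to be available): correlation, the chi-square test, and Extra-Trees feature importance:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 1. Correlation between features: drop one of any highly correlated pair
corr = X.corr().abs()
print(corr.round(2))

# 2. Chi-square test: keep the k features most dependent on the target
selector = SelectKBest(chi2, k=2).fit(X, y)
print(X.columns[selector.get_support()].tolist())

# 3. Feature importance via an Extra Trees classifier
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Each method ranks features differently, so in practice people often compare a couple of them before dropping anything.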
You can go and have a look at it; I have explained everything in the same steps. If you go and check out that entire playlist, it follows the same order: EDA, then feature engineering, and then feature selection. I have also explained the automated EDA part, so it will be very easy. That is one playlist, and the other playlist is basically about feature engineering. These two are a must, trust me, because this takes 30 percent of your time. There, all the different types of feature engineering are covered: how do we handle categorical features, how do we handle missing values; three to four days' worth of content on handling missing values alone has been explained. What is standardization, what is transformation, handling missing data, and even outliers, all these things have been explained. So my suggestion would be: go ahead and have a look at these. And yes, if you like this particular video, please do make sure that you subscribe to the channel and press the bell notification icon. But understand, feature engineering is a very important step altogether. I'll see you all in the next video; have a great day.