Data Preprocessing


  • Input features are the features that an algorithm learns from in order to predict the output
  • The output feature (also called the label or target) is what is being predicted
  • Sampling means randomly selecting a certain percentage of data
  • The training set is the data used to build the model
  • The testing set (sometimes loosely called the validation set) is the data used to evaluate the performance of the model trained on the training set
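
The sampling and training/testing terms above can be illustrated with a minimal sketch that uses only the Python standard library. The helper names `sample_fraction` and `train_test_split`, the 10% and 20% fractions, and the fixed seed are assumptions chosen for illustration; they do not come from any particular library.

```python
import random

def sample_fraction(rows, fraction, seed=0):
    """Randomly select roughly `fraction` of the rows, without replacement."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then cut it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]                 # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))                # stand-in for 100 data records
subset = sample_fraction(rows, 0.10)   # a 10% random sample
train, test = train_test_split(rows, test_fraction=0.2)
print(len(subset), len(train), len(test))
```

Every record lands in exactly one of the two sets, so the model is never evaluated on data it was trained on.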

Feature Analysis

Feature analysis determines which input features contribute significantly to predicting the output feature. It is done so that irrelevant features can be excluded from consideration.
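
One simple way to approach feature analysis is to score each input feature by the absolute value of its Pearson correlation with the output feature. The sketch below assumes numeric features, uses made-up toy data, and applies an arbitrary cut-off of 0.5. Correlation only captures linear relationships, so this is one possible technique rather than the definitive one.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                     # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): "size" drives the target, "noise" does not.
features = {
    "size": [50, 70, 90, 110, 130],
    "noise": [3, 1, 4, 1, 5],
}
target = [100, 140, 180, 220, 260]

scores = {name: abs(pearson(col, target)) for name, col in features.items()}
THRESHOLD = 0.5                        # assumed cut-off, tune per problem
relevant = [name for name, score in scores.items() if score >= THRESHOLD]
print(scores, relevant)
```

Here only `size` passes the cut-off, so `noise` would be excluded from consideration.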

Data Preprocessing

  • Done to fix data quality issues
  • Transform data so that it is easier for algorithms to learn from, e.g. scale features so they fall within a smaller range: monetary amounts from $100 to $100,000 can be rescaled to fall between 0.0 and 1.0.
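
The scaling described above is commonly done with min-max scaling, which maps the smallest value to 0.0 and the largest to 1.0. The following is a minimal sketch; the function name `min_max_scale` and the sample amounts are assumptions for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # all values identical: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

amounts = [100, 25_075, 50_050, 100_000]   # dollars, between $100 and $100,000
scaled = min_max_scale(amounts)
print(scaled)
```

In practice the minimum and maximum should be computed on the training set only and then reused to scale the testing set, so that no information from the testing set leaks into training.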

Define Dataset

  • Sample a large portion of the data (the training set) to build a model, and use the remaining smaller portion (the testing set) to test that model.
  • Data is split to evaluate the model’s performance. A good model performs well on unseen data.
  • A more reliable way to split data is to use a resampling technique such as k-fold cross-validation, i.e. partition the data into k different sets, e.g. 10. Use sets 1 to 9 to train and set 10 to test. Then use sets 1 to 8 and 10 to train, and set 9 to test. Repeat the process until every set has been used once for testing.
  • Repeated training and testing split – create multiple random splits of the data into training and testing sets and evaluate the model on each.
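
The k-fold procedure described above can be sketched with the standard library alone. The generator name `k_fold_splits` and the fixed seed are assumptions; the sketch yields row indices rather than rows, and leaves model training and scoring to the caller.

```python
import random

def k_fold_splits(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)         # shuffle once, up front
    folds = [indices[i::k] for i in range(k)]    # k roughly equal partitions
    for i in range(k):
        test_idx = folds[i]
        # Training data is every fold except the one held out for testing.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(100, k=10))
for train_idx, test_idx in splits:
    # A model would be trained on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))
```

Each row is used for testing exactly once, and the k scores are typically averaged to give a single performance estimate. A repeated training and testing split differs in that each split is drawn independently, so a row may appear in several testing sets or in none.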

This completes all data engineering steps. Now let’s build models!