Let’s get to know the language of data before we start analyzing it. As the illustration above shows, the attributes of a dataset can be categorical (also called discrete) or numeric (continuous). Binary attributes have only two categories. Nominal attributes have more than two categories.
Assume that the modeler starts with this data:
ID | Time of day | Temperature |
1 | Morning | 9.65 deg |
2 | Evening | 18.0 deg |
3 | Afternoon | 22.5 deg |
4 | Evening | 2.73 deg |
5 | Evening | 10 deg |
6 | Evening | 9.65 deg |
7 | | 18.0 deg |
8 | Afternoon | 15.9 deg |
9 | | 2.73 deg |
10 | | 21.1 deg |
The following could be done to explore and analyze it:
- Distribution of features – e.g. 10% morning, 40% evening, 20% afternoon, 30% not available
- Relationship between features – e.g. the correlation between Time of Day and Temperature
- Descriptive statistics – e.g. 40% of the temperature readings were recorded in the evening
- Missing values – identify features with missing values and think of how to treat them
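The exploration steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the variable names are made up, missing Time of day values are represented as `None`, and the relationship between the two features is summarized with a per-category mean temperature rather than a formal correlation coefficient.

```python
from collections import Counter
from statistics import mean

# Rows from the table above: (ID, time of day, temperature in degrees).
# None marks a missing Time of day value (an assumed representation).
rows = [
    (1, "Morning", 9.65),
    (2, "Evening", 18.0),
    (3, "Afternoon", 22.5),
    (4, "Evening", 2.73),
    (5, "Evening", 10.0),
    (6, "Evening", 9.65),
    (7, None, 18.0),
    (8, "Afternoon", 15.9),
    (9, None, 2.73),
    (10, None, 21.1),
]

# Distribution of the categorical feature, as a share of all rows.
counts = Counter(tod if tod is not None else "Not available" for _, tod, _ in rows)
distribution = {label: n / len(rows) for label, n in counts.items()}

# Missing values per feature.
missing = {
    "Time of day": sum(tod is None for _, tod, _ in rows),
    "Temperature": sum(temp is None for _, _, temp in rows),
}

# Relationship between features: mean temperature per time of day,
# computed only over rows where Time of day is known.
by_tod = {}
for _, tod, temp in rows:
    if tod is not None:
        by_tod.setdefault(tod, []).append(temp)
mean_temp = {tod: mean(temps) for tod, temps in by_tod.items()}

print(distribution)
print(missing)
print(mean_temp)
```

The distribution reproduces the shares quoted in the list (10% morning, 40% evening, 20% afternoon, 30% not available), and the missing-value count shows that only Time of day needs treatment in this dataset.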
The next step is to wrangle and transform data.
Using the table below as an example, the modeler can take a few steps to make the data more suitable for modeling.
ID | Time of Day | ToD | Temperature | Temp0 | Temp1 |
1 | Morning | 1 | 9.65 deg | 9 | 0.65 |
2 | Evening | 3 | 18.0 deg | 18 | 0 |
3 | Afternoon | 2 | 22.5 deg | 22 | 0.5 |
4 | Evening | 3 | 2.73 deg | 2 | 0.73 |
5 | Evening | 3 | 10.0 deg | 10 | 0 |
6 | Evening | 3 | 9.65 deg | 9 | 0.65 |
7 | | | 18.0 deg | 18 | 0.0 |
8 | Afternoon | 2 | 15.9 deg | 15 | 0.9 |
9 | | | 2.73 deg | 2 | 0.73 |
10 | | | 21.1 deg | 21 | 0.1 |
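The derived columns in the table above (ToD, Temp0, Temp1) could be produced with something like the following sketch. The function names and the `TOD_CODES` mapping are hypothetical, and the code assumes readings are non-negative strings of the form `"9.65 deg"`.

```python
# Hypothetical mapping, taken from the ToD column in the table above.
TOD_CODES = {"Morning": 1, "Afternoon": 2, "Evening": 3}


def encode_time_of_day(tod):
    """Return the integer code for a Time of Day label, or None if it is missing."""
    if not tod:
        return None
    return TOD_CODES.get(tod)


def split_temperature(text):
    """Split a reading such as '9.65 deg' into (whole part, fractional part).

    Assumes a non-negative reading; the sign of the fractional part would be
    ambiguous for negative temperatures.
    """
    value = text.replace("deg", "").strip()
    whole, _, frac = value.partition(".")
    fractional = float("0." + frac) if frac else 0.0
    return int(whole), fractional


print(encode_time_of_day("Evening"))
print(split_temperature("9.65 deg"))
```

Splitting the text at the "." instead of subtracting the whole part from the float avoids small floating-point residue in the fractional feature.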
- Transform data – convert data into a different format, e.g. assign numbers to the different Time of Day values: morning 1, afternoon 2, evening 3.
- Generate new features – e.g. split temperature into two features at the “.”, so that 9.65 becomes 9 in one feature and 0.65 in another.
- Treat missing values – e.g. delete rows with missing values, fill missing values manually, fill them with a constant, fill them with the mean or median of the feature, or use the most probable value (e.g. for ID 9 the Time of Day looks like afternoon). Note that if your data, especially the target feature, has missing values, you may encounter an error when building a model.
- Check for imbalanced data – e.g. in identifying credit fraud, only 10% of the data are fraudulent transactions, while 90% of the data is good. 10% vs 90% is imbalanced, whereas 40% vs 60% is balanced. Here is how to deal with imbalanced data:
  - If you have a lot of data, sample about 15% to 20% of the good transactions in the example above. If the imbalance is caused by missing data, we could replace the missing values with the average.
  - If the feature is not important to the problem at hand, we could delete it completely.
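Two of the steps above, filling missing values and sampling down the majority class, can be sketched as follows. The helper names `fill_missing` and `undersample` are hypothetical, the transaction data is made up purely to mirror the 10% vs 90% example, and the fixed random seed is only there to make the sketch repeatable.

```python
import random
from statistics import median, mode


def fill_missing(values, strategy="mode"):
    """Replace None entries with the mode or the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else median(observed)
    return [fill if v is None else v for v in values]


def undersample(rows, label_of, majority_label, fraction, seed=0):
    """Keep every minority row and a random `fraction` of the majority rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if label_of(r) == majority_label]
    minority = [r for r in rows if label_of(r) != majority_label]
    keep = rng.sample(majority, round(len(majority) * fraction))
    return minority + keep


# Time of Day column from the table, with None for the missing entries.
times = ["Morning", "Evening", "Afternoon", "Evening", "Evening",
         "Evening", None, "Afternoon", None, None]
filled = fill_missing(times)  # uses the most frequent label

# Made-up transactions: 10 fraudulent and 90 good.
transactions = [("fraud", i) for i in range(10)] + [("good", i) for i in range(90)]
balanced = undersample(transactions, lambda r: r[0], "good", fraction=0.2)

print(filled)
print(len(balanced))
```

For a numeric feature, `fill_missing(values, strategy="median")` fills with the median instead. Keeping 20% of the good transactions leaves 18 good rows next to the 10 fraudulent ones, which is much closer to the 40% vs 60% split described as balanced.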
Completing these steps will prepare your data for the next step, i.e. Data Preprocessing.