Data Analysis

Get to know your data

Let’s get to know the language of data before we start analyzing it. The above illustration shows that the attributes of the data can be categorical (also called discrete) or numeric (continuous). Among categorical attributes, binary attributes have exactly two categories, while nominal attributes have more than two.
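As a rough illustration of these terms, the sketch below labels a column as numeric, binary, or nominal using the definitions in this section. The function name `attribute_type` and the rule itself are assumptions made for this example; real data usually needs more careful handling (e.g. numeric IDs or ordered categories).

```python
def attribute_type(values):
    """Label a column as 'numeric', 'binary', or 'nominal' (illustrative rule only)."""
    # Ignore missing entries, represented here as None.
    present = [v for v in values if v is not None]
    # Treat a column as numeric when every present value is an int or float.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    # Otherwise it is categorical: two distinct categories means binary.
    distinct = set(present)
    return "binary" if len(distinct) == 2 else "nominal"


print(attribute_type([9.65, 18.0, 22.5]))
print(attribute_type(["yes", "no", "yes"]))
print(attribute_type(["Morning", "Evening", "Afternoon", None]))
```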

Assume that the modeler starts with this data:

ID   Time of day   Temperature
1    Morning       9.65 deg
2    Evening       18.0 deg
3    Afternoon     22.5 deg
4    Evening       2.73 deg
5    Evening       10 deg
6    Evening       9.65 deg
7                  18.0 deg
8    Afternoon     15.9 deg
9                  2.73 deg
10                 21.1 deg

The following could be done to explore and analyze it:

  • Distribution of features – e.g. 10% morning, 40% evening, 20% afternoon, 30% not available
  • Relationship between features – correlation between Time of Day and Temperature
  • Descriptive statistics – e.g. 40% of the temperature readings were recorded in the evening
  • Missing values – identify features with missing values and think of how to treat them
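The exploration steps above can be sketched with the Python standard library. The `rows` list simply re-enters the example data, with `None` standing in for a missing Time of day. The variable names are assumptions for illustration, and the mean temperature per Time of day is used here only as a simple stand-in for a formal correlation measure.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# The ten example rows as (ID, Time of day, Temperature); None marks a missing value.
rows = [
    (1, "Morning", 9.65), (2, "Evening", 18.0), (3, "Afternoon", 22.5),
    (4, "Evening", 2.73), (5, "Evening", 10.0), (6, "Evening", 9.65),
    (7, None, 18.0), (8, "Afternoon", 15.9), (9, None, 2.73), (10, None, 21.1),
]

# Distribution of a categorical feature, as a percentage of all rows.
counts = Counter(tod if tod is not None else "not available" for _, tod, _ in rows)
distribution = {label: 100 * n / len(rows) for label, n in counts.items()}

# Missing values: IDs of the rows that have no Time of day.
missing_ids = [row_id for row_id, tod, _ in rows if tod is None]

# Descriptive statistics for the numeric feature.
temps = [temp for _, _, temp in rows]
summary = {"min": min(temps), "max": max(temps),
           "mean": mean(temps), "median": median(temps)}

# Relationship between features: mean temperature per Time of day.
by_tod = defaultdict(list)
for _, tod, temp in rows:
    if tod is not None:
        by_tod[tod].append(temp)
mean_by_tod = {tod: mean(values) for tod, values in by_tod.items()}

print(distribution)
print(missing_ids)
print(summary)
print(mean_by_tod)
```

The computed distribution matches the percentages quoted in the first bullet, and `missing_ids` lists the rows (7, 9 and 10) that need treatment later.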

The next step is to wrangle and transform data.

Using the table below as an example, the modeler can take a few steps to make the data more suitable for modeling.

ID   Time of Day   ToD   Temperature   Temp0   Temp1
1    Morning       1     9.65 deg      9       0.65
2    Evening       3     18.0 deg      18      0
3    Afternoon     2     22.5 deg      22      0.5
4    Evening       3     2.73 deg      2       0.73
5    Evening       3     10.0 deg      10      0
6    Evening       3     9.65 deg      9       0.65
7                        18.0 deg      18      0.0
8    Afternoon     2     15.9 deg      15      0.9
9                        2.73 deg      2       0.73
10                       21.1 deg      21      0.1

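The ToD, Temp0 and Temp1 columns in the table above could be derived along the following lines. This is a minimal sketch: the names `TOD_CODES`, `encode_time_of_day` and `split_temperature` are hypothetical, and the parsing assumes non-negative readings written as a number followed by " deg".

```python
# Mapping taken from the example: morning 1, afternoon 2, evening 3.
TOD_CODES = {"Morning": 1, "Afternoon": 2, "Evening": 3}


def encode_time_of_day(label):
    """Return the numeric code for a Time of Day label, or None if it is missing."""
    return TOD_CODES.get(label) if label else None


def split_temperature(text):
    """Split a reading such as '9.65 deg' into (whole part, fractional part)."""
    number = text.split()[0]                      # drop the ' deg' unit
    whole, _, fraction = number.partition(".")    # split the digits at the '.'
    fractional = float("0." + fraction) if fraction else 0.0
    return int(whole), fractional


sample = [("Morning", "9.65 deg"), ("Evening", "18.0 deg"), (None, "2.73 deg")]
for label, reading in sample:
    print(encode_time_of_day(label), *split_temperature(reading))
```

Splitting on the text rather than subtracting floats keeps the fractional part exactly as written (0.65 rather than a value with floating-point rounding noise).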
  • Transform data – convert data into a different format, e.g. assign numbers to the Time of Day values: morning 1, afternoon 2, evening 3.
  • Generate new features – e.g. split Temperature into two features at the “.”, so that 9.65 becomes 9 in one feature and 0.65 in another.
  • Treat missing values – e.g. delete rows with missing values, fill them in manually, fill them with a constant, fill them with the mean or median of the feature, or use the most probable value (ID 9 has the same temperature as ID 4, which was recorded in the evening, so evening is a plausible fill). Note that if your data, especially the target feature, has missing values, you may encounter an error when building a model.
  • Check for imbalanced data – e.g. in credit fraud detection, only 10% of the data may be fraudulent transactions while the other 90% is legitimate. A 10% vs 90% split is imbalanced, whereas 40% vs 60% is reasonably balanced. Here is how to deal with imbalanced data:
      • If you have a lot of data, keep only a sample of about 15% to 20% of the legitimate transactions in the example above (undersampling the majority class). If the imbalance is caused by missing data, we could replace the missing values with the average.
      • If the feature is not important to the problem at hand, we could delete it completely.
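Two of the steps above, filling missing values and undersampling the majority class, can be sketched as follows. The helper names (`fill_with_mode`, `fill_with_median`, `undersample`) and the fraud data are made up for illustration, and the helpers assume each column has at least one non-missing value.

```python
import random
from collections import Counter
from statistics import median


def fill_with_mode(values):
    """Replace None with the most frequent non-missing value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def fill_with_median(values):
    """Replace None with the median of the non-missing values."""
    mid = median(v for v in values if v is not None)
    return [mid if v is None else v for v in values]


def undersample(majority, fraction, seed=0):
    """Keep a random fraction of the majority-class rows (seeded for repeatability)."""
    rng = random.Random(seed)
    keep = max(1, round(len(majority) * fraction))
    return rng.sample(majority, keep)


# Time of Day column from the example, with None for the three missing rows.
times = ["Morning", "Evening", "Afternoon", "Evening", "Evening",
         "Evening", None, "Afternoon", None, None]
print(fill_with_mode(times))

# Hypothetical fraud data: 10 fraudulent and 90 legitimate transaction IDs.
fraud = list(range(10))
legit = list(range(10, 100))
balanced = fraud + undersample(legit, fraction=0.2)
print(len(fraud), len(balanced) - len(fraud))  # 10 fraudulent vs 18 legitimate
```

Keeping 20% of the 90 legitimate rows leaves 18 of them against 10 fraudulent rows, which is close to the 40% vs 60% split described above as balanced.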

Completing these steps prepares your data for the next step, i.e. Data Preprocessing.