In this step, the modeler collects and catalogs the data believed to be relevant to solving the problem.
Data comes in many formats and representations, and any type of data can be mined as long as it is meaningful for the target application. Common types include:
- Tabular data, e.g. our walking model data
- Multimedia data, e.g. images, videos, and audio; analyzing images and video is the domain of computer vision
- Text data, e.g. text from documents or chats; analyzing it is the domain of natural language processing
- Time series data, e.g. weather or stock data, where observations are ordered in time
As an example, to predict whether you would need a jacket, a modeler would organize the data in a tabular format, as follows:
| ID | CoverUp | Weather | Time | Time of day | Temperature |
|----|---------|---------|------|-------------|-------------|
| 1 | Jacket | Sunny | 6:00 am | Morning | 9.65 deg |
| 2 | No jacket | Not sunny | 5:45 pm | Evening | 18.0 deg |
| 3 | No jacket | Sunny | 12:15 pm | Afternoon | 22.5 deg |
| 4 | Jacket | Not sunny | 9:56 pm | Evening | 2.73 deg |
Features (also known as attributes or variables) represent the characteristics of the data and correspond to the columns of the table. The rows are called observations or records.
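To make the terms concrete, here is a minimal sketch of the jacket table in plain Python. It is only one possible representation: each row becomes a dictionary (one observation) and the dictionary keys play the role of feature names. The variable names `observations`, `features`, `target`, and `inputs` are illustrative, and treating `ID` as a mere row label rather than a predictor is an assumption, not something the table states.

```python
# Each dict is one observation (row); the keys are the column names.
# Temperatures are stored as plain numbers; the table gives them in "deg"
# without naming the scale.
observations = [
    {"ID": 1, "CoverUp": "Jacket", "Weather": "Sunny",
     "Time": "6:00 am", "Time of day": "Morning", "Temperature": 9.65},
    {"ID": 2, "CoverUp": "No jacket", "Weather": "Not sunny",
     "Time": "5:45 pm", "Time of day": "Evening", "Temperature": 18.0},
    {"ID": 3, "CoverUp": "No jacket", "Weather": "Sunny",
     "Time": "12:15 pm", "Time of day": "Afternoon", "Temperature": 22.5},
    {"ID": 4, "CoverUp": "Jacket", "Weather": "Not sunny",
     "Time": "9:56 pm", "Time of day": "Evening", "Temperature": 2.73},
]

# The features are the column names, read here from the first row.
features = list(observations[0].keys())

# CoverUp is the value we want to predict. Assumption: ID only labels
# the row, so it is left out of the predictor columns.
target = "CoverUp"
inputs = [name for name in features if name not in ("ID", target)]

print("Features:", features)
print("Number of observations:", len(observations))
print("Columns used to predict", target + ":", inputs)
```

In practice a modeler would more likely load such a table with a dedicated data library, but the vocabulary stays the same: columns are features and rows are observations.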
Next, you would analyze this data.