Data Collection

In this step, the modeler collects and catalogues the data believed to be relevant to solving the problem.

Data may be in many formats and representations. Any data type can be mined as long as it is meaningful for a target application.

  • Tabular data, e.g. our walking model data
  • Multimedia data, e.g. images, videos, and audio; the analysis of images and videos is known as computer vision
  • Text data, e.g. text from documents, chats, etc.; its analysis is known as natural language processing
  • Time series data, e.g. weather data, stock data, and other time-related or sequence data

For example, to predict whether you need a jacket, a modeler would organize the data in a tabular format, as shown below:

ID  CoverUp    Weather    Time      Time of day  Temperature
1   Jacket     Sunny      6:00 am   Morning      9.65 deg
2   No jacket  Not sunny  5:45 pm   Evening      18.0 deg
3   No jacket  Sunny      12:15 pm  Afternoon    22.5 deg
4   Jacket     Not sunny  9:56 pm   Evening      2.73 deg

Features (also known as attributes or variables) represent the characteristics of the data; in a table, they are the columns, such as Weather or Temperature. The rows of the data are called observations or records.
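
To make the column/row vocabulary concrete, the jacket table can be loaded with Python's standard csv module. This is only a minimal sketch: the CSV layout, the variable names, and the choice to treat CoverUp as the value to predict (and ID as a plain row label) are assumptions made for illustration, not something the text prescribes.

```python
import csv
import io

# The jacket table from the text, written as CSV so it can be parsed
# with the standard library alone. The "deg" unit is dropped so the
# temperature can be read as a number.
RAW = """ID,CoverUp,Weather,Time,Time of day,Temperature
1,Jacket,Sunny,6:00 am,Morning,9.65
2,No jacket,Not sunny,5:45 pm,Evening,18.0
3,No jacket,Sunny,12:15 pm,Afternoon,22.5
4,Jacket,Not sunny,9:56 pm,Evening,2.73
"""

reader = csv.DictReader(io.StringIO(RAW))
observations = list(reader)   # one dict per row, i.e. per observation
columns = reader.fieldnames   # the column names

# Assumption: CoverUp is what we want to predict and ID only labels rows,
# so the remaining columns are the features used as inputs.
TARGET = "CoverUp"
features = [c for c in columns if c not in ("ID", TARGET)]

print(len(observations), "observations")
print("features:", features)
for row in observations:
    print(row["ID"], row[TARGET], float(row["Temperature"]))
```

Each dictionary in observations corresponds to one row of the table, and each key corresponds to one column.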

Next, you would analyze this data.