The Mighty Iris Dataset

I started immersing myself in the world of Artificial Intelligence through the TDSB Experiential Learning Placement internship with Braintoy. My journey to learn Machine Learning started by experimenting with the Iris dataset. This article documents the experiments that I did with it and what I learnt.

Background

The Iris dataset was introduced by Ronald Fisher, a eugenicist, biologist and statistician in the early 20th century. Dr. Fisher contributed heavily to the field of statistics by developing the analysis of variance, promoting the method of derivation of various sampling distributions and much more. He introduced the Iris dataset in his paper “The use of multiple measurements in taxonomic problems” in 1936. In this paper, Fisher created and evaluated a linear function to differentiate Iris flowers based on their unique attributes.

The data in Fisher’s paper was collected by Edgar Anderson, a botanist, to quantify the morphologic variation of Iris flowers of three related species. There are 150 observations in total, 50 samples each of three species of the Iris flower: Setosa, Virginica and Versicolor, with 4 measured attributes: Sepal Length, Sepal Width, Petal Length, Petal Width.

With Anderson’s collected data and Ronald Fisher’s linear function, they created one of the most famous datasets of all time. The Iris dataset is now used all over the world, and I doubt there is even one data scientist who has not used it. It has few variables and a manageable number of observations, it is computationally cheap, and it is easy to visualize, all of which make it efficient to experiment with. Additionally, the data is openly available, its accuracy and origin are known, and because there are three classes, it allows for more than just binary classification. All these factors and more contribute to the popularity of the Iris dataset.

It is the versatility of this dataset that makes it so popular. In the 1930s, its purpose was to identify the species of an Iris flower based on its morphology. Today, this dataset is used to practice data science, as it can be used in supervised and unsupervised machine learning, in benchmarking techniques, and in solving both classification and regression problems.

My experiments

We use Artificial Intelligence every day when using Google Maps, Facebook, Uber Eats, Netflix, or Amazon, when we apply for a credit card, or when using autocomplete while writing an email. Just as humans learn from previous experience to make future decisions, computers learn from historical data to build models that can be used to make future decisions. The technique of teaching a machine to build a model is called machine learning.

Humans are good at recognizing patterns from data. But they can only analyze a small amount of data at a time. And humans also forget. But unlike humans, computers can analyze large quantities of data. Computers also do not forget. If a computer can be taught to recognize patterns from data, it is able to learn and improve its decisions as new and more data comes in.

I practiced machine learning by using the Iris dataset and Braintoy mlOS. The standard CRISP-DM method is used worldwide for data mining. In mlOS, it starts with data collection, data analysis, data preprocessing, then model building, model evaluation, model governance, and finally model deployment.

I learnt two things at the very onset:

  • A dataset is a collection of data that ML algorithms use to train a model. Iris is a dataset that contains 150 records collected by Anderson and published by Fisher.
  • There are regression problems and then there are classification problems. In a classification problem, a categorical variable is to be predicted. In a regression problem, a quantitative variable is being predicted. A classification problem would be if the species of the Iris flower (a category) was to be predicted while a regression problem would be if the Petal Length (a quantity) was to be predicted. 

I therefore first used the Iris dataset to classify the species of the Iris flower by building a classification model. I then used the same dataset to predict the Petal Length of an Iris flower by building a regression model.

Figure 1: Predict species (Classification)

Figure 2: Predict Petal Length (Regression)

Learning the ins and outs of the Iris dataset, and seeing how versatile it is across several ML algorithms in both classification and regression modeling, showed me how machine learning could work in real life.

DATA ENGINE

The first step in any modeling is to ingest raw data, analyze it, and clean it for modeling. 

The raw data was uploaded in mlOS using the Upload Data tab within the Data Engine. It showed up in Raw Data when uploaded. The contents could be seen by clicking the View icon. 

Figure 3: Generated data containing the values of the Iris flower and attributes

There were 150 records, with 50 records for each species of the Iris flower.

Figure 4: Column Details shown about Iris species data

The data was observed to be balanced across the 3 categories of species. Each feature in the raw data was observed to be well distributed. No missing values were seen.
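For readers who want to repeat these checks outside mlOS, here is a minimal sketch in plain Python. It is not how mlOS computes its Column Details. The six rows are a small hand-copied excerpt of the Iris data (the full set has 150 records), and the function names are made up for this illustration.

```python
from collections import Counter

# A few rows hand-copied from the Iris data for illustration (the full set has 150).
# Columns: sepal length, sepal width, petal length, petal width (cm), species.
rows = [
    (5.1, 3.5, 1.4, 0.2, "Setosa"),
    (4.9, 3.0, 1.4, 0.2, "Setosa"),
    (7.0, 3.2, 4.7, 1.4, "Versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Virginica"),
    (5.8, 2.7, 5.1, 1.9, "Virginica"),
]


def class_counts(records):
    """Count how many records belong to each species (the last column)."""
    return Counter(record[-1] for record in records)


def count_missing(records):
    """Count cells that are None or an empty string."""
    return sum(
        1 for record in records for cell in record if cell is None or cell == ""
    )


def is_balanced(records):
    """True when every class has the same number of records."""
    return len(set(class_counts(records).values())) == 1


print(class_counts(rows))
print("missing cells:", count_missing(rows))
print("balanced:", is_balanced(rows))
```

On the full dataset, the same checks should report 50 records per species and no missing cells.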

I learnt that while distributions can be rebalanced many times when making newer versions of models, checking for and treating missing values is vital during data engineering because machine learning algorithms find it hard to compute with missing values in the data!

Analyze Features showed the relationships between the features of the raw data.

Figure 5: Relationship between the species of flower and its four attributes

I could see how the features of the Iris flower – Sepal Length, Sepal Width, Petal Length, and Petal Width – are related to the three categories of species – Setosa, Versicolor, and Virginica.

The next step in modeling is Data Wrangling. This is a method of selecting and converting data into formats that answer an analytical question.

Because this dataset was well categorized and well balanced, with each category having 50 records and no missing values, no Data Wrangling was necessary. The raw data itself could therefore be directly used to create datasets for classification modeling (to predict the category of Species) or regression modeling (to predict the Petal Length).

Figure 6: Data wrangling operations for the species of Iris flower

I learnt that the Iris dataset makes learning fast because people do not spend time repeatedly cleaning and balancing data and can instead proceed directly to building datasets for machine learning and testing the performance of models. My true learning is that cleaning and balancing raw data is vital to model performance. Using the Iris dataset gave me a good idea of what good data looks like so that I now know what bad data is!

The next step in modeling was to define a dataset to prepare the data for machine learning for either classification or regression modeling.

Defining a dataset has four intermediary steps. The first is to select the Training and Target Features, the second is Feature Pre-processing, the third is to give the dataset a name to Define Dataset, and the last is to split it into training/test sets as Cross Validation Datasets.

Input and Target Features

A target variable is what allows the machine learning algorithm to know what to predict. The input variables are what will be used to predict it. 

In the Iris classification problem, the input variables would be the Sepal Length, Sepal Width, Petal Length and Petal Width, and the target variable would be the category of the Species of the Iris flower.
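As a rough illustration of this step outside mlOS, the sketch below separates the same records into inputs and a target in two ways: once with Species as the target (classification) and once with Petal Length as the target (regression). The column names and the helper `split_inputs_target` are assumptions made for this example, and the three records are a hand-copied excerpt of the Iris data.

```python
# Three records hand-copied from the Iris data; the keys are illustrative names.
records = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4,
     "petal_width": 0.2, "species": "Setosa"},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7,
     "petal_width": 1.4, "species": "Versicolor"},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0,
     "petal_width": 2.5, "species": "Virginica"},
]


def split_inputs_target(rows, target):
    """Return (inputs, targets): every column except `target`, and `target` itself."""
    inputs = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return inputs, targets


# Classification: predict the species from the four measurements.
X_cls, y_cls = split_inputs_target(records, "species")

# Regression: predict Petal Length from the species and the other three measurements.
X_reg, y_reg = split_inputs_target(records, "petal_length")

print(y_cls)
print(y_reg)
```

The only thing that changes between the two problems is which column is treated as the target; the raw data stays the same.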

Figure 7: Select species as target variable for classification modeling

Figure 8: Select the Petal Length as target variable for regression modeling

Feature Pre-processing

Feature preprocessing transforms data into data that can be used by a machine learning algorithm for modeling. This step is the same for either classification or regression modeling.

There are many pre-processing algorithms available. Because machine learning algorithms work with numbers, the categories of species of the Iris flower were converted to numbers. Setosa became 1, Virginica became 2, and Versicolor became 3. This is called categorical to numerical refactoring, and mlOS recommended it automatically.
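The conversion can be sketched in a few lines of plain Python. The mapping below uses the same codes as described above; the names `SPECIES_TO_CODE`, `encode`, and `decode` are chosen for this illustration and are not part of mlOS.

```python
# Codes as described in the text: Setosa = 1, Virginica = 2, Versicolor = 3.
SPECIES_TO_CODE = {"Setosa": 1, "Virginica": 2, "Versicolor": 3}

# The reverse mapping turns a model's numeric prediction back into a species name.
CODE_TO_SPECIES = {code: name for name, code in SPECIES_TO_CODE.items()}


def encode(labels):
    """Convert species names to their numeric codes."""
    return [SPECIES_TO_CODE[label] for label in labels]


def decode(codes):
    """Convert numeric codes back to species names."""
    return [CODE_TO_SPECIES[code] for code in codes]


labels = ["Setosa", "Versicolor", "Virginica", "Setosa"]
codes = encode(labels)
print(codes)
print(decode(codes))
```

Keeping the reverse mapping around matters at prediction time, because the model outputs a code and the user expects a species name.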

Figure 9: Iris species being converted into numerical values 

Define Dataset (naming a dataset)

Once feature preprocessing is complete, a dataset is defined by naming it. 

Figure 10: The dataset is given a name

Cross Validation Datasets

After the dataset is defined, the next step is to split it into training and test sets, called Cross Validation Datasets. Splitting it 80% and 20% is a good rule of thumb when the number of records is not an issue.

The larger portion is used to “train” a machine learning algorithm. The smaller portion, aka the “test” set, is hidden. Only after the models are trained is the test set used to “test how well the models have trained”.
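A minimal sketch of an 80/20 split in plain Python is shown below. It is a simple seeded random shuffle, which is an assumption for this example; the text does not say how mlOS performs the split, and this version does not guarantee that each species keeps the same proportion in both sets.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out `test_fraction` of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-ins for the 150 Iris records; only their positions matter here.
records = list(range(150))
train, test = train_test_split(records)
print(len(train), len(test))  # 120 30
```

With 150 records this yields 120 for training and 30 for testing, and no record appears in both sets.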

Figure 11: Training set is set to 80% while the validation set has been set to 20%

ML ENGINE

Now that the data was prepared, the next step was to build models.

Figure 12: Classification modeling to predict the category of Species

Figure 13: Regression modeling to predict the Petal Length

Many versions of models were generated using various ML algorithms. For each version, the accuracy, the algorithm used, and the documentation could be viewed. Because of the simplicity of the dataset, the parameters did not need to be tuned; the default parameters showed good results.
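To give a feel for what building and checking a model involves, here is a small sketch in plain Python. It is not one of the algorithms mlOS ran. As simple stand-ins, it uses a nearest-centroid classifier to predict the species and a one-variable least-squares line to predict Petal Length from Petal Width. The rows are a hand-copied excerpt of the Iris data, and all function names are made up for this example.

```python
import math

# A few rows hand-copied from the Iris data for illustration only.
# Columns: sepal length, sepal width, petal length, petal width (cm), species.
TRAIN = [
    (5.1, 3.5, 1.4, 0.2, "Setosa"),
    (4.9, 3.0, 1.4, 0.2, "Setosa"),
    (4.7, 3.2, 1.3, 0.2, "Setosa"),
    (4.6, 3.1, 1.5, 0.2, "Setosa"),
    (7.0, 3.2, 4.7, 1.4, "Versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Versicolor"),
    (5.5, 2.3, 4.0, 1.3, "Versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Virginica"),
    (5.8, 2.7, 5.1, 1.9, "Virginica"),
    (7.1, 3.0, 5.9, 2.1, "Virginica"),
    (6.3, 2.9, 5.6, 1.8, "Virginica"),
]
TEST = [
    (5.0, 3.6, 1.4, 0.2, "Setosa"),
    (6.5, 2.8, 4.6, 1.5, "Versicolor"),
    (6.5, 3.0, 5.8, 2.2, "Virginica"),
]


def fit_centroids(rows):
    """Classification: average the four measurements for each species."""
    grouped = {}
    for *features, species in rows:
        grouped.setdefault(species, []).append(features)
    return {
        species: [sum(column) / len(column) for column in zip(*members)]
        for species, members in grouped.items()
    }


def predict_species(centroids, features):
    """Pick the species whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda s: math.dist(centroids[s], features))


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Train on TRAIN, then measure accuracy on the held-out TEST rows.
centroids = fit_centroids(TRAIN)
predictions = [predict_species(centroids, row[:4]) for row in TEST]
accuracy = sum(p == row[4] for p, row in zip(predictions, TEST)) / len(TEST)
print("predicted species:", predictions, "accuracy:", accuracy)

# Predict Petal Length (column index 2) from Petal Width (column index 3).
intercept, slope = fit_line([r[3] for r in TRAIN], [r[2] for r in TRAIN])
print("petal length at width 1.5:", round(intercept + slope * 1.5, 2))
```

The pattern is the same for any algorithm: fit on the training rows, predict on the hidden test rows, and compare the predictions with the known answers.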

Note that there is a feature in mlOS called AutoPilot that automatically does everything from data engineering to modeling. I did not use this feature initially because my goal was to learn machine learning step by step. Once I knew all the steps, I built models in a few minutes using AutoPilot. The results were similar.

The next step was to publish the most reliable versions of the models to a reviewer so that they could be evaluated and then accepted or rejected.

MODEL GOVERNANCE

Model Governance allows the model to be checked for inaccuracies and inefficiencies. The model should ideally be reviewed by someone who was not directly involved in its creation.

Figure 14: Workflow showing published models as how a model reviewer sees it

The reviewer checks the model for biases and proceeds to accept or reject the model. Once any model is accepted, it is time to move to Model Deployment, the final step.

MODEL DEPLOYMENT

Only when a model is integrated into a production environment can it take an input and produce an output in real time. The final step is to deploy the models as real-time APIs.

Figure 15: The status of the Iris app is shown along with the deployed model

The approved model was selected and deployed on the mlOS cluster in a few clicks. 

The API keys could then be generated and shared under Manage Apps.

Once the model was deployed, the application could also be downloaded as an HTML/JavaScript file on my desktop.

The classification application allows entering the values of the four attributes. Clicking the predict button shows the likely species of the Iris flower. The influential and non-influential inputs are shown as well.

Figure 16: The application outputs the correct classification of the Iris flower for the given inputs

The regression application allows entering the species name and three attributes. Clicking the Predict button shows the likely Petal Length. The influential and non-influential inputs are shown as well.

Figure 17: The app outputs the Petal Length of a species of the Iris flower for the given inputs

Both applications interact with the deployed model API, taking inputs and giving the predicted output in real-time.

DASHBOARD

Monitoring models post-deployment is important. It makes sure that the model is working as expected and that model drift does not go unnoticed. Model drift is when the data the model sees in production changes relative to the data it was trained on, causing the model’s accuracy to drop.
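One simple way to watch for this kind of change is to compare a feature's average in new data against its average in the training data. The sketch below is only an illustration and not how the mlOS Dashboard monitors models. The training values are Petal Lengths hand-copied from a few Iris rows, while the two new batches, the `drift_score` helper, and the threshold of 0.5 are all assumptions made for this example.

```python
from statistics import mean, stdev


def drift_score(train_values, new_values):
    """Shift of the new mean from the training mean, in training standard deviations."""
    return abs(mean(new_values) - mean(train_values)) / stdev(train_values)


# Petal Lengths (cm) from a few training rows of the Iris data.
train_petal_length = [1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6]

# Hypothetical batches seen after deployment.
similar_batch = [1.4, 1.7, 4.6, 4.5, 5.8, 6.6]
shifted_batch = [value + 3.0 for value in train_petal_length]  # e.g. a unit mix-up

THRESHOLD = 0.5  # arbitrary cut-off chosen for this sketch

for name, batch in [("similar", similar_batch), ("shifted", shifted_batch)]:
    score = drift_score(train_petal_length, batch)
    status = "possible drift" if score > THRESHOLD else "ok"
    print(f"{name}: score={score:.2f} -> {status}")
```

A real monitoring setup would track every input feature and the model's accuracy over time, but the idea is the same: flag incoming data that no longer looks like the training data.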

Dashboard is a utility for the modeler to interact with the deployed model and do real time prediction, data scoring, data analysis, variable importance, monitoring APIs, amongst many other tasks.

Figure 18: Real time predictions of the Iris app are being shown under dashboards

My apps were running. The number of successful API calls was shown.

Conclusion

Machine learning, a 21st-century tool, could be used to solve the problem that Dr. Fisher and Dr. Anderson pioneered in the early 20th century. I was happy to follow in the footsteps of such amazing pioneers.

I learnt from these experiments that machine learning could be a powerful tool for data mining.

The technique that I learnt by solving the Iris problem is no different from what a bank uses to decide how much credit to give to a customer. As an example, banks classify credit decisions as yes/no while approving a loan. It is no different from classifying the species of the Iris flower as Setosa, Virginica, or Versicolor. The bank also decides how large a loan can be given to a customer. It is no different from using regression to predict the Petal Length of the Iris flower.

Three Learnings

  1. Understanding AI starts with practicing the basics
  2. Completing a full modeling cycle is important before going deeper
  3. Data can be creatively used in many ways

Author

Shahaan Qureshi is a grade 11 co-op student whose interest in machine learning was sparked during his co-operative education placement with Braintoy, part of the TDSB Experiential Learning Placement, at the beginning of 2021. Shahaan now aspires to continue learning about Artificial Intelligence and pursue a career in the field of data science.