Why do shippers struggle with choosing the right carrier, best mode of transport, and keeping costs under control?
In an ideal world this sounds simple, but every shipping decision must be tailored by the buyer for the best outcome. The rapid change brought on by the COVID-19 pandemic, a sudden spike in e-commerce, and market volatility make such decisions challenging. In addition, the goal is sometimes not just cost effectiveness but meeting the delivery deadline to avoid lost revenue. Optimizing the freight cost and schedule takes an experienced buyer and many hours.
By using AI, this project identified direct cost savings of $72,450 per year for the company, besides saving ~1,000 hours of non-productive buyer time. ML can help SCM teams perform better, delivering easy cost savings and boosting efficiency by performing complex repetitive tasks at a scale and speed that are impossible to achieve manually.
Understanding the Business
Making a good shipping decision involves understanding variables such as rate sheets, lanes serviced by the carrier, and the service level. This takes time and experience. A carrier selection made without such complete knowledge has a cost impact.
As an example, prior data showed that 24% of the shipments that should have moved by courier instead moved by less than truckload, incurring the carrier's minimum trucking rate instead of the cheaper courier rate. If the correct mode of transport had been used, the company would have saved tens of thousands of dollars every year.
What if Artificial Intelligence could be used to automate and optimize such decision-making? Could Machine Learning techniques help buyers predict the freight cost based on origin, destination, and weight, and also tell them which carrier to use?
Let’s find out!
This project focuses on two objectives:
1. Predict freight rate for a shipment based on weight, origin, destination, historical base rate and historical fuel surcharge rate.
This would be a Regression Model. Regression predicts a continuous outcome based on the value of one or more predictor variables.
2. Select the best carrier for a shipment based on historical data with carrier information, weight, origin and destination.
This would be a Classification Model. Classification assigns a class label to an observation based on given input data. It is a supervised learning approach in which an algorithm learns from labelled input data and then uses that learning to classify new observations.
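As a minimal sketch of the two model types (this project itself was built in mlOS, not hand-coded; the data, names, and values below are made up for illustration), scikit-learn can show the difference between the two objectives:

```python
# Illustrative sketch only: toy data, not the project's real dataset or platform.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Toy encoded features: [origin_code, dest_code, weight_lbs]
X = [[0, 1, 100], [0, 2, 5000], [1, 2, 250], [1, 0, 12000]]

# Objective 1: regression predicts a continuous freight price.
prices = [85.0, 740.0, 120.0, 1900.0]
reg = LinearRegression().fit(X, prices)
predicted_price = reg.predict([[0, 2, 4800]])[0]  # a dollar amount

# Objective 2: classification predicts a carrier label.
carriers = ["CarrierA", "CarrierB", "CarrierA", "CarrierC"]
clf = DecisionTreeClassifier(random_state=0).fit(X, carriers)
predicted_carrier = clf.predict([[1, 0, 11000]])[0]  # one of the known labels
```

The same feature columns feed both models; only the target changes, which is why one dataset could serve both objectives.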
The goal was to increase the profitability for the company and at the same time help shipping vendors increase their trucking efficiency. We can win together!
Geographically, this project models shipments within Canada and between the US and Canada. It was designed to apply to regular shipments by courier, Less Than Truckload (LTL), and Full Truckload (FTL). It was not designed to apply to out-of-the-ordinary shipments such as rush or time-sensitive shipments, oversize shipments, and dangerous goods. Each of those unique situations could be modeled separately in the future. Assuming time-sensitive shipments, oversize shipments, dangerous goods, etc. are about 12% of the total volume of POs issued, this project covers the regular 88% of the POs.
Transportation companies use data to make sure their fleets are utilized efficiently, to optimize their capacity, to avoid empty miles, and more.
Mileage, fuel, delivery times, hours: all metrics are tracked, and some of that data can also be used by customers to optimize their loads.
The data chosen for this project is:
| Feature Name | Feature Description |
| --- | --- |
| Carrier | Name of transporters that offer Less Than Truckload, Full Truckload, and Courier service |
| SHIP_CITY | Origin city of the shipment, where it was picked up |
| CONS_CITY | Consignee city, the destination city of the shipment, the final place of delivery |
| RATED_WGT_LBS | Rated weight: the actual gross weight or the volumetric weight of the shipment, in pounds |
| BASE_RATE | Rate billed for the rated weight, excluding any other applicable surcharges |
| FSC_RATE | Fuel surcharge rate assessed by a carrier to account for regional/seasonal variations in fuel costs and to protect it from fuel-price volatility |
| Total_Price | Base rate + fuel surcharge. To standardize the calculations, other surcharges were not taken into consideration |
The raw data was collected from 8 carriers for the year 2021. The Total Price was calculated by adding Base Rate and FSC Rate. For this project, carrier names, base and FSC rates were changed to protect their data.
To keep this project simple and standard across all carriers, other surcharges were not considered in the calculation of the Total Price. This is because surcharges were a minor portion of the total cost; they can be added in a future experiment that generates a model to predict them.
Because the files came in different versions of Excel and CSV in all shapes and sizes, much time was spent cleaning up the data: filling in or deleting rows with missing values, spell-checking city names, and formatting the data to standardize it.
Once the data formats were standardized, the data obtained from various carriers were merged as one CSV file.
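This cleanup-and-merge step can be sketched with Python's standard `csv` module; the column handling and sample data below are hypothetical, not the project's actual scripts:

```python
# Hypothetical sketch of standardizing and merging carrier files.
import csv
import io

COLUMNS = ["CARRIER", "SHIP_CITY", "CONS_CITY", "RATED_WGT_LBS", "BASE_RATE", "FSC_RATE"]

def standardize(rows):
    """Normalize city casing, strip whitespace, and drop rows with missing values."""
    clean = []
    for row in rows:
        if any(row.get(c, "").strip() == "" for c in COLUMNS):
            continue  # drop incomplete rows
        row["SHIP_CITY"] = row["SHIP_CITY"].strip().title()
        row["CONS_CITY"] = row["CONS_CITY"].strip().title()
        clean.append({c: row[c].strip() for c in COLUMNS})
    return clean

def merge(file_contents):
    """Merge several carrier files (as CSV text) into one standardized list of rows."""
    merged = []
    for text in file_contents:
        merged.extend(standardize(csv.DictReader(io.StringIO(text))))
    return merged
```

A row with a blank consignee city, for example, is dropped, and " calgary " becomes "Calgary", so the same lane is spelled the same way across all eight carriers' files.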
For the regression model and classification model, the dataset comprised 4,315 records. Even though many columns of data were available from carriers, each variable was carefully analysed to determine which ones impact the final decision for the target output.
The project focused on working only with those variables instead of uploading irrelevant data.
Below is a sample of the data set used:
Predicting the freight rate would be a regression model for which the Target (Output) would be ‘Total_Price’, with the input variables being Carrier, Ship_City, Cons_City, Rated_Wgt_Lbs, Base_Rate, and FSC_Rate.
Predicting the best carrier would be a classification model for which the Target (output) would be ‘Carrier’ and the input variables would be Ship_City, Cons_City and Rated_Wgt_Lbs.
Predicting Freight Rate
Data Wrangling is the process of preparing data: rearranging, cleaning, and enriching the raw data into the usable format needed for model building.
The data was analysed to make sure each variable field had the right formatting and there were no missing values. The result of data wrangling was a clean dataset fit for ML purposes. There were 4,315 records.
The Target (Output) and the Input Features were selected.
To predict the freight rate, the ‘Total_Price’ was selected as the Target. The remaining features were selected as input variables.
Collecting data is expensive. Knowing the relative importance of features to predict the Target (output) is important because it provides insights about the data to only select the features most meaningful to the performance of the model.
There are two ways to accomplish such feature selection: either select only the relevant variables during model building, or remove the less influential variables at the data wrangling stage by clicking the red “x” mark beside a variable to remove it entirely.
As an example, the dataset originally contained a variable called ‘Weight Group’. This was a range based on the Less Than Truckload weight break, with values from 1 to 9 assigned to each line of data based on the rated weight of the shipment. It was determined that this variable would not impact the target output prediction, so it was removed from the dataset entirely.
The screenshot below illustrates the scores for each feature that was finally used.
In this step, the data gets transformed to make it easier for a machine to understand it.
Categorical to Numeric refactoring was done for text fields Carrier, Ship_City and Cons_City.
The dataset must be split into training and validation datasets. The purpose of this split is to give the machine learning algorithms the training dataset to learn from and then use the held-out validation dataset to test their performance.
The dataset was split in two parts – the training dataset was 80% and validation dataset was set to 20%.
The training dataset is used to train the model. The validation dataset is used to validate the predicted target output to test if the results were correctly predicted, or not.
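The 80/20 split can be sketched in plain Python; the fixed seed and toy record list are illustrative assumptions:

```python
# Illustrative 80/20 train/validation split with a fixed seed for reproducibility.
import random

def split_80_20(records, seed=42):
    shuffled = records[:]              # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)     # first 80% for training
    return shuffled[:cut], shuffled[cut:]

train, valid = split_80_20(list(range(100)))
```

Shuffling before cutting matters: without it, records sorted by carrier or date would put whole groups into only one of the two sets.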
Models are built by ML algorithms identifying patterns and learning from training data, then applying that knowledge to predict the Target (Output) variable for new data.
Several regression models were created by first creating a Model Container, selecting the dataset and by providing a name for Model Container.
AutoPilot was then run.
It built a dozen or so models and showed them on a leaderboard ranked by relative performance.
Model evaluation consists of a set of metrics used to measure the performance of a model and to select the best one to publish.
Mean Absolute Error (MAE) refers to the magnitude of difference between the prediction of an observation and the true value of that observation. A lower number means that the model is performing well.
Root Mean Square Error (RMSE) refers to how well a regression line fits the data points. A low number means the predicted regression curve is close to the actual data while a high number means the regression curve deviates a lot from the actual data.
The RMSE was observed to be 0.01, which means that the predicted values are very close to the actual values.
The selected regression model was published for review.
The reviewer can now evaluate the models and accept or reject based on the evaluation criteria.
Jaspreet Gill of Braintoy was the coach for the Applied Machine Learning program and the model reviewer for this project.
This problem was regression, as the target variable was numeric with infinite values.
To start the review, the performance metrics were shown for a comparison of the predicted results against the test data provided to the ML algorithms published. It all looked good so far.
But what if the data used to train the model and the data used to test it were dissimilar? The performance metrics would then carry less meaning.
The ‘skew’ view gives a visual hint that something might be wrong. The performance scores can only be relied on if the distribution of the data is similar for each feature in the training and test sets.
It was observed that the distribution of training vs. test for each feature were similar. The performance of the predicted results could be relied on.
The most common metrics used to evaluate the regression model are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Squared Log Error.
Mean Absolute Error (MAE): MAE measures the average magnitude of the errors. In other words, it’s the average over the test sample of the absolute differences between prediction and actual observation where all the individual differences have equal weight.
MAE score was 0.004. This indicates that the regression model is performing well.
Root Mean Squared Error (RMSE): Like MAE, RMSE is also measuring the average magnitude of the error. It is the square root of the average of the squared difference between prediction and actual observation.
A score of 0.01 indicated good performance.
Mean Squared Log Error (MSLE): MSLE is an extension of MSE, used when the prediction values can deviate widely. It is preferred when the target variable is normally distributed; however, the target variable Total_Price did not have a normal distribution.
This evaluation criterion was ignored.
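The three regression metrics discussed above have direct definitions; a stdlib sketch (the toy values in the usage are illustrative, not the project's predictions):

```python
# Illustrative implementations of MAE, RMSE, and MSLE.
import math

def mae(actual, predicted):
    """Average absolute difference between predictions and true values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def msle(actual, predicted):
    """Mean squared error on log1p-transformed values; log1p keeps it defined at zero."""
    return sum((math.log1p(a) - math.log1p(p)) ** 2
               for a, p in zip(actual, predicted)) / len(actual)
```

Because RMSE squares the errors before averaging, it penalizes large misses more than MAE does, which is why both are reported.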
Based on the above-mentioned performance metrics as well as the similar distribution for each feature in the training and test set, this model was approved for production use.
Once the model was reviewed and accepted, it was ready to be deployed as a microservice/app so that predictions can be made in real time.
This was done in a few clicks.
The dashboard is automatically generated for each deployed model.
Users and applications can now interact with the app that has containerized the model.
There are two ways to obtain predictions for the target output. One can either enter new inputs one-by-one and get a prediction in real-time from the API of the application that calls the request, or one can ‘Score’ the data in bulk where a spreadsheet containing numerous records can be uploaded to obtain predicted target (Output) in bulk.
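A real-time request to such an endpoint is essentially a JSON payload sent to the deployed app's API. The URL and field names below are hypothetical placeholders, not the actual mlOS API:

```python
# Hypothetical request payload for a deployed model endpoint.
# ENDPOINT and field names are made up for illustration.
import json
from urllib import request

ENDPOINT = "https://example.invalid/api/models/freight-rate/predict"  # placeholder

def build_request(ship_city, cons_city, rated_wgt_lbs):
    payload = json.dumps({
        "SHIP_CITY": ship_city,
        "CONS_CITY": cons_city,
        "RATED_WGT_LBS": rated_wgt_lbs,
    }).encode("utf-8")
    return request.Request(ENDPOINT, data=payload,
                           headers={"Content-Type": "application/json"})

req = build_request("Calgary", "Toronto", 150)
# urllib.request.urlopen(req) would then return the predicted Total_Price.
```

Bulk scoring follows the same idea, except the inputs travel as an uploaded spreadsheet rather than one JSON record at a time.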
Both methods were tried and tested.
Once the model is deployed, the modeler can interact with it in real time. The input boxes have a red line or a green line, with green depicting that the feature is of high importance and red depicting that it is less important for the specific record being predicted.
Below is a screenshot of real-time prediction.
It was observed that the model is predicting the right results.
Data Scoring is the process of obtaining predictions in a batch for numerous records.
The input format can be a spreadsheet in CSV format that can be uploaded to mlOS to perform the Data Scoring process.
It was observed that the model is predicting the right results.
The regression model is now deployed and available for production use.
But care is to be taken to monitor the model performance for signs of decay. The mlOS Job Scheduler does that.
But this article is not to go into it.
Let’s move to the next problem!
Predicting Best Carrier
This helps answer the question – if a shipment of ‘x’ pounds must move from one city to another, who is the best carrier?
Weight makes a significant difference in determining the right carrier.
As an example, it is just not cost-efficient to move a 10 lb shipment at a Less Than Truckload (LTL) rate. Doing so is not only more expensive, but the transit time is longer too!
In this case, even though the courier's rate per pound is higher, the best way to move a small 10 lb load is indeed by courier.
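The trade-off can be sketched with hypothetical rates; every dollar figure below is an assumption made up for illustration, not an actual carrier rate:

```python
# Hypothetical rate comparison: courier per-pound rate vs LTL minimum charge.
COURIER_RATE_PER_LB = 3.00   # higher per-pound rate (assumed)
LTL_RATE_PER_LB = 0.40       # lower per-pound rate (assumed)
LTL_MINIMUM_CHARGE = 100.00  # minimum trucking rate (assumed)

def cheapest_mode(weight_lbs):
    """Return (mode, cost) for the cheaper of courier vs LTL under the assumed rates."""
    courier_cost = weight_lbs * COURIER_RATE_PER_LB
    ltl_cost = max(weight_lbs * LTL_RATE_PER_LB, LTL_MINIMUM_CHARGE)
    return ("courier", courier_cost) if courier_cost < ltl_cost else ("LTL", ltl_cost)
```

Under these assumed rates, a 10 lb shipment costs $30 by courier but $100 by LTL (the minimum charge kicks in), so courier wins despite its higher per-pound rate; at 1,000 lbs the comparison flips.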
And putting ourselves in the shoes of a supplier: why should a truck go empty? Why should we not be able to predict shipping time? These experiments indicated that there is value when the company and the supplier work together.
While analyzing the data, it was determined that approximately 24% of less-weight shipments (under 150 lbs) moved by Less Than Truckload (LTL) service instead of courier, costing the company approximately double the amount of freight cost and probably as much to the suppliers as well in efficiency losses.
Many other similar anomalies were observed.
What if we could collectively gain the value from being efficient together!
The remaining sections of this article are condensed to what was done and why, since the how was already covered in the previous sections.
With the same set of data that was utilized for the Regression Model, a Classification Model was now created.
Variables such as BASE_RATE, FSC_RATE, and TOTAL_PRICE were not selected as those were not relevant to the problem.
CARRIER was the Target (Output). SHIP_CITY, CONS_CITY, and RATED_WGT_LBS were the Input Features.
RATED_WGT_LBS was observed to be most significant for predicting CARRIER.
Categorical to Numeric refactoring was applied to text fields CARRIER, SHIP_CITY, and CONS_CITY.
The data was split in an 80:20 ratio for the training dataset and validation dataset respectively.
AutoPilot was run to view the list of classification models ranked from best to worst.
The model was observed to have an accuracy of 89.43%.
To improve model efficiency, data was replicated to add more weight breaks per origin-destination lane and balance the data. For example, 10 to 15 lines of origin, destination, weight break, and carrier were added for each lane.
The new dataset sample size is below:
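This balancing-by-replication step is a simple form of oversampling; a stdlib sketch with illustrative carrier labels:

```python
# Illustrative oversampling: replicate minority-class rows until classes balance.
from collections import Counter

def oversample(rows, label_key="CARRIER"):
    """Repeat existing rows of under-represented carriers until all counts match."""
    counts = Counter(r[label_key] for r in rows)
    target = max(counts.values())
    balanced = list(rows)
    for label, n in counts.items():
        extras = [r for r in rows if r[label_key] == label]
        i = 0
        while n < target:
            balanced.append(extras[i % len(extras)])  # cycle through this carrier's rows
            i += 1
            n += 1
    return balanced
```

Replication gives rare lanes and carriers more weight during training, at the cost of a larger (partly duplicated) dataset.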
The best model now came out to be v18-v.8c0 using DecisionTreeClassifier that showed an accuracy of 95.94%.
For this model, the performance metrics are below:
The predicted results were observed to match the original results most of the time.
Besides just accuracy, a few other performance metrics were checked to evaluate this model.
Confusion Matrix (below) is a square matrix whose dimensions depend on the number of classes in the model. If the predictions fall on the diagonal, it means that the classes are being predicted correctly.
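A confusion matrix can be built directly from the predictions; the labels and values below are toy data for illustration:

```python
# Illustrative confusion matrix: rows = actual class, columns = predicted class.
def confusion_matrix(actual, predicted, labels):
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

labels = ["CarrierA", "CarrierB"]
cm = confusion_matrix(
    ["CarrierA", "CarrierA", "CarrierB", "CarrierB"],
    ["CarrierA", "CarrierB", "CarrierB", "CarrierB"],
    labels,
)
# Diagonal entries cm[0][0] and cm[1][1] count correct predictions;
# off-diagonal entries show which carriers get confused with which.
```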
The Receiver Operating Characteristic (ROC) curve (above) was used to check whether the classifier has learned or has simply memorized. The greater the Area Under the Curve (AUC), the better the model has learned. The AUC was observed to be 0.94, indicating good learning.
The selected classification model was published for review. The reviewer can now evaluate the models and accept or reject them based on the evaluation criteria.
The first thing to look for are the performance metrics calculated by comparing predicted results against the test data provided to the algorithms.
But what if the data used to train the model and the data used to test it were dissimilar? The performance metrics would then carry less meaning.
The ‘skew’ view gives a visual hint that something might be wrong. The performance scores can only be relied on if the distribution of the data is similar for each feature in the training and the test set.
It was observed that the distribution of training vs. test for each feature were similar. The performance metrics could be relied on.
The second part of the problem was a supervised classification problem, which showed an accuracy of 90.11%.
This seemed to be a good model.
But while accuracy is the simplest measure to evaluate model performance, the other evaluation criteria such as ROC curve, Confusion Matrix, F1-score, Hamming Loss, Precision, Recall, and the Jaccard Score are also important.
Precision answers the question – what proportion of positive identifications were actually correct? Recall answers the question – What proportion of actual positives were correctly classified? Both Precision and Recall scores were 0.9 indicating that the model is performing well.
The F1 score is a summary indicator: the harmonic mean of Precision and Recall. In this case, an F1 score of 90.11% indicated good performance.
The Hamming Loss indicates the fraction of records that were incorrectly predicted. It equals 1 minus accuracy for single-label classification problems such as this one, but it becomes more relevant for multi-label classification, where the hamming loss averages the ‘loss’ from each label across the dataset. A score of 0.1 shows that the published model has little ‘loss’.
The Jaccard Score measures the similarity between the predictions and the test set as the size of their intersection divided by the size of their union. Measured between 0 and 1, a higher score indicates that the predictions overlap more with the test set. In this case, a Jaccard score of 0.7 indicated good performance.
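These metrics all have direct definitions; an illustrative stdlib sketch for a single positive class (the toy labels in the usage are made up, not the project's results):

```python
# Illustrative per-class precision, recall, F1, hamming loss, and Jaccard score.
def classification_metrics(actual, predicted, positive):
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0    # predicted positives that were right
    recall = tp / (tp + fn) if tp + fn else 0.0       # actual positives that were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    hamming = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "hamming_loss": hamming, "jaccard": jaccard}
```

In a multi-class setting such as carrier selection, these per-class scores would be averaged across all carriers.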
Based on these performance metrics as well as the ROC and the Confusion Matrix described in the previous section, the model was approved for production use. The modeler can now deploy it within an API/microservice.
This model V18.v8 was deployed.
It was verified that the model predicted correct values.
The input boxes have a red line or a green line, green depicting the highest importance and red depicting lower importance for the specific record being predicted.
The data was also scored in batch/bulk.
We were good to go!
Working with ML has been a huge learning experience.
There is a method to building production models consistently. There are steps to follow. The goal is to make the models perform better by structuring the data and the ML algorithms differently in fast, iterative experiments.
Models perform better as more data comes in. Once this model is deployed, data from carriers must be uploaded every week or month to keep the fuel surcharge close to actual, or special coding must be done to calculate the fuel surcharge from the market price.
Better decisions can be made by identifying patterns from data.
The cost savings are:
- Right Carrier Selection and MOT: It was observed that ~24% of the shipments had an incorrect mode of transport; approximately 1,035 shipments moved by LTL instead of courier. This means that instead of paying the cheaper courier fees, the minimum rate for an LTL truck was paid, and transit took 2 to 3 days longer than courier transit. Assuming an average minimum trucking rate of $100, the total cost was $103,500. If ML had been used and the courier MOT chosen for those shipments, even at an estimated cost of $30 per shipment, the total cost would have been only $31,050: a direct cost savings of $72,450 per year.
- Time is money: Assume a company issues ~300 POs between US and Canadian vendors per day and buyers get frequent emails from vendors asking for shipping instructions. Assuming ~50% of vendors need specific shipping instructions and that it takes a buyer an average of at least ~5 minutes to provide this information, that is 3 hours per day of each buyer's time spent on non-value-adding repetitive tasks. If ML were used instead, decisions could be made in a fraction of a second, freeing buyers to focus on higher-value strategies.
- Savings apply not only to the company but also to the shipper: For every dollar saved by the company, it is expected that the transporter also saves as much. The value is in working together. AI can help both.
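The arithmetic behind the first bullet can be checked directly (the $100 and $30 per-shipment figures are the article's own assumptions):

```python
# Verifying the cost-savings arithmetic from the first bullet above.
shipments = 1035               # shipments that moved by LTL instead of courier
ltl_min_rate = 100             # assumed average minimum trucking rate, $
courier_rate = 30              # assumed courier cost per shipment, $

ltl_total = shipments * ltl_min_rate       # what was actually paid
courier_total = shipments * courier_rate   # what courier would have cost
savings = ltl_total - courier_total        # annual direct savings
```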
ML can help SCM teams perform better, delivering easy cost savings and boosting efficiency by performing complex repetitive tasks at a scale and speed that are impossible to achieve manually.
It is mind-blowing to look back at the steps performed to transfer my knowledge to a machine so it can help me make informed decisions and bring better value to the business.
Anitha Ilangovan has been a logistics professional for over 15 years.
She has an undergraduate degree in Statistics from Madras Christian College, Chennai, India, and has an SCMP designation from Supply Chain Canada.
She is passionate about continuous learning and applying it to transformation.