Using AI to predict New York taxi fares
After a colleague returned from a trip to New York, they reviewed their spending and were shocked to see that taxi fares were their second-highest expense. They were curious to find out whether there was any way they could have predicted this before their trip, and whether they could have budgeted any better. As it stands, this problem is also felt by the wider New York population: in New York City, yellow taxi passengers cannot get an advance estimate for fares.
The New York City Taxi and Limousine Commission’s official stance is: “It is impossible to pre-calculate a fare, because the meter rate depends on traffic, construction, weather, and route to the destination.”
After hearing my colleague complain about their holiday spending (rather than tell us about the sightseeing that we really wanted to hear about) the team saw a problem to be solved. We had a strong feeling that AI and Machine Learning could work effectively to predict these taxi costs.
In true American fashion, we created a “7 Step Program” to ensure our Machine Learning function was usable and replicable:
Step 1: dataset exploration and data interrogation
Exploring and “getting curious” about the data is imperative to building a model with usable data. We created two datasets, train and test, identical except for one column: fare amount. This would enable us to run models, predict fares and then compare the predictions against actual prices. At this stage we got very familiar with the data, spending time understanding how each column in the data dictionary could help inform the model and prediction.
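As a rough illustration of the train and test idea, the sketch below splits a handful of made-up trips and strips the fare from the test rows, keeping the real fares aside for comparison later. The column names (trip_id, distance_km, fare_amount), the 20% test fraction and the helper name split_train_test are our assumptions for the example, not the actual dataset schema or the code used in the project.

```python
import random

def split_train_test(rows, target="fare_amount", test_fraction=0.2, seed=0):
    """Shuffle the rows, hold some back as a test set and hide their fares."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]
    # Keep the real fares aside so predictions can be checked against them.
    actual_fares = [row[target] for row in test_rows]
    # The test set is the same as train, minus the fare column.
    test_features = [
        {key: value for key, value in row.items() if key != target}
        for row in test_rows
    ]
    return train_rows, test_features, actual_fares

# Made-up example trips (hypothetical values, for illustration only).
trips = [
    {"trip_id": i, "distance_km": 1.0 + i, "fare_amount": 3.0 + 2.0 * i}
    for i in range(10)
]
train, test, actual = split_train_test(trips)
```

Seeding the shuffle means the same trips land in the test set every time, which makes results comparable from one run to the next.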
Step 2: cleaning the data
Rubbish in, rubbish out: cleaning the data is imperative to accurate results. For this project we needed to review the distribution of each data dictionary element and identify anomalies so that we could exclude irrelevant outliers. At this point in the model build it’s also important to enhance the data points available and add more features. We do not simply accept what we have: it’s important to think about what key information is needed and how to get it from the data you have.
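To make the cleaning and feature-building idea concrete, here is a minimal sketch that drops obviously anomalous rows and adds a new feature, the straight-line distance between pickup and drop-off. The field names, the fare and distance thresholds and the sample coordinates are assumptions chosen for illustration; the real rules would come from looking at the actual distributions.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def clean_and_enrich(rows, max_fare=250.0, min_distance_km=0.05):
    """Drop anomalous rows and add a distance_km feature to the rest."""
    cleaned = []
    for row in rows:
        # Assumed rule: fares must be positive and below a generous ceiling.
        if not (0 < row["fare_amount"] <= max_fare):
            continue
        distance = haversine_km(row["pickup_lat"], row["pickup_lon"],
                                row["dropoff_lat"], row["dropoff_lon"])
        # Assumed rule: trips that barely move are treated as bad records.
        if distance < min_distance_km:
            continue
        cleaned.append({**row, "distance_km": distance})
    return cleaned

# Made-up rows: one plausible trip, one negative fare, one trip that goes nowhere.
raw_rows = [
    {"fare_amount": 18.5, "pickup_lat": 40.7580, "pickup_lon": -73.9855,
     "dropoff_lat": 40.7061, "dropoff_lon": -74.0087},
    {"fare_amount": -3.0, "pickup_lat": 40.7580, "pickup_lon": -73.9855,
     "dropoff_lat": 40.7061, "dropoff_lon": -74.0087},
    {"fare_amount": 8.0, "pickup_lat": 40.7580, "pickup_lon": -73.9855,
     "dropoff_lat": 40.7580, "dropoff_lon": -73.9855},
]
cleaned_rows = clean_and_enrich(raw_rows)
```

Straight-line distance is only a proxy for the route a taxi actually takes, but it is a feature that can be derived from coordinates alone.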
Step 3: investigating the different variables for driving fare amount
At this point we need to start visualising the data to ensure the variables put into our model impact and correlate with the fare amount. In particular, we’re looking for linear correlations between each variable and the fare amount.
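Alongside the charts, a linear correlation can be put into a single number. The sketch below computes the Pearson correlation coefficient by hand for a few made-up distance and fare pairs; the values and variable names are illustrative assumptions, not figures from our dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling one."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return covariance / math.sqrt(spread_x * spread_y)

# Made-up example values (hypothetical): trip distance in km and fare in dollars.
distances_km = [1.0, 2.0, 3.0, 4.0, 5.0]
fares = [5.1, 6.9, 9.2, 10.8, 13.0]

distance_vs_fare = pearson_r(distances_km, fares)
```

A value close to 1 for distance against fare would support keeping distance in the model, while a value near 0 would suggest the variable has no linear relationship with the fare.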
Step 4: finding insight to help shape the model design
We can generate a wealth of variable visualisations, then use them to prove or disprove hypotheses that help shape the model and overall reporting.
Step 5: start algorithm training – choosing a basic function
We chose a general linear model for this stage, as these models are highly flexible and provide clear outputs of effect sizes and statistical significance. This function also helps us validate whether the previous steps have been successful. It enabled us to successfully prove the relationship between key variables.
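For a feel of what a basic linear function looks like, the sketch below fits a straight line to made-up distance and fare pairs using ordinary least squares. It is a deliberately simplified, single-variable stand-in for a general linear model, and it reports only the slope, intercept and R-squared rather than full significance tests; the data and function names are assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in fares that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Made-up example values (hypothetical): trip distance in km and fare in dollars.
distances_km = [1.0, 2.0, 3.0, 4.0, 5.0]
fares = [5.1, 6.9, 9.2, 10.8, 13.0]

slope, intercept = fit_line(distances_km, fares)
fit_quality = r_squared(distances_km, fares, slope, intercept)
predicted_fare = intercept + slope * 3.5   # estimate for a 3.5 km trip
```

Here the slope plays the role of the effect size: it is the estimated extra fare per additional kilometre.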
Step 6: refining and developing the algorithm function
Open-source libraries can be used to find function elements to add to the algorithm. The XGB function was used for this model, as it ranks feature importance and builds the model by gradient boosting. The algorithm continues to run, adding a new model at each round, until it finds the best fit. It’s at this point the algorithm has machine learning capability: each new model predicts the errors of the previous versions, and together they make the final prediction.
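The idea of new models predicting the errors of earlier ones can be shown in a few lines. The sketch below is a toy version of boosting written from scratch, using one-split "stumps" on a single made-up variable. It is not XGBoost itself, which adds regularisation, deeper trees and many optimisations; the data, learning rate and number of rounds are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one split on x that best separates the current errors."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # skip the largest x so both sides have data
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]                              # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the average fare, then keep adding stumps fitted to the errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]   # errors so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps

def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Made-up example values (hypothetical): trip distance in km and fare in dollars.
distances_km = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fares = [5.0, 7.0, 9.5, 11.0, 13.0, 16.0, 17.5, 20.0]

model = fit_boosted(distances_km, fares)
average_fare = sum(fares) / len(fares)
baseline_mse = mean_squared_error(fares, [average_fare] * len(fares))
boosted_mse = mean_squared_error(fares, [predict_boosted(model, x) for x in distances_km])
```

Each round only nudges the prediction by a fraction of the stump's output (the learning rate), which is why many small rounds are used rather than one large correction.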
Step 7: operationalising the model
To continue its usage and prediction capabilities, we need to ensure the model can be understood by any other colleagues who plan on going to New York, so they can calculate their spending money requirements ahead of the trip. We visualised the model using Tableau to share the information around the team. We also recognise that models can always be made stronger, so we are continuing to seek out other rates to help decrease the error rate and enhance predictions.
And there you have it, the Station10 7 steps to New York City taxi fare prediction. Get in touch if you want to hear more about our Machine Learning work, or if you want to know how much your holiday taxi fares will cost...