Exercise Solutions of the Book Forecasting: Principles and Practice 3rd Edition
Chapter 1: Getting Started
What are the possible predictor variables here?
Case 3:
A large car fleet company asked us to help them forecast vehicle resale values. They purchase new vehicles, lease them out for three years, and then sell them. Better forecasts of vehicle sales values would mean better control of profits; understanding what affects resale values may allow leasing and sales policies to be developed in order to maximise profits.
At the time, the resale values were being forecast by a group of specialists. Unfortunately, they saw any statistical model as a threat to their jobs, and were uncooperative in providing information. Nevertheless, the company provided a large amount of data on previous vehicles and their eventual resale values.
Solution:
We do have enough data about the history of resale values of vehicles. But what the data contains is not mentioned here. Let’s find out what we will need.
First, it’s good to have the car details, like the manufacturer and its model. Second, details like the engine power, engine type, etc. will also be useful. In short, every detail we look at when buying a new car should be used as a predictor variable, AKA feature, for forecasting. The most important ones are the real value of the car and its manufacturing year. Apart from that, we can have the car’s rating and the car’s price hike / drop over these three years.
Now, let’s look at the limitations of a few features. For example, consider the feature “car model”. The manufacturer might release a new model that has never been produced before, and our forecasting algorithm will not recognize it. So I think the model tag is not very useful in such cases.
But our end goal is to be as precise as possible when predicting the resale value, right? And we can expect the model tag alone to give us a boost in accuracy, because most people recognize cars by model and not by engine power. To be able to use this predictor variable, i.e. model, we can think differently.
Let’s have two forecasters: one specially trained on the car’s model tag, and another for unknown models, i.e. new cars. This also depends on the cost factor of the project, as we already know the company is always buying new cars and selling them after three years. Here, the first forecaster depends on past resale prices, but does not need to depend on other features. It could be a time series model. The other depends mostly on external variables, and hence will be an explanatory model.
Explanatory model:
$ED = f(\text{current temperature}, \text{strength of economy}, \text{population}, \text{time of day}, \text{day of week}, \text{error})$
Time series model:
$ED_{t+1} = f(ED_t, ED_{t-1}, ED_{t-2}, ED_{t-3}, \ldots, \text{error})$
We can also try using hybrid models depending on the availability of data.
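A minimal sketch of the two-forecaster idea, under assumptions that are not in the original: the function names, the three-sale threshold, the weights and all the numbers below are hypothetical, and the averaging and linear formulas are only stand-ins for whatever time series and explanatory models are finally chosen.

```python
from statistics import mean


def time_series_forecast(history):
    """Stand-in time series forecaster: average of the last three resale values."""
    return mean(history[-3:])


def explanatory_forecast(features, weights, intercept):
    """Stand-in explanatory forecaster: a linear function of external features."""
    return intercept + sum(weights[name] * value for name, value in features.items())


def forecast_resale(model_tag, features, history_by_model, weights, intercept):
    """Route to the time series forecaster when the model tag has enough history."""
    history = history_by_model.get(model_tag, [])
    if len(history) >= 3:  # hypothetical threshold for "known" models
        return time_series_forecast(history)
    return explanatory_forecast(features, weights, intercept)


# Hypothetical data, for illustration only.
history_by_model = {"hatch_a": [9000.0, 9200.0, 9400.0]}
weights = {"purchase_price": 0.5, "odometer_km": -0.02}
intercept = 1000.0
car = {"purchase_price": 20000.0, "odometer_km": 60000.0}

known_forecast = forecast_resale("hatch_a", car, history_by_model, weights, intercept)
new_forecast = forecast_resale("suv_b", car, history_by_model, weights, intercept)
print(known_forecast, new_forecast)
```

The routing keeps the model tag useful for cars with a resale history while still giving an answer for models the data has never seen.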
One other important factor is to consider events and activities: the odometer reading, whether the car has active or past accident cases, the insurance amount, and so on.
The price might also depend a bit on demand and supply, seasonality, geographical region, etc. But I think it’s better to have a separate algorithm for that, which increases or decreases the predicted resale value based on such things.
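One way to picture such a separate adjustment step is a set of percentage corrections applied on top of the base forecast. This is only a sketch: the function name, the factor names and the percentages are all hypothetical, and multiplicative corrections are just one possible choice.

```python
def adjust_forecast(base_value, adjustments):
    """Apply each fractional adjustment multiplicatively to the base forecast."""
    adjusted = base_value
    for factor in adjustments.values():
        adjusted *= 1.0 + factor
    return adjusted


# Hypothetical corrections: +3% for strong regional demand, -1% for the season.
adjustments = {"regional_demand": 0.03, "season": -0.01}
adjusted_value = adjust_forecast(10000.0, adjustments)
print(round(adjusted_value, 2))
```

Keeping this step outside the main forecaster means the corrections can be changed without retraining the model.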
Now, something that is important, and that the firm might not be able to provide data about, is the economic situation of the nation. The best example as of writing this article is the pre-COVID time. People thought it would be nice to have a car so that they wouldn’t have to travel in the open, or maybe had some other reasons, and there was a 2–7% hike in the price of used cars in those days. But I think this too should be handled separately, as it might be rare and, depending on the situation, the price may rise or fall by different amounts.
I forgot one other important point while I was typing the above part. Maybe you can help?
Case 4:
In this project, we needed to develop a model for forecasting weekly air passenger traffic on major domestic routes for one of Australia’s leading airlines. The company required forecasts of passenger numbers for each major domestic route and for each class of passenger (economy class, business class and first class). The company provided weekly traffic data from the previous six years.
Air passenger numbers are affected by school holidays, major sporting events, advertising campaigns, competition behaviour, etc. School holidays often do not coincide in different Australian cities, and sporting events sometimes move from one city to another. During the period of the historical data, there was a major pilots’ strike during which there was no traffic for several months. A new cut-price airline also launched and folded. Towards the end of the historical data, the airline had trialled a redistribution of some economy class seats to business class, and some business class seats to first class. After several months, however, the seat classifications reverted to the original distribution.
Solution:
I’m going to cut this short as most of the details are already mentioned in the problem description itself, like seasonal data, holidays, festivals, etc.
This model can easily be thought of as a time series model. But I think a hybrid would generalize better over the long term.
Hybrid:
$ED_{t+1} = f(ED_t, \text{current temperature}, \text{time of day}, \text{day of week}, \text{error})$
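For the airline case, the inputs of such a hybrid could be assembled as below. This is a sketch under assumptions: the function name, the column names and the passenger counts are hypothetical, and a school holiday flag stands in for the external predictors listed in the problem description.

```python
def build_hybrid_rows(passengers, holiday_flags):
    """Pair each week's traffic with the previous week's traffic and a holiday flag."""
    rows = []
    for t in range(1, len(passengers)):
        rows.append({
            "lag_1": passengers[t - 1],          # time series part: last week's traffic
            "school_holiday": holiday_flags[t],  # explanatory part: external predictor
            "target": passengers[t],             # value the model should learn to predict
        })
    return rows


# Hypothetical weekly passenger counts for one route and class.
weekly_passengers = [1200, 1250, 1600, 1580, 1230]
school_holiday = [0, 0, 1, 1, 0]

rows = build_hybrid_rows(weekly_passengers, school_holiday)
for row in rows:
    print(row)
```

Each row can then be fed to any regression model, so the same table serves both the lagged and the external predictors.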
As mentioned in the above solution, we need to consider the nation’s economic situation. Other things like fuel prices and metal/material costs should also be considered. All of these will help the model better understand ticket costs.
Though encoding the above-mentioned features might seem a bit hard to you, it’s actually very easy. But it is outside the scope of this blog post, and hence I will explain it separately.
What are the steps of forecasting in case 3?
Step 1: Problem definition
Discuss what I said above with the organization and ask what they think about it. Make the necessary changes to the feature set / predictor variables, and rethink the approaches. Discuss the feasibility of the two different models with the experts in the organization. Finalize everything you will be building.
Step 2: Gathering Information
Collect data from the organization and/or third party sources. The data here means all the features finalized in the above step. Organize and store the information appropriately.
Step 3: Preliminary (exploratory) analysis
Encode the collected data, analyze it and gain insights into feature importance. Plot graphs as needed and create an observation sheet which contains all the information. The main plots here would be a correlation heatmap, the time series line plot of resale values for a specific model, scatter plots of the first 2 principal components, and so on.
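The numbers behind a correlation heatmap can be computed before any plotting. The sketch below uses the Pearson correlation between each feature and the resale value; the feature names and all values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical vehicle records, one list entry per vehicle.
vehicles = {
    "odometer_km": [30000, 45000, 60000, 80000],
    "purchase_price": [20000, 22000, 21000, 25000],
    "resale_value": [14000, 13500, 12000, 11500],
}

correlations = {
    name: pearson(values, vehicles["resale_value"])
    for name, values in vehicles.items()
    if name != "resale_value"
}
for name, value in correlations.items():
    print(name, round(value, 3))
```

A strongly negative or positive value flags a feature worth a closer look in the observation sheet; a value near zero only rules out a linear relationship.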
Step 4: Choosing and fitting models
I mostly think a Dense network would work for the second model, i.e. the one trained for newly released cars in the market, and a Recurrent Network would work just fine for the car models that have already been released and sold a few times before. Also, having separate models per car would seem great (highly accurate), but I don’t think it’s feasible. The models should be able to capture what is needed if the model tag is provided as input in some encoded way. Also, before trying an RNN, give simple curve fitting, i.e. linear regression, a try, as I think there won’t be much difference in the resale price of the same car model over 10 years.
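The simple curve-fitting baseline can be sketched as an ordinary least squares line through the resale values of one car model. The years and prices below are hypothetical, and the helper name is only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical resale values of one car model by year of sale.
sale_year = [2015, 2016, 2017, 2018, 2019]
resale_value = [12000.0, 12100.0, 12200.0, 12300.0, 12400.0]

slope, intercept = fit_line(sale_year, resale_value)
forecast_2020 = intercept + slope * 2020
print(round(slope, 2), round(forecast_2020, 2))
```

If a line like this already forecasts well, it also gives a benchmark that any neural network has to beat to justify its extra cost.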
Step 5: Using and evaluating a forecasting model
The time span for future predictions is not mentioned clearly. Hoping it is discussed well in step 1 with the organization, the evaluation can be done by either waiting for live data, or by having a train-test split, or both. Just remember to use a stratified split of the data if you are doing a train-test split, and for the time series model keep the test period after the training period, so that the model is never evaluated on data older than what it was trained on.
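For the time series forecaster, one common way to do the train-test split is chronological: the most recent observations are held out. The sketch below assumes this approach, uses hypothetical resale values, and scores a naive forecast (repeat the last training value) with the mean absolute error.

```python
def chronological_split(series, test_size):
    """Hold out the last `test_size` observations; `test_size` must be at least 1."""
    if test_size < 1 or test_size >= len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical resale values in time order.
resale_values = [12000.0, 12100.0, 12200.0, 12300.0, 12400.0, 12500.0]

train, test = chronological_split(resale_values, test_size=2)
naive_predictions = [train[-1]] * len(test)  # repeat the last training value
mae = mean_absolute_error(test, naive_predictions)
print(train, test, mae)
```

The naive forecast is only a reference point; any candidate model should be scored on the same held-out period.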
Thanks for reading. This exercise is simply what a Solution Architect does in the initial phase of a Machine Learning project.