Sales Forecast Part 1
Author: Charlie Cao
This is the first of a series of two notebooks on the topic of Sales Forecast. Through this series, we want to showcase one of the many ways to explore and forecast time series data. We encourage you to create your own Jupyter Notebook and follow along. You can also download this Notebook along with any accompanying data in the Notebooks and Data GitHub Repository. Alternatively, if you do not have Python or Jupyter Notebook installed yet, you may experiment with a virtual Notebook by launching Binder or Syzygy below (learn more about these two tools in the Resource tab).
Background
In this first Notebook, we will focus on exploring and manipulating our data, while the second Notebook will focus on modeling the data. Understanding the data is important for analytics, and we recommend that you read this Notebook before diving into modeling (Part 2) in order to gain a better grasp of our large and messy datasets. The datasets that we are using consist of sales records for a retailer with 45 stores, each containing several departments. They are already included in the GitHub Repository where this Jupyter Notebook is located (please see the “Data” folder), but you can also find them on this Kaggle page.
Let’s first start by importing the libraries we need and loading our data:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# ipywidgets are used to make interactive contents in Jupyter notebooks
from ipywidgets import interact
df_sales = pd.read_csv('nb0008_data/sales.csv')
df_features = pd.read_csv('nb0008_data/features.csv')
df_stores = pd.read_csv('nb0008_data/stores.csv')
# converting the date column (initially stored as strings) to dates
df_sales['Date'] = pd.to_datetime(df_sales['Date'], format='%d/%m/%Y')
df_features['Date'] = pd.to_datetime(df_features['Date'], format='%d/%m/%Y')
df_sales['Store'] = df_sales['Store'].astype('category')
df_sales['Dept'] = df_sales['Dept'].astype('category')
df_features['Store'] = df_features['Store'].astype('category')
df_stores['Store'] = df_stores['Store'].astype('category')
df_stores['Type'] = df_stores['Type'].astype('category')
# Information on the SettingWithCopyWarning that might occur:
# https://www.dataquest.io/blog/settingwithcopywarning/
Exploratory Data Analysis
Before we build a model, it is always a good idea to take a look at the dataset itself. Here are some questions that we might want to ask ourselves when exploring our data:
- What information is recorded? What do the different columns contain?
- Since we are working with time series data, how frequently are the data recorded?
- How many missing values are there in each column?
- What are the distributions of the different columns?
- Are there any columns that are highly correlated with another?
After a preliminary exploration of the data, if there are several missing values or if the distribution of a column is highly left/right skewed, both of which could potentially affect the accuracy of our models, then we would want to first think about replacing or transforming the data before building, comparing, and selecting our models. Since we have a big dataset containing data from multiple stores and multiple departments, we can create an interactive (and basic) widget using ipywidgets for a preliminary inspection of the sales records at each store and department (please read its documentation for more information or for tutorials).
def sales_viz(store_num, dept_num):
    fig, axes = plt.subplots(1,2, figsize=(10,3), dpi=100)
    # Subset the data based on store and department numbers
    sales_temp = df_sales[(df_sales.Store==store_num) & (df_sales.Dept==dept_num)][['Date', 'Weekly_Sales']]
    # Create time series dataframe by using date as index
    ts_sales_temp = sales_temp.set_index('Date')
    ts_sales_temp.plot(ax=axes[0], legend=False)
    ts_sales_temp.hist(ax=axes[1], bins=100)
    display(ts_sales_temp.describe().transpose().round(2))
    plt.show()
    return
display(interact(sales_viz, store_num={n:n for n in range(1,46)}, dept_num={n:n for n in range(1,100)}))
Inspection
Sales
Now we can take a look at the sales dataframe, which has 5 fields (columns) and 421,570 observations (rows). Note that we manually changed the data types of the ‘Store’, ‘Dept’, and ‘Date’ columns when we imported the data.
print(df_sales.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 421570 non-null category
1 Dept 421570 non-null category
2 Date 421570 non-null datetime64[ns]
3 Weekly_Sales 421570 non-null float64
4 IsHoliday 421570 non-null bool
dtypes: bool(1), category(2), datetime64[ns](1), float64(1)
memory usage: 7.6 MB
None
display(df_sales.describe().transpose().round(2))
print("CAUTION: {:.2%} of the `Weekly_Sales` column is negative."
.format(len([n for n in df_sales['Weekly_Sales'] if n < 0])/
len(df_sales['Weekly_Sales'])))
fig, axes = plt.subplots(1,3, figsize=(20,4))
df_sales['Weekly_Sales'].hist(bins=100, ax=axes[0])
sns.boxplot(x='Store', y='Weekly_Sales', whis=100, data=df_sales, ax=axes[1])
sns.boxplot(x='Dept', y='Weekly_Sales', whis=100, data=df_sales, ax=axes[2])
plt.show()
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Weekly_Sales | 421570.0 | 15981.26 | 22711.18 | -4988.94 | 2079.65 | 7612.03 | 20205.85 | 693099.36 |
CAUTION: 0.30% of the `Weekly_Sales` column is negative.
display(df_sales.drop('Weekly_Sales', axis=1).describe().transpose())
fig, axes = plt.subplots(1,3, figsize=(20,4))
sns.countplot(x='IsHoliday', data=df_sales, ax=axes[0])
sns.countplot(x='Store', data=df_sales, ax=axes[1])
sns.countplot(x='Dept', data=df_sales, ax=axes[2])
plt.show()
| | count | unique | top | freq | first | last |
---|---|---|---|---|---|---|
Store | 421570 | 45 | 13 | 10474 | NaT | NaT |
Dept | 421570 | 81 | 1 | 6435 | NaT | NaT |
Date | 421570 | 143 | 2011-12-23 00:00:00 | 3027 | 2010-02-05 | 2012-10-26 |
IsHoliday | 421570 | 2 | False | 391909 | NaT | NaT |
From the above charts and tables, we can see that:
- There are 143 unique dates (421,570 records in total) from 2010-02-05 to 2012-10-26 in the sales dataframe.
- There are no missing values in any of the columns in this dataframe, which is good.
- However, the Weekly_Sales column is extremely right skewed, with a mean of \$15,981 but a 50th percentile (which is essentially the same as the median) of only \$7,612. The range of the column is also fairly wide, from -\$4,989 to \$693,099.
- When grouped by stores and departments, respectively, it seems that a smaller proportion of departments is driving the overall skew across stores.
- 391,909 of the 421,570 records are not holidays (i.e., only about 7% of the records are holidays), which could potentially explain why the Weekly_Sales records are extremely right skewed (maybe most of the sales happen on holidays).
- 99.68% of the Weekly_Sales values are positive, which means that we could potentially drop the negative values without significantly affecting the integrity of our data. Having all positive numbers would allow us to take the logarithm of this column, which helps to make the records less skewed (see the sketch after this list).
- There seem to be fewer records for certain stores and departments.
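To make the skew concrete, here is a minimal sketch (our own addition, not part of the original analysis) that drops the non-positive values and compares the skewness of the raw and log-transformed Weekly_Sales column, using the scipy.stats module imported earlier. The variable names are illustrative only.
# Keep only strictly positive weekly sales so the logarithm is defined
positive_sales = df_sales.loc[df_sales['Weekly_Sales'] > 0, 'Weekly_Sales']
# Compare skewness before and after the log transform
print('Skewness of raw Weekly_Sales: {:.2f}'.format(stats.skew(positive_sales)))
print('Skewness of log(Weekly_Sales): {:.2f}'.format(stats.skew(np.log(positive_sales))))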
Features
Now let’s move on to the features dataframe, which has 12 fields (columns) and 8,190 observations (rows). Note that, again, we manually changed the data types of the ‘Store’ and ‘Date’ columns when we imported the data.
print(df_features.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8190 entries, 0 to 8189
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 8190 non-null category
1 Date 8190 non-null datetime64[ns]
2 Temperature 8190 non-null float64
3 Fuel_Price 8190 non-null float64
4 MarkDown1 4032 non-null float64
5 MarkDown2 2921 non-null float64
6 MarkDown3 3613 non-null float64
7 MarkDown4 3464 non-null float64
8 MarkDown5 4050 non-null float64
9 CPI 7605 non-null float64
10 Unemployment 7605 non-null float64
11 IsHoliday 8190 non-null bool
dtypes: bool(1), category(1), datetime64[ns](1), float64(9)
memory usage: 657.6 KB
None
# inspecting rows containing NaN values
display(df_features[df_features['CPI'].isnull()])
display(df_features[df_features['Unemployment'].isnull()])
| | Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|---|
169 | 1 | 2013-05-03 | 66.66 | 3.386 | 2298.63 | 2.00 | 129.90 | 55.46 | 1301.04 | NaN | NaN | False |
170 | 1 | 2013-05-10 | 63.90 | 3.392 | 4624.61 | 83.88 | 42.38 | 1618.31 | 8144.90 | NaN | NaN | False |
171 | 1 | 2013-05-17 | 69.53 | 3.454 | 16170.50 | 92.00 | 8.32 | 4127.24 | 6206.97 | NaN | NaN | False |
172 | 1 | 2013-05-24 | 77.19 | 3.494 | 7959.89 | 178.00 | 1621.47 | 3152.57 | 2938.70 | NaN | NaN | False |
173 | 1 | 2013-05-31 | 78.02 | 3.466 | 2369.77 | 142.45 | 475.35 | 45.55 | 2056.84 | NaN | NaN | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8185 | 45 | 2013-06-28 | 76.05 | 3.639 | 4842.29 | 975.03 | 3.00 | 2449.97 | 3169.69 | NaN | NaN | False |
8186 | 45 | 2013-07-05 | 77.50 | 3.614 | 9090.48 | 2268.58 | 582.74 | 5797.47 | 1514.93 | NaN | NaN | False |
8187 | 45 | 2013-07-12 | 79.37 | 3.614 | 3789.94 | 1827.31 | 85.72 | 744.84 | 2150.36 | NaN | NaN | False |
8188 | 45 | 2013-07-19 | 82.84 | 3.737 | 2961.49 | 1047.07 | 204.19 | 363.00 | 1059.46 | NaN | NaN | False |
8189 | 45 | 2013-07-26 | 76.06 | 3.804 | 212.02 | 851.73 | 2.06 | 10.88 | 1864.57 | NaN | NaN | False |
585 rows × 12 columns
| | Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|---|
169 | 1 | 2013-05-03 | 66.66 | 3.386 | 2298.63 | 2.00 | 129.90 | 55.46 | 1301.04 | NaN | NaN | False |
170 | 1 | 2013-05-10 | 63.90 | 3.392 | 4624.61 | 83.88 | 42.38 | 1618.31 | 8144.90 | NaN | NaN | False |
171 | 1 | 2013-05-17 | 69.53 | 3.454 | 16170.50 | 92.00 | 8.32 | 4127.24 | 6206.97 | NaN | NaN | False |
172 | 1 | 2013-05-24 | 77.19 | 3.494 | 7959.89 | 178.00 | 1621.47 | 3152.57 | 2938.70 | NaN | NaN | False |
173 | 1 | 2013-05-31 | 78.02 | 3.466 | 2369.77 | 142.45 | 475.35 | 45.55 | 2056.84 | NaN | NaN | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8185 | 45 | 2013-06-28 | 76.05 | 3.639 | 4842.29 | 975.03 | 3.00 | 2449.97 | 3169.69 | NaN | NaN | False |
8186 | 45 | 2013-07-05 | 77.50 | 3.614 | 9090.48 | 2268.58 | 582.74 | 5797.47 | 1514.93 | NaN | NaN | False |
8187 | 45 | 2013-07-12 | 79.37 | 3.614 | 3789.94 | 1827.31 | 85.72 | 744.84 | 2150.36 | NaN | NaN | False |
8188 | 45 | 2013-07-19 | 82.84 | 3.737 | 2961.49 | 1047.07 | 204.19 | 363.00 | 1059.46 | NaN | NaN | False |
8189 | 45 | 2013-07-26 | 76.06 | 3.804 | 212.02 | 851.73 | 2.06 | 10.88 | 1864.57 | NaN | NaN | False |
585 rows × 12 columns
display(df_features.describe().round(2))
for i in range(5):
    print('CAUTION: {:.2%} of MarkDown{} (non-null) is negative.'
          .format(len([n for n in df_features['MarkDown'+str(i+1)] if n < 0])/
                  len(df_features['MarkDown'+str(i+1)].dropna()), i+1))
df_features.drop(['Store', 'Date', 'IsHoliday'], axis=1).hist(figsize=(15,10), bins=100)
plt.show()
| | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment |
---|---|---|---|---|---|---|---|---|---|
count | 8190.00 | 8190.00 | 4032.00 | 2921.00 | 3613.00 | 3464.00 | 4050.00 | 7605.00 | 7605.00 |
mean | 59.36 | 3.41 | 7032.37 | 3384.18 | 1760.10 | 3292.94 | 4132.22 | 172.46 | 7.83 |
std | 18.68 | 0.43 | 9262.75 | 8793.58 | 11276.46 | 6792.33 | 13086.69 | 39.74 | 1.88 |
min | -7.29 | 2.47 | -2781.45 | -265.76 | -179.26 | 0.22 | -185.17 | 126.06 | 3.68 |
25% | 45.90 | 3.04 | 1577.53 | 68.88 | 6.60 | 304.69 | 1440.83 | 132.36 | 6.63 |
50% | 60.71 | 3.51 | 4743.58 | 364.57 | 36.26 | 1176.42 | 2727.14 | 182.76 | 7.81 |
75% | 73.88 | 3.74 | 8923.31 | 2153.35 | 163.15 | 3310.01 | 4832.56 | 213.93 | 8.57 |
max | 101.95 | 4.47 | 103184.98 | 104519.54 | 149483.31 | 67474.85 | 771448.10 | 228.98 | 14.31 |
CAUTION: 0.10% of MarkDown1 (non-null) is negative.
CAUTION: 0.86% of MarkDown2 (non-null) is negative.
CAUTION: 0.36% of MarkDown3 (non-null) is negative.
CAUTION: 0.00% of MarkDown4 (non-null) is negative.
CAUTION: 0.05% of MarkDown5 (non-null) is negative.
fig, axes = plt.subplots(1,2,figsize=(15,5))
sns.lineplot(x='Date', y='CPI', data=df_features[df_features.Store==1], ax=axes[0], label='Store 1')
sns.lineplot(x="Date", y="CPI", hue="Store", data=df_features, legend=False, ax=axes[1])
fig.autofmt_xdate(rotation=-30, ha='left')
plt.show()
display(df_features[['Store', 'Date', 'IsHoliday']].describe().transpose())
fig, axes = plt.subplots(1,2, figsize=(15,5))
sns.countplot(x='IsHoliday', data=df_features, ax=axes[0])
sns.countplot(x='Store', data=df_features, ax=axes[1])
plt.show()
| | count | unique | top | freq | first | last |
---|---|---|---|---|---|---|
Store | 8190 | 45 | 45 | 182 | NaT | NaT |
Date | 8190 | 182 | 2011-07-15 00:00:00 | 45 | 2010-02-05 | 2013-07-26 |
IsHoliday | 8190 | 2 | False | 7605 | NaT | NaT |
We can see from the above tables and plots that:
- Compared to the sales dataframe, there are fewer records (8,190) in the features dataframe. This makes sense, because this dataframe is only detailed to the store level and not to the department level (different departments in the same store probably experience the same weather on the same day).
- There are 182 unique dates from 2010-02-05 to 2013-07-26, which is a wider date range than the sales dataframe.
- The rows with missing values in the ‘CPI’ and ‘Unemployment’ columns do not have corresponding observations in the sales dataframe (they fall outside its date range), which means that removing these rows would not likely affect the integrity of our analysis.
- From the first few rows displayed above, we can see that the dates do match up; however, it would not hurt to check the proportion of matching records (by date and store), especially if we are thinking about merging the two dataframes together (see the sketch after this list).
- The five markdown columns contain ‘anonymous data related to promotional markdowns’, as per Kaggle, and we might want to further inspect these columns (e.g. whether two or more markdowns occurred at any point for the same store). At the same time, these five columns are also extremely right skewed.
- Less than 1% of each of the markdowns contains negative values, and we could consider dropping those observations if we need to take the logarithm of these fields to make them less right skewed.
- The CPI (Consumer Price Index) is steadily increasing for all stores, but different stores have different CPI levels. There also seems to be a bimodal distribution among the CPI observations.
- Approximately 7% of the 8,190 features records are holidays, which is consistent with the sales dataframe.
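Before merging, here is a short sketch (our own addition) that checks the proportion of (Store, Date) pairs in the sales dataframe that also appear in the features dataframe, and how often two or more markdowns are recorded in the same row:
# Proportion of (Store, Date) pairs in sales that have a matching row in features
sales_keys = set(zip(df_sales['Store'], df_sales['Date']))
feature_keys = set(zip(df_features['Store'], df_features['Date']))
print('{:.2%} of (Store, Date) pairs in sales appear in features.'
      .format(len(sales_keys & feature_keys) / len(sales_keys)))
# How often are two or more markdowns recorded for the same store and week?
markdown_cols = ['MarkDown' + str(i) for i in range(1, 6)]
print('{:.2%} of features rows have two or more markdowns recorded.'
      .format((df_features[markdown_cols].notnull().sum(axis=1) >= 2).mean()))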
Stores
Last but not least, let’s take a look at the stores dataframe, which only has three fields (columns) and 45 observations (rows). Again, note that we manually changed the data types of the ‘Store’ and ‘Type’ columns when we imported the data.
display(df_stores.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 45 non-null category
1 Type 45 non-null category
2 Size 45 non-null int64
dtypes: category(2), int64(1)
memory usage: 2.3 KB
None
display(pd.DataFrame(df_stores.drop('Store', axis=1).describe(include='all')).transpose().round(2))
fig, axes = plt.subplots(1,3, figsize=(20,4))
sns.stripplot(x='Type', y='Size', data=df_stores, ax=axes[0])
sns.countplot(x='Type', data=df_stores, ax=axes[1])
sns.distplot(df_stores['Size'], bins=15, kde=False, ax=axes[2])  # note: distplot is deprecated in newer seaborn (histplot is the replacement)
plt.show()
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
Type | 45 | 3 | A | 22 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Size | 45 | NaN | NaN | NaN | 130288 | 63825.3 | 34875 | 70713 | 126512 | 202307 | 219622 |
The stores dataframe appears to be smaller and simpler. We can see that (a short sketch follows this list):
- Almost half of the stores are Type A stores, and there is a correlation between store size and store type, with A being the largest and C being the smallest, on average.
- Sizes of the stores are not extremely skewed, with the mean close to the 50th percentile. However, the sizes of the stores seem to follow a trimodal distribution, with the left and right peaks being higher (less spread out) than the middle one.
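To put numbers behind the size-versus-type observation, here is a minimal sketch (our own addition) summarizing store size by type:
# Summary statistics of store size for each store type
display(df_stores.groupby('Type')['Size'].describe().round(0))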
Aggregation
After exploring each dataset by itself, it is also a good idea to merge the different datasets so that we can visualize any possible relationships that might exist between two or more variables. In our case, we are going to left join the features and stores dataframes to the sales dataframe, because we want to preserve all the sales data (since we are analyzing sales).
df_merge = df_sales.merge(df_features, on=['Date', 'Store'], how='left').merge(df_stores, on='Store', how='left')
df_merge.head()
| | Store | Dept | Date | Weekly_Sales | IsHoliday_x | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday_y | Type | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 |
1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True | A | 151315 |
2 | 1 | 1 | 2010-02-19 | 41595.55 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False | A | 151315 |
3 | 1 | 1 | 2010-02-26 | 19403.54 | False | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False | A | 151315 |
4 | 1 | 1 | 2010-03-05 | 21827.90 | False | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False | A | 151315 |
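Note that the merge produces two holiday flags, IsHoliday_x (from sales) and IsHoliday_y (from features). A quick, optional check (our own addition) confirms whether the two columns agree before deciding to keep just one of them:
# Do the two holiday flags from the merge agree on every row?
print('Holiday flags agree on every row:',
      (df_merge['IsHoliday_x'] == df_merge['IsHoliday_y']).all())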
Scatter plots are helpful for visualizing any linear or non-linear relationships between two variables. In our case, we suspect that, other than time, CPI, Store Size, Unemployment Rate, and Store Type could have potential impacts on Sales. After plotting each of them against Weekly Sales (see the four figures below), we can see that:
- There seems to be a slightly negative relationship between CPI and Sales.
- There also seems to be a slightly positive relationship between Size and Sales.
- Unemployment, on the other hand, does not seem to affect sales much.
- Type A stores seem to have the highest sales, whereas Type C stores seem to have the lowest.
Some of this information might provide little to no help in the end. Some of it might be repetitive; for example, Store Type, Size, and Sales seem to be correlated with one another. However, it is important to keep these potential relationships in mind when we move on to building our models in Part 2.
fig, axes = plt.subplots(2, 2, figsize=(15,10))
sns.scatterplot(x='CPI', y='Weekly_Sales', alpha=0.2, data=df_merge, ax=axes[0,0])
sns.scatterplot(x='Size', y='Weekly_Sales', alpha=0.2, data=df_merge, ax=axes[0,1])
sns.scatterplot(x='Unemployment', y='Weekly_Sales', alpha=0.2, data=df_merge, ax=axes[1,0])
sns.stripplot(x='Type', y='Weekly_Sales', jitter=0.4, alpha=0.1, data=df_merge, ax=axes[1,1])
plt.show()
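Since Store Type, Size, and Sales appear related, a short sketch (our own addition) quantifies the pairwise correlations among the numeric columns we just plotted:
# Pairwise correlations between Weekly_Sales and the candidate drivers
corr_cols = ['Weekly_Sales', 'CPI', 'Size', 'Unemployment', 'Temperature', 'Fuel_Price']
display(df_merge[corr_cols].corr().round(2))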
At the beginning of this Notebook, we saw (by randomly inspecting different combinations) that seasonalities and trends vary across stores and departments. Some stores/departments have consistent sales over time, while others see huge spikes either in the middle or at the end of each year. There are also increasing or decreasing trends in certain stores/departments.
Even though it is nice to be able to see detailed sales performance for each individual department in every store, we also want to inspect whether or not there are shared patterns in our sales data. Therefore, let’s reorganize the data so that we can visualize and compare sales across different departments within the same store (first grid below), and then across different stores for the same department (second grid). From the two visualizations, we can see that the data look a lot cleaner and more organized when grouped by department than when grouped by store. This again reinforces our previous finding that certain departments are significantly skewing the sales of the entire store.
fig, ax = plt.subplots(9, 5, sharex='col', sharey='row', figsize=(20, 25), dpi=100)
plt.setp(ax, ylim=(min(df_sales['Weekly_Sales']), max(df_sales['Weekly_Sales'])))
fig.autofmt_xdate(rotation=-30, ha='left')
store_num = pd.Series(range(1,46), dtype='category')
store_index = 0
for i in range(9):
    for j in range(5):
        sales_temp = df_sales[df_sales.Store==store_num[store_index]][['Date', 'Dept', 'Weekly_Sales']]
        sales_temp['Dept'] = sales_temp['Dept'].astype('category')
        sns.lineplot(x='Date', y='Weekly_Sales', hue='Dept', data=sales_temp, legend=False, ax=ax[i,j])
        ax[i,j].text(min(sales_temp['Date']), max(df_sales['Weekly_Sales'])-100000,
                     'Store {}'.format(store_index+1), fontsize=12)
        store_index += 1
plt.show()
fig, ax = plt.subplots(20, 5, sharex='col', sharey='row', figsize=(20, 50), dpi=100)
plt.setp(ax, ylim=(min(df_sales['Weekly_Sales']), max(df_sales['Weekly_Sales'])))
fig.autofmt_xdate(rotation=-30, ha='left')
dept_num = pd.Series(range(1,101), dtype='category')
dept_index = 0
for i in range(20):
    for j in range(5):
        try:
            sales_temp = df_sales[df_sales.Dept==dept_num[dept_index]][['Date', 'Store', 'Weekly_Sales']]
            sales_temp['Store'] = sales_temp['Store'].astype('category')
            sns.lineplot(x='Date', y='Weekly_Sales', hue='Store', data=sales_temp, legend=False, ax=ax[i,j])
            ax[i,j].text(min(sales_temp['Date']), max(df_sales['Weekly_Sales'])-100000,
                         'Department {}'.format(dept_index+1), fontsize=12)
            dept_index += 1
        except:
            # not every department number from 1 to 100 has sales records; skip empty panels
            dept_index += 1
plt.show()
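As a compact complement to the grids above, here is a small sketch (our own addition) that sums weekly sales across all stores for a few example departments; the department numbers are arbitrary and chosen only for illustration:
# Total weekly sales across all stores for a few example departments
fig, ax = plt.subplots(figsize=(12, 4))
for dept in [1, 2, 3]:  # arbitrary example departments
    df_sales[df_sales.Dept == dept].groupby('Date')['Weekly_Sales'].sum().plot(ax=ax, label='Dept {}'.format(dept))
ax.set_ylabel('Total Weekly_Sales')
ax.legend()
plt.show()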