RFM Analysis

Author: Kemjika Ananaba

Customer analytics are essential for running a successful business. Sales and marketing resources are finite and expensive; therefore, it is important to answer the following questions when developing a marketing strategy:

  • Who are our most/least valuable customers?
  • How can we acquire new customers who resemble our most valuable customers?

In this Project, we focus on RFM (Recency, Frequency, and Monetary value) analysis. This type of analysis is used when historical data for existing customers are available. We encourage you to create your own Jupyter Notebook and follow along. You can also download this Notebook together with any accompanying data in the Notebooks and Data GitHub Repository. Alternatively, if you do not have Python or Jupyter Notebook installed yet, you may experiment with a virtual Notebook by launching Binder or Syzygy below (learn more about these two tools in the Resource tab).

Launch Syzygy (UBC)

Launch Syzygy (Google)

Launch Binder

Background

RFM analysis identifies existing customers who are most likely to remain loyal to a company or respond to a new product. Among other things, it allows us to create customized and personalized messaging: by streamlining the messages sent to a specific customer and sending only the types of messages that are relevant to them, a business reduces the chance of dissatisfaction or annoyance and increases customer satisfaction. RFM is a model based on historical data, and it helps forecast future behaviour from past interactions. The RFM model of customer value uses proven marketing principles to help businesses differentiate between marketing to existing customers and marketing to new users, allowing them to create relevant and personalized messaging by understanding user behaviour. The model segments a business's users according to three criteria based on an existing customer's transaction history, namely:

  1. The Recency of the customer’s last purchase
  2. The Frequency of the customer’s purchases
  3. The total amount spent, which is referred to as the customer’s Monetary value

The RFM model, when used in conjunction with traditional models of segmentation, can help businesses visualize new and existing customers differently, and thus create favourable conditions to maximize customer lifetime value.

Business Problem

How do we segment our customers to reach them effectively through targeted marketing?

The Pareto principle suggests that roughly 80% of a company's revenue often comes from 20% of its customers. RFM analysis helps determine who those top 20% of customers are and groups the remaining customers into other segments, such as price-sensitive, potentially loyal, and brand-sensitive. This ensures that each customer is better targeted, helps convert customers in the other segments into higher-value segments, and increases customer lifetime value.

In this Project, the customer sales data of an e-commerce store from January 2011 to May 2011 are analyzed to group customers from the United Kingdom into segments using the RFM model. The relevant libraries and the provided dataset are imported into the Python environment below.

Dataset

There are eight variables in the dataset:

  1. InvoiceNo: invoice number, a 6-digit integral number uniquely assigned to each transaction. If the number starts with the letter "C", the transaction is a cancellation.
  2. StockCode: product code, a 5-digit integral number uniquely assigned to each distinct product.
  3. Description: product name.
  4. Quantity: the quantity of each product (item) per transaction.
  5. InvoiceDate: invoice date and time, the date and time when each transaction was generated.
  6. UnitPrice: unit price, a numeric value of the product price per unit, in sterling.
  7. CustomerID: customer number, a 5-digit integral number uniquely assigned to each customer.
  8. Country: the name of the country where the customer resides.

import pandas as pd
import numpy as np
import seaborn as sns
from datetime import timedelta
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# load data
url = 'https://raw.githubusercontent.com/Master-of-Business-Analytics/Project_06_RFM_Analysis/master/nb0006_data.csv'
df = pd.read_csv(url, on_bad_lines='skip', encoding='unicode_escape')  # on_bad_lines replaces the deprecated error_bad_lines argument

df.head(5)

   InvoiceNo StockCode                     Description  Quantity      InvoiceDate  UnitPrice  CustomerID         Country
0     540558     21258                               ?       -29  1/10/2011 10:04       0.00         NaN  United Kingdom
1     540560     21589                             NaN       -14  1/10/2011 10:08       0.00         NaN  United Kingdom
2     540561     22343  PARTY PIZZA DISH RED RETROSPOT        24  1/10/2011 10:32       0.21     13004.0  United Kingdom
3     540561     22344  PARTY PIZZA DISH PINK POLKADOT        24  1/10/2011 10:32       0.21     13004.0  United Kingdom
4     540561    47504H       ENGLISH ROSE SPIRIT LEVEL        24  1/10/2011 10:32       0.85     13004.0  United Kingdom

Data Cleaning

The dataset seems relatively clean at first glance, but it is riddled with null values. The code below summarizes the dataset, identifies the NA values, and removes the rows that contain them; a column with the total purchase amount for each transaction is added later, once the obviously erroneous rows have been removed.

There are several methods of handling NA values in a dataset, such as replacing the NA value with another value or deleting the row entirely. The choice between these methods depends on what data have already been manipulated and how much data are available.
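As a brief illustration of these two approaches (this Project drops the incomplete rows further below), here is a sketch in pandas using the columns described above; the 'UNKNOWN ITEM' placeholder is purely illustrative:

# Option 1: drop every row that is missing a CustomerID
df_dropped = df.dropna(subset=['CustomerID'])

# Option 2: keep the rows and replace the missing value instead,
# e.g. fill missing product descriptions with a placeholder label
df_filled = df.copy()
df_filled['Description'] = df_filled['Description'].fillna('UNKNOWN ITEM')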

df.describe()

            Quantity      UnitPrice     CustomerID
count  290603.000000  290603.000000  214539.000000
mean        8.363265       4.302626   15528.122756
std       292.791363      72.531652    1592.760180
min    -80995.000000       0.000000   12346.000000
25%         1.000000       1.250000   14159.000000
50%         3.000000       2.100000   15493.000000
75%        10.000000       4.130000   16923.000000
max     80995.000000   17836.460000   18287.000000

In any event, it is good practice to sift out erroneous transactions, particularly when they result in extreme outliers. In this section, the large transactions are scanned to clean the data. More information on data cleaning can be found in our Data Cleaning Project.

# information about the data set
print('{:,} rows; {:,} columns'
      .format(df.shape[0], df.shape[1]))
print('{:,} transactions don\'t have a customer id'
      .format(df[df.CustomerID.isnull()].shape[0]))
print('Transactions timeframe from {} to {}'.format(df['InvoiceDate'].min(),
                                    df['InvoiceDate'].max()))

# Eliminate rows that contain missing values (notably the missing Customer IDs)

df = df[df.notnull().all(axis=1)]

290,603 rows; 8 columns
76,064 transactions don't have a customer id
Transactions timeframe from 1/10/2011 10:04 to 5/20/2011 14:06

print('{:,} rows; {:,} columns'
      .format(df.shape[0], df.shape[1]))

214,539 rows; 8 columns

# remove the rows with the largest and smallest Quantity values (extreme outliers)
df.drop(index=[df['Quantity'].idxmax(), df['Quantity'].idxmin()], inplace=True)
df.describe()

            Quantity      UnitPrice     CustomerID
count  214537.000000  214537.000000  214537.000000
mean       10.561642       3.048388   15528.114199
std       232.168351      14.165101    1592.765139
min    -74215.000000       0.000000   12346.000000
25%         2.000000       1.250000   14159.000000
50%         4.000000       1.950000   15493.000000
75%        12.000000       3.750000   16923.000000
max     74215.000000    3155.950000   18287.000000

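# repeat the drop to remove the next pair of extreme Quantity values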
df.drop(index=[df['Quantity'].idxmax(), df['Quantity'].idxmin()], inplace=True)
df.describe()

            Quantity      UnitPrice     CustomerID
count  214535.000000  214535.000000  214535.000000
mean       10.561741       3.048407   15528.143865
std        50.550843      14.165166    1592.742929
min     -3114.000000       0.000000   12747.000000
25%         2.000000       1.250000   14159.000000
50%         4.000000       1.950000   15493.000000
75%        12.000000       3.750000   16923.000000
max     12540.000000    3155.950000   18287.000000

Note that the minimum quantity and maximum quantity have different absolute values.

df = df[df['UnitPrice']>0]
df.describe()

            Quantity      UnitPrice     CustomerID
count  214516.000000  214516.000000  214516.000000
mean       10.503361       3.048677   15528.224948
std        42.705725      14.165764    1592.710040
min     -3114.000000       0.001000   12747.000000
25%         2.000000       1.250000   14159.000000
50%         4.000000       1.950000   15493.000000
75%        12.000000       3.750000   16923.000000
max      4800.000000    3155.950000   18287.000000

# Convert InvoiceDate from object to datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

#creating a total sales column
df["Totalsum"] = df['Quantity'] * df['UnitPrice']
#Check the remaining large negative transactions
df[df.Totalsum<0].sort_values('Totalsum').head(10)

       InvoiceNo StockCode                         Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country  Totalsum
257377   C550456     21108  FAIRY CAKE FLANNEL ASSORTED COLOUR     -3114  2011-04-18 13:08:00       2.10     15749.0  United Kingdom  -6539.40
257375   C550456    85123A  WHITE HANGING HEART T-LIGHT HOLDER     -1930  2011-04-18 13:08:00       2.55     15749.0  United Kingdom  -4921.50
257373   C550456     48185                  DOORMAT FAIRY CAKE      -670  2011-04-18 13:08:00       6.75     15749.0  United Kingdom  -4522.50
257376   C550456     21175         GIN + TONIC DIET METAL SIGN     -2000  2011-04-18 13:08:00       1.85     15749.0  United Kingdom  -3700.00
257374   C550456    47566B              TEA TIME PARTY BUNTING     -1300  2011-04-18 13:08:00       2.55     15749.0  United Kingdom  -3315.00
275098   C549452         D                            Discount        -1  2011-04-08 14:17:00    1867.86     17940.0  United Kingdom  -1867.86
35080    C570556     22273                FELTCRAFT DOLL MOLLY      -720  2011-10-11 11:10:00       2.55     16029.0  United Kingdom  -1836.00
35074    C570556     22273                FELTCRAFT DOLL MOLLY      -720  2011-10-11 11:10:00       2.30     16029.0  United Kingdom  -1656.00
69265    C569387     23284       DOORMAT KEEP CALM AND COME IN      -200  2011-10-03 16:49:00       7.08     14031.0  United Kingdom  -1416.00
35081    C570556     20971    PINK BLUE FELT CRAFT TRINKET BOX     -1296  2011-10-11 11:10:00       1.06     16029.0  United Kingdom  -1373.76

# Customer 15749 is in the top 10
df[df.CustomerID==15749].sort_values('Totalsum').head(10)

       InvoiceNo StockCode                         Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country  Totalsum
257377   C550456     21108  FAIRY CAKE FLANNEL ASSORTED COLOUR     -3114  2011-04-18 13:08:00       2.10     15749.0  United Kingdom   -6539.4
257375   C550456    85123A  WHITE HANGING HEART T-LIGHT HOLDER     -1930  2011-04-18 13:08:00       2.55     15749.0  United Kingdom   -4921.5
257373   C550456     48185                  DOORMAT FAIRY CAKE      -670  2011-04-18 13:08:00       6.75     15749.0  United Kingdom   -4522.5
257376   C550456     21175         GIN + TONIC DIET METAL SIGN     -2000  2011-04-18 13:08:00       1.85     15749.0  United Kingdom   -3700.0
257374   C550456    47566B              TEA TIME PARTY BUNTING     -1300  2011-04-18 13:08:00       2.55     15749.0  United Kingdom   -3315.0
257775    550461    47556B                 TEA TIME TEA TOWELS      1300  2011-04-18 13:20:00       2.10     15749.0  United Kingdom    2730.0
2204      540818    47556B                 TEA TIME TEA TOWELS      1300  2011-01-11 12:57:00       2.55     15749.0  United Kingdom    3315.0
257777    550461     21175         GIN + TONIC DIET METAL SIGN      2000  2011-04-18 13:20:00       1.69     15749.0  United Kingdom    3380.0
2143      540815     21175         GIN + TONIC DIET METAL SIGN      2000  2011-01-11 12:55:00       1.85     15749.0  United Kingdom    3700.0
257776    550461     48185                  DOORMAT FAIRY CAKE       670  2011-04-18 13:20:00       6.35     15749.0  United Kingdom    4254.5

# removing cancelled orders and their offsetting purchases
# index 257377, 257778, 257375, 257373, 2142, 2205, 2143, 257374, 257376, 2204 are all cancellations or their matching lines

df.drop(index=[257377,257778, 257375, 257373,2142,2205,2143,257374,257376,2204], inplace=True)
# customer 16029
df[df.CustomerID==16029].sort_values('InvoiceDate').head(10)

      InvoiceNo StockCode                     Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country  Totalsum
25750   C540030     22070  SMALL RED RETROSPOT MUG IN BOX       -24  2011-01-04 13:47:00       3.75     16029.0  United Kingdom    -90.00
25749    540029     22207           FRYING PAN UNION FLAG        24  2011-01-04 13:47:00       3.75     16029.0  United Kingdom     90.00
5679     541105    85099F            JUMBO BAG STRAWBERRY       200  2011-01-13 14:34:00       1.65     16029.0  United Kingdom    330.00
5680     541105     22386         JUMBO BAG PINK POLKADOT       200  2011-01-13 14:34:00       1.65     16029.0  United Kingdom    330.00
13090    541610     84880           WHITE WIRE EGG HOLDER        17  2011-01-20 10:18:00       4.95     16029.0  United Kingdom     84.15
13091    541610     22207           FRYING PAN UNION FLAG        48  2011-01-20 10:18:00       3.75     16029.0  United Kingdom    180.00
13092    541610     21928  JUMBO BAG SCANDINAVIAN PAISLEY       100  2011-01-20 10:18:00       1.65     16029.0  United Kingdom    165.00
13093    541611     79321                   CHILLI LIGHTS        96  2011-01-20 10:18:00       3.82     16029.0  United Kingdom    366.72
13094    541611     22779       WOODEN OWLS LIGHT GARLAND        48  2011-01-20 10:18:00       3.37     16029.0  United Kingdom    161.76
19493    541992    85099B         JUMBO BAG RED RETROSPOT       100  2011-01-25 09:31:00       1.65     16029.0  United Kingdom    165.00

# removing the remaining offsetting rows with large, equal absolute values
df.drop(index=[35090,35080,35074,83508,35077,83505,35075,35084,35076,83502,35089,35071,35072,35032,35088,35026,83471,35082,35079,35073,83469,35087], inplace=True)


df.describe()

            Quantity      UnitPrice     CustomerID       Totalsum
count  214484.000000  214484.000000  214484.000000  214484.000000
mean       10.504928       3.048726   15528.163290      17.793502
std        40.050825      14.166812    1592.820059      62.596212
min     -1350.000000       0.001000   12747.000000   -1867.860000
25%         2.000000       1.250000   14159.000000       3.750000
50%         4.000000       1.950000   15493.000000       9.950000
75%        12.000000       3.750000   16923.000000      17.700000
max      4800.000000    3155.950000   18287.000000    6539.400000

The customer records are now grouped by the recency of the purchases, the frequency of the purchases, and their monetary value.

Recency

To calculate recency, a snapshot date is created, set to one day after the most recent invoice date in the dataset. The difference between this snapshot date and each customer's most recent invoice date gives the recency of their last purchase.

# Create snapshot date
snapshot_date = df['InvoiceDate'].max() + timedelta(days=1)  # set current date as most recent date plus one day
rec_df = snapshot_date-df.groupby("CustomerID").agg({"InvoiceDate":max})
#group customers by their most recent purchase

rec_df.rename(columns={"InvoiceDate": "Recency"}, inplace=True)#rename column
rec_df=rec_df["Recency"].apply(lambda x: x.days) #extracting number of days only
rec_df.head(5)
CustomerID
12747.0      2
12748.0      1
12749.0      4
12820.0      3
12823.0    255
Name: Recency, dtype: int64

Frequency

To calculate frequency, the number of invoice entries recorded for each customer is counted.

freq_df = df.groupby("CustomerID").agg({'InvoiceNo': 'count'})
freq_df.rename(columns={"InvoiceNo": "Frequency"}, inplace=True)#rename column
freq_df.head()

            Frequency
CustomerID
12747.0            50
12748.0          2729
12749.0           114
12820.0            45
12823.0             3
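Note that 'count' counts every invoice line a customer generated; if the intention is the number of distinct invoices instead, 'nunique' can be used. A small sketch of that alternative (not used in the rest of this Project):

# Alternative: count distinct invoice numbers rather than invoice lines
freq_unique = df.groupby("CustomerID").agg({'InvoiceNo': 'nunique'})
freq_unique.rename(columns={"InvoiceNo": "Frequency"}, inplace=True)
freq_unique.head()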

Monetary

To calculate the monetary value, each customer's purchases are summed.

mon_df = df.groupby("CustomerID").agg({'Totalsum': 'sum'})
mon_df.rename(columns={"Totalsum": "Monetary"}, inplace=True)#rename column
mon_df.head(5)

            Monetary
CustomerID
12747.0      2040.43
12748.0     15727.57
12749.0      2194.75
12820.0       724.57
12823.0       994.50

RFM Model

rfm=pd.concat([rec_df,freq_df, mon_df], axis=1) #combining all scores
rfm.head(7)

            Recency  Frequency  Monetary
CustomerID
12747.0           2         50   2040.43
12748.0           1       2729  15727.57
12749.0           4        114   2194.75
12820.0           3         45    724.57
12823.0         255          3    994.50
12824.0          60         25    397.12
12826.0           3         62    964.20
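As an aside, the same three columns can be built in a single groupby call. The sketch below (using the df and snapshot_date defined earlier) is equivalent to concatenating the three separate tables above:

# Build Recency, Frequency, and Monetary in one pass
rfm_alt = df.groupby("CustomerID").agg(
    Recency=('InvoiceDate', lambda x: (snapshot_date - x.max()).days),
    Frequency=('InvoiceNo', 'count'),
    Monetary=('Totalsum', 'sum')
)
rfm_alt.head()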

Creating Rank Scores

The customers are ranked on each of the three criteria (R, F, and M). Each criterion is broken into five buckets, and customer scores are assigned based on the relative percentile of each RFM feature, with each bucket representing 20% of the distribution. This approach of scoring customers from 1 to 5 results in, at most, 125 different RFM scores (5x5x5), ranging from 111 (lowest) to 555 (highest).

# -- Calculate R, F, and M groups --

# Create score labels for Recency, Frequency, and Monetary value
r_labels = range(5, 0, -1)  # [5, 4, 3, 2, 1] -- the most recent purchases get the highest score
f_labels = range(1, 6)      # [1, 2, 3, 4, 5]
m_labels = range(1, 6)      # [1, 2, 3, 4, 5]

# Assign these labels to five equal percentile groups (quintiles)
r_groups = pd.qcut(rfm['Recency'], q=5, labels=r_labels)
f_groups = pd.qcut(rfm['Frequency'], q=5, labels=f_labels)
m_groups = pd.qcut(rfm['Monetary'], q=5, labels=m_labels)

# Create new columns R_score, F_score and M_score
rfm = rfm.assign(R_score=r_groups.values, F_score=f_groups.values, M_score=m_groups.values)
rfm.head()

            Recency  Frequency  Monetary R_score F_score M_score
CustomerID
12747.0           2         50   2040.43       5       4       5
12748.0           1       2729  15727.57       5       5       5
12749.0           4        114   2194.75       5       5       5
12820.0           3         45    724.57       5       3       4
12823.0         255          3    994.50       1       1       4

RFM Score

There are many ways to calculate the aggregate RFM score. Most companies allocate different weights across the three criteria. For example, a customer with R_score=1, F_score=2, and M_score=3 could have the following possible RFM scores:

  1. 100 + 20 + 3 = 123 if weights of 100, 10, and 1 are allocated to the three criteria, respectively.
  2. 1 + 2 + 3 = 6 if each of the three criteria is given an equal weight of 1.
  3. (1 + 2 + 3)/3 = 2 if we take the average of the three criteria.

Depending on the nature of the business, we might increase or decrease the relative importance (the weight) of the three criteria variables to arrive at the final RFM score. In this Project, we continue forward by simply calculating the customer’s aggregate RFM score as the average of the three R, F, and M scores. This calculation method will provide an aggregate RFM score from 1 (lowest) to 5 (highest).
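For reference, the weighted variant from option 1 above could be computed as follows; this is only a sketch (the column name Weighted_score is illustrative), and the rest of this Project uses the simple average instead:

# Weighted RFM score: weights of 100, 10, and 1 for R, F, and M respectively
rfm['Weighted_score'] = (rfm['R_score'].astype(int) * 100
                         + rfm['F_score'].astype(int) * 10
                         + rfm['M_score'].astype(int))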

# create average RFM score column (convert the categorical scores to integers before averaging)
rfm['Ave_Score'] = rfm[['R_score','F_score','M_score']].astype(int).mean(axis=1)
# create a concatenated RFM score by combining the three scores
rfm['RFM score'] = rfm.R_score.astype(str) \
                            + rfm.F_score.astype(str) \
                            + rfm.M_score.astype(str)
rfm.head()

            Recency  Frequency  Monetary R_score F_score M_score  Ave_Score RFM score
CustomerID
12747.0           2         50   2040.43       5       4       5   4.666667       545
12748.0           1       2729  15727.57       5       5       5   5.000000       555
12749.0           4        114   2194.75       5       5       5   5.000000       555
12820.0           3         45    724.57       5       3       4   4.000000       534
12823.0         255          3    994.50       1       1       4   2.000000       114

Customer Classification Based on RFM Analysis

There are many ways we can use the RFM scores to create customer segments. The method used depends on the objective of the analysis. One way of classifying customers is by creating segments based on the average RFM score. Here, we adopt some common segment names for these average-score bands.

Customer Segment      Average Score
Can't Lose Them       5
Champions             4-5
Loyal                 3-4
Needs Attention       2-3
Require Activation    0-2

# Define rfm_level function
def rfm_level(df):
    if df['Ave_Score'] >= 5:
        return 'Can\'t Lose Them'
    elif df['Ave_Score'] >= 4:
        return 'Champions'
    elif df['Ave_Score'] >= 3:
        return 'Loyal'
    elif df['Ave_Score'] >= 2:
        return 'Needs Attention'
    else:
        return 'Require Activation'

# Create a new variable RFM_Level by applying function to table

rfm['RFM_Level'] = rfm.apply(rfm_level, axis=1)
# Print the header with top 5 rows to the console
rfm.head()

            Recency  Frequency  Monetary R_score F_score M_score  Ave_Score RFM score         RFM_Level
CustomerID
12747.0           2         50   2040.43       5       4       5   4.666667       545         Champions
12748.0           1       2729  15727.57       5       5       5   5.000000       555   Can't Lose Them
12749.0           4        114   2194.75       5       5       5   5.000000       555   Can't Lose Them
12820.0           3         45    724.57       5       3       4   4.000000       534         Champions
12823.0         255          3    994.50       1       1       4   2.000000       114   Needs Attention

Calculating the Size of Each Segment

# Calculate average values for each RFM_Level, and return a size of each segment
rfm_level_agg = rfm.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': ['mean', 'count']
}).round(1)


rfm_level_agg

                   Recency Frequency Monetary
                      mean      mean     mean count
RFM_Level
Can't Lose Them        4.0     290.7   6228.5   220
Champions             19.1     119.3   1954.5   691
Loyal                 58.6      50.8    784.6   810
Needs Attention      113.1      22.5    388.5   918
Require Activation   215.4       9.9    163.1   634

# visualizing customer segments based on average scores
rfm_level_agg['Monetary']['count'].plot(kind="barh", use_index=True, color=['red', 'orange', 'yellow', 'green', 'gold'])
plt.title("Customer Segments based on average scores")
plt.xlabel("Number of customers")
plt.ylabel("Segments")
plt.show()

[Figure: horizontal bar chart showing the number of customers in each RFM segment]

rfm_level_agg['Monetary']['count']
RFM_Level
Can't Lose Them       220
Champions             691
Loyal                 810
Needs Attention       918
Require Activation    634
Name: count, dtype: int64

Recall that the customer segmentation method above assigns equal weights to a customer's recency, frequency, and monetary scores. This method of segmentation is commonly used in practice, but the weights can be adjusted to reflect what matters most to a particular business.

RFM scores are used in practice to support a range of marketing strategies. For example, a business may choose to send specific types of communication or promotions based on the RFM segment in which customers appear (a short selection sketch follows this list):

  1. Best Customers: this group consists of customers with R_score=5, F_score=5, and M_score=5, meaning that they transacted recently, do so often, and spend more than other customers. Their aggregate RFM score is 5. Earlier, we referred to this customer segment as "Can't Lose Them". A shortened notation for this segment is 5-5-5.

  2. High-Spending New Customers: this group consists of customers in segments 5-1-5 and 5-1-4. These are customers who transacted only once, but relatively recently, and they spent a lot.

  3. Lowest-Spending Active Loyal Customers: this group consists of customers in segments 5-5-1 and 5-5-2. These are customers who transacted recently and do so often, but spend the least.

  4. Churned Best Customers: this group consists of customers in segments 2-5-5, 1-5-5, 1-4-4, and 1-4-5. These are customers who transacted frequently and spent a lot, but it has been a long time since they last transacted.
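The sketch below shows one way to pull such segments out of the rfm table built above, using the concatenated 'RFM score' column; the segment names and score combinations are taken from the examples in the list:

# Select the "Churned Best Customers" segment: frequent, high-spending customers who have not purchased recently
churned_best = rfm[rfm['RFM score'].isin(['255', '155', '144', '145'])]

# Select the "High-Spending New Customers" segment
high_spending_new = rfm[rfm['RFM score'].isin(['515', '514'])]

print('{} churned best customers; {} high-spending new customers'
      .format(len(churned_best), len(high_spending_new)))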

Final Conclusion

RFM is a relatively simple technique that can drastically improve a company’s marketing performance. RFM analysis helps organizations find answers to the following questions:

  • Who are your best customers?
  • Which of your customers could be contributing to your churn rate?
  • Which of your customers have the potential to become valuable customers?
  • Which of your customers can be retained?
  • Which of your customers are most likely to respond to engagement campaigns?

However, there are some shortcomings with traditional RFM analysis, such as:

  • Computationally awkward and labour-intensive.
  • Ignores within-quintile variance.
  • The ordering of the variables is arbitrary: which one do you sort by first?
  • Some variables may have too few levels.
  • Correlations between the variables create problems: for example, the first recency quintile may contain higher-frequency customers.
  • Does not generalize well to new customers: for example, how should a new customer be assigned to a quintile?

As a result, logistic regression is usually preferred to RFM analysis for predicting customer behaviour.
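As a rough illustration of that alternative, the R, F, and M values can be used directly as features in a logistic regression. The sketch below is purely hypothetical: it assumes a binary column named responded (1 if the customer responded to a past campaign, 0 otherwise), which does not exist in this dataset, and it requires scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = rfm[['Recency', 'Frequency', 'Monetary']]
y = rfm['responded']  # hypothetical response label (not part of the Online Retail data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Hold-out accuracy: {:.2f}'.format(model.score(X_test, y_test)))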
