In this project, we perform an RFM (Recency, Frequency, Monetary) analysis of e-commerce customer data from 2016 to 2018, provided by Olist, a Brazilian e-commerce company.
Our objective is to derive insight from customers’ spending habits, such as how long ago their last purchase was, how much they spent, and how often they purchase, and to group customers into categories.
We do not yet know what kinds of customers we may find, but based on our own purchasing habits, we can hypothesize a few types: consumers who consistently spend at a moderate level (loyal customers), consumers who commit to one large purchase from time to time (periodic large spenders), and customers who bought a single item long ago and have never come back (lost customers).
Let’s first import all of our data.
customers <- read.csv("olist_customers_dataset.csv")
geolocation <- read.csv("olist_geolocation_dataset.csv")
order_items <- read.csv("olist_order_items_dataset.csv")
order_payments <- read.csv("olist_order_payments_dataset.csv")
order_reviews <- read.csv("olist_order_reviews_dataset.csv")
orders <- read.csv("olist_orders_dataset.csv")
products <- read.csv("olist_products_dataset.csv")
sellers <- read.csv("olist_sellers_dataset.csv")
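As a quick sanity check that the files loaded (a minimal sketch; the exact counts depend on the CSVs in your working directory), we can look at the row counts of the tables used below:
# Row counts of the tables this analysis actually uses
sapply(list(customers = customers, orders = orders,
            order_payments = order_payments), nrow)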
Now we import all of our required libraries.
# install.packages('tidyverse')
# install.packages('lubridate')
# install.packages('hms')
# install.packages('scales')
# install.packages('dplyr')
# install.packages('ggplot2')
# install.packages('cluster')
# install.packages('factoextra')
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(hms)
##
## Attaching package: 'hms'
##
## The following object is masked from 'package:lubridate':
##
## hms
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
To get a better understanding of the data given, let’s take a look at the schema provided:
As we can see, we have lots of data to work with. For our purpose of customer segmentation, however, we will focus on olist_customers_dataset and olist_orders_dataset, which are linked by customer_id, plus olist_order_payments_dataset (linked by order_id) for the payment values. We wish to pivot our RFM analysis on specific customers.
So, we combine these three files:
orders_customers <- merge(orders, customers, by = "customer_id")
data <- merge(orders_customers, order_payments, by = "order_id")
head(data)
## order_id customer_id
## 1 00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a
## 2 00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce
## 3 000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3
## 4 00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06
## 5 00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795
## 6 00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742
## order_status order_purchase_timestamp order_approved_at
## 1 delivered 2017-09-13 08:59:02 2017-09-13 09:45:35
## 2 delivered 2017-04-26 10:53:06 2017-04-26 11:05:13
## 3 delivered 2018-01-14 14:33:31 2018-01-14 14:48:30
## 4 delivered 2018-08-08 10:00:35 2018-08-08 10:10:18
## 5 delivered 2017-02-04 13:57:51 2017-02-04 14:10:13
## 6 delivered 2017-05-15 21:42:34 2017-05-17 03:55:27
## order_delivered_carrier_date order_delivered_customer_date
## 1 2017-09-19 18:34:16 2017-09-20 23:43:48
## 2 2017-05-04 14:35:00 2017-05-12 16:04:24
## 3 2018-01-16 12:36:48 2018-01-22 13:19:16
## 4 2018-08-10 13:28:00 2018-08-14 13:32:39
## 5 2017-02-16 09:46:09 2017-03-01 16:42:31
## 6 2017-05-17 11:05:55 2017-05-22 13:44:35
## order_estimated_delivery_date customer_unique_id
## 1 2017-09-29 00:00:00 871766c5855e863f6eccc05f988b23cb
## 2 2017-05-15 00:00:00 eb28e67c4c0b83846050ddfb8a35d051
## 3 2018-02-05 00:00:00 3818d81c6709e39d06b2738a8d3a2474
## 4 2018-08-20 00:00:00 af861d436cfc08b2c2ddefd0ba074622
## 5 2017-03-17 00:00:00 64b576fb70d441e8f1b2d7d446e483c5
## 6 2017-06-06 00:00:00 85c835d128beae5b4ce8602c491bf385
## customer_zip_code_prefix customer_city customer_state
## 1 28013 campos dos goytacazes RJ
## 2 15775 santa fe do sul SP
## 3 35661 para de minas MG
## 4 12952 atibaia SP
## 5 13226 varzea paulista SP
## 6 38017 uberaba MG
## payment_sequential payment_type payment_installments payment_value
## 1 1 credit_card 2 72.19
## 2 1 credit_card 3 259.83
## 3 1 credit_card 5 216.87
## 4 1 credit_card 2 25.78
## 5 1 credit_card 3 218.04
## 6 1 boleto 1 34.59
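One subtlety in the Olist schema is worth flagging before the analysis: customer_id is unique per order, while customer_unique_id identifies the same person across orders. This is why the RFM grouping below uses customer_unique_id. A quick way to see the distinction (the counts reflect whatever the merge produced):
n_distinct(data$customer_id)         # one id per order
n_distinct(data$customer_unique_id)  # one id per actual customer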
Now we move on to the actual RFM (Recency, Frequency, Monetary) analysis of the data.
To analyze Recency, we look at each customer’s last purchase in the data and measure how long ago it was.
To analyze Frequency, we count the number of distinct orders each customer has placed over the timeline of the data.
To analyze Monetary value, we sum the amount each customer has spent (in Brazilian reais, since Olist is a Brazilian company).
# Frequency analysis: number of distinct orders per customer
frequency <- data %>%
  group_by(customer_unique_id) %>%
  summarise(Frequency = n_distinct(order_id))
# Monetary analysis: total payment value per customer
monetary <- data %>%
  group_by(customer_unique_id) %>%
  summarise(Monetary = sum(payment_value))
# Ensure order_purchase_timestamp is in date-time format
data$order_purchase_timestamp <- ymd_hms(data$order_purchase_timestamp)
To perform the Recency analysis, note that this data ranges from 2016 to 2018. Measuring recency against today’s date for data that is more than 5 years old would be misleading, so we measure it against the latest recorded timestamp in the data.
max_time <- max(data$order_purchase_timestamp)
# Calculate Recency
recency <- data %>%
  group_by(customer_unique_id) %>%
  summarise(Last_Purchase_Date = max(order_purchase_timestamp)) %>%
  mutate(Recency = as.numeric(difftime(max_time, Last_Purchase_Date, units = "days")))
Now we join the three metrics into a single table keyed by the unique customer id, and normalize them.
frm_data <- merge(frequency, monetary, by = "customer_unique_id")
frm_data <- merge(frm_data, recency[,c("customer_unique_id", "Recency")], by = "customer_unique_id")
frm_data_normalized <- frm_data[, c("Frequency", "Monetary", "Recency")]
frm_data_normalized <- scale(frm_data_normalized)
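As a quick optional check, scale() should leave every column with mean 0 and standard deviation 1; it also stores the centering and scaling values as attributes, which is handy if we later want to score new data on the same scale:
round(colMeans(frm_data_normalized), 3)      # all 0 after centering
round(apply(frm_data_normalized, 2, sd), 3)  # all 1 after scaling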
We now move on to k-means clustering. In the introduction of this project, we said that we don’t know the exact number of segments. For our purposes, however, we can safely assume that there are no more than 10 types of customers. So we fit k-means for every number of clusters from 1 to 10 and record the within-cluster sum of squares (WCSS) for each, looking for the point where adding more clusters stops reducing the total variance appreciably.
set.seed(123)  # k-means uses random starts; fix the seed for reproducibility
wcss <- sapply(1:10, function(k) {
  kmeans(frm_data_normalized, centers = k, nstart = 20)$tot.withinss  # total WCSS
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
plot(1:10, wcss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
Here, we employ the Elbow Method to determine the number of clusters. The WCSS curve flattens out noticeably after 6 clusters, so let’s use 6.
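As an optional cross-check (not part of the pipeline above), factoextra, which we loaded earlier, can draw the same elbow plot in one line; note that it refits k-means for every k, so on data of this size it takes a while:
fviz_nbclust(frm_data_normalized, kmeans, method = "wss", k.max = 10)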
optimal_k <- 6  # the value identified by the Elbow Method
km.out <- kmeans(frm_data_normalized, centers = optimal_k, nstart = 20)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 4804750)
We see that the between-cluster sum of squares accounts for about 76.9% of the total variance (the between_SS / total_SS ratio that printing km.out reports), which isn’t too bad.
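The same ratio can be read directly off the fitted object, since betweenss and totss are standard components of the list that base R’s kmeans() returns:
km.out$betweenss / km.out$totss  # proportion of total variance explained, ~0.769
We now use these classifications to assign a cluster to each individual customer.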
frm_data$Cluster <- km.out$cluster
The clustering was done in three dimensions (Frequency, Monetary, Recency), which we cannot view all at once on a flat plot. However, we can use Principal Component Analysis to extract the two most informative components in the data so that we may visualize the clusters.
# since we can't visualize all three dimensions at once, perform PCA to extract the two most important components
pca_result <- prcomp(frm_data_normalized, center = TRUE, scale. = TRUE)
frm_data_pca <- data.frame(pca_result$x[,1:2]) # Extract the first two principal components
frm_data_pca$Cluster <- km.out$cluster # Add cluster information
ggplot(frm_data_pca, aes(x = PC1, y = PC2, color = as.factor(Cluster))) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(color = "Cluster")
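As an aside, factoextra also offers a one-liner that produces a very similar PCA-based view of a k-means fit, in case we prefer it to the manual prcomp route above:
fviz_cluster(km.out, data = frm_data_normalized, geom = "point")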
We have the 6 clusters, but we don’t exactly know what they mean. Let’s take a look at their characteristics:
cluster_profiles <- aggregate(frm_data[, c("Frequency", "Monetary", "Recency")],
                              by = list(Cluster = frm_data$Cluster),
                              mean)
cluster_profiles
## Cluster Frequency Monetary Recency
## 1 1 2.115566 292.0657 268.7701
## 2 2 1.000000 735.0417 279.4626
## 3 3 1.067194 2166.7901 287.1251
## 4 4 1.000000 126.2384 136.2352
## 5 5 1.000000 121.6694 305.0283
## 6 6 1.000000 127.1590 509.4116
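Before interpreting, it is worth putting the cluster sizes next to these averages, since a small cluster with an extreme mean tells a different story than a large one:
table(frm_data$Cluster)  # number of customers in each cluster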
From the RFM cluster profiles above, we can deduce the following (cluster numbers refer to the table; the labels echo the customer types hypothesized in the introduction):
Cluster 1 (loyal customers). Frequency: 2.1 (above 1, which suggests multiple purchases); Monetary: R$292.07 (moderate spending); Recency: 268.77 days (relatively recent activity). Interpretation: engaged customers who purchase relatively frequently and have shopped within the last year; not big spenders, but consistent. Strategy: encourage continued engagement through loyalty programs, personalized marketing, and regular updates on new products that match their purchase history.
Cluster 2 (periodic large spenders). Frequency: 1 (single purchase); Monetary: R$735.04 (moderately high spending); Recency: 279.46 days (about 9 months since last purchase). Interpretation: occasional shoppers, potentially making seasonal or periodic large purchases. Strategy: engage with targeted campaigns during likely purchase periods, and offer loyalty incentives for purchases outside their normal buying pattern.
Cluster 3 (big one-off spenders). Frequency: 1.07 (mostly single purchases, with a few repeat customers); Monetary: R$2166.79 (very high spending); Recency: 287.13 days (less recent activity). Interpretation: customers who made significant one-off purchases, possibly due to a major need or an attractive offer. Strategy: upsell related products or accessories, offer extended warranties, and keep them engaged with after-sales service.
Cluster 4 (recent one-time shoppers). Frequency: 1 (single purchase); Monetary: R$126.24 (similar to Clusters 5 and 6); Recency: 136.24 days (relatively recent purchase). Interpretation: newer customers who have made a single purchase recently; there is potential for them to return. Strategy: follow-up communication to encourage a second purchase, with time-sensitive offers that capitalize on their recent engagement.
Cluster 5 (lapsing one-time shoppers). Frequency: 1 (single purchase); Monetary: R$121.67 (lower spending); Recency: 305.03 days (almost a year since last purchase). Interpretation: one-time shoppers who tried a product or service but have not returned. Strategy: re-engagement campaigns with incentives for a second purchase, plus feedback collection to understand why they did not come back.
Cluster 6 (lost customers). Frequency: 1 (single purchase); Monetary: R$127.16 (lower spending); Recency: 509.41 days (a long time since last purchase). Interpretation: low-spending customers who have not engaged for a long time; likely churned or lost. Strategy: reactivation efforts with compelling deals or news of significant improvements to products or services since their last purchase.
With this clustering, we can monitor incoming customer data to predict which cluster new customers are likely to fall into and proactively manage our engagement with them. Additionally, we can track the migration of customers between clusters over time to identify trends and adjust strategies accordingly.
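To close the loop on that idea, here is a minimal sketch (not part of the analysis above, with made-up RFM values) of assigning a new customer to the nearest existing centroid; the key point is to reuse the centering and scaling values that scale() stored on the training matrix:
# Hypothetical new customer: 2 orders, R$300 spent, last purchase 30 days ago
new_customer <- c(Frequency = 2, Monetary = 300, Recency = 30)
ctr <- attr(frm_data_normalized, "scaled:center")  # means used by scale()
sds <- attr(frm_data_normalized, "scaled:scale")   # standard deviations used by scale()
new_scaled <- (new_customer[names(ctr)] - ctr) / sds
# Squared Euclidean distance to each of the 6 centroids; the nearest one wins
dists <- apply(km.out$centers, 1, function(cen) sum((new_scaled - cen)^2))
which.min(dists)  # predicted cluster for this hypothetical customer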