Instacart Grocery Basket Analysis

1. Project Goals

Motivation
Instacart, an online grocery store that operates through an app, wants to uncover more information about their sales patterns, with the goal of enhancing them.

Objective
Perform an Exploratory Data Analysis (EDA) of their 2017 sales data to derive insights on customers purchasing habits and suggest strategies for better segmentation and targeted marketing.

2. Key Questions

Many business questions were raised in the beginning of the project, and several others were added during the analysis. The following deserve a highlight:

What are the busiest days of the week and busiest hours of the day?
Are our customers spending more at certain times of the day/week?
Are certain types of products more popular than others, both in sales volume and revenue?
How are our customers segmented in:

Loyalty? (how often do they return to Instacart)
Based on loyalty, are there differences in ordering habits (frequency and volume)?
Same as above, but based on demographics like age, family status, income, etc.?
Do different segments show different patterns in the types of products purchased?

3. Process

The data for this project had two sources:

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed via Kaggle on 30/06/2025
A fictional customers dataset created in the scope of Career Foundry's Data Analytics course.

Due to the nature and size of data, where the largest dataset had over 32M records, Python was selected as the main tool for the project. Libraries like Pandas for all dataset handling, NumPy supporting some specific computational needs, and MatPlotlib together with Seaborn to develop the visualizations during the EDA and later for final reporting, add to the technology package used. All of this being managed behind by Anaconda, as library manager for Python, and using Jupyter Notebook for script development.
By specific request of the client, a final report was prepared in Excel, based on their own template.

Script development using Jupyter Notebook, calling main libraries used

All the scripts developed during the project are available in the Project's Github repository

Before the analysis, a careful preparation of the datasets was done taking care of all aspects of data cleaning as: find and fix mixed datatypes; handle missing values; find and address duplicate records; identify and fix outliers in numeric fields; remove unnecessary columns (mainly for memory optimization).
Below there is an overview of the population flow and the merging process between the datasets. In the final stages, a random slicing was needed due to memory restrictions. This was done using a method called seed() that ensures reproducibilty of this random selection.

Population Flow — Overview of population flow, from original datasets (in grey) to final merges.

4. Key Insights from Data

Distribution of orders and revenue across the day/week

The preferred time window for our customers to place orders is between 10:00 and 16:00.
The number of orders and revenue follow the same pattern throughout the day.

Saturday and Sunday are the preferred days to shop.
Number of orders and revenue follow the same pattern throughout the week.

Products sold and revenue across departments

Produce and dairy eggs rank on top of both products sold and revenue.
Other departments vary, but beverages and frozen appear on top 5 in both lists.

Loyalty

First off, why is customer loyalty key in retail business?

Among many other benefits, if customers shop more frequently with Instacart they will:

most likely shop way less with the competition, ensuring better overall market share
ensures a more steady and predictable stream of income

In this project, loyalty was defined by the number of orders placed as:

loyal with 40 orders or more
regular with 10 orders or more and less than 40
new with less than 10 orders

We can see that only 10% are loyal customers, but they generate 34% of the orders

Ordering Behaviours

Ordering behaviours can mean several different things. Several angles were explored in the project, with and without segmentation, such as:

most usual order day
days in between orders
average order cost
total number of orders
total spent

Some of the most interesting visualizations in this regard are displayed below.

Saturday is prefered by new and regular customers, while Sunday slightly wins in loyal group
Saturday is prefered by all income groups, except low income.
The low income group spread their orders throughout the week like any other group

Around 12% return exactly after 7 days, with 30% shopping within 7 days or less
95% of our loyal customers return within 7 days
Regular and new are much less consistent, with only 29% and 15% respectively

5. Results

Recommendations derived from key insights to operation, sales and marketing

Target/nudge 'regular' customers to increase loyalty base, and consequently number of orders
Run surveys to understand what would engage 'regular/new' customers to order more (and eventually more frequently)
Send reminders through various platforms (emails/push notifications) after an order with the promos that customers could use next time
Ensure that these promos expire on a weekly basis, to engage customers in using them before expiry, i.e, shop weekly or sooner
Ensure that staff on weekends is adjusted to guarantee compliance in the delivery times and customer satisfaction
Consider performing specific discounts on Tuesdays and Wednesdays (the less busy days), especially directed for loyal and regular customers (the responsible for the largest amount of orders), to try smooth out the demand across the week

Link to download Final Report (excel)

The full project documentation can be found on Github Repository, including this final report in Excel.

Personal Reflection

Working with a totally realistic scenario and a huge dataset (over 32 million records) was both challenging and exciting. However, the customers dataset created for the course was found to be very artifitial (several examples that ilustrate this were noted down along the scripts developed on the EDA phase).
Developing further my knowledge of Python, in particular in relevant libraries like Pandas, NumPy,MatPlotlib and Seaborn, was probably the biggest personal gain from this project.
This project clearly highlighted that one learns much more by doing, and many times researching 'how to' and adapt to one's own needs.