Market Basket Analysis Example Report Assignment

This week we welcome Guest Blogger, Bill Grant from the AllGo Analytics Limited. Bill is an experienced data professional with a new business focused on performing data analytics for small and medium-sized business.

Take it away Bill …

FACT SHEET 1: What is Market Basket Analysis?

Association Rules

There are many ways to see the similarities between items. These are techniques that fall under the general umbrella of association.

The outcome of this type of technique, in simple terms, is a set of rules that can be understood as:“if this, then that”.

Applications

So what kind of items are we talking about? There are many applications of association:

  • Product recommendation – like Amazon’s “customers who bought that, also bought this”
  • Medical diagnosis – like with diabetes
  • Content optimisation – like in magazine websites or blogs or even menu design for a restaurant
  • DNA genome analysis – patterns in cellular data
  • Fraud detection in finance

Show me an example of its use in real life

You have 30 days of transaction receipts from your grocery – a total of nearly 10,000. How can you use this valuable information to help uncover relationships between seemingly unrelated data – what goes with what? How might I lay out my store for customer convenience?

In this example you have:

9,835 transactions, ie that many separate purchases over the month.This represents around 30 transactions per hour over 12 hour trading days

169 possible individual products – this is the range of different items you stock

33 individual items in the largest transaction – this is how many items were in the biggest sale

This is too much to calculate manually or try to understand visually, we want to automate this and can do it using Association Rules, to find the patterns of association between items in this large data set.

With the right software and having the data in a simple table, we can easily find these relationships with minimal expert assistance and repeat the same analysis on other months’ data (i.e. we can replicate the process).

Using RapidMiner Studio (available free on-line from RapidMiner GmbH) we can set up the linked set of analytic modules and get the results we need in literally seconds.For our grocery data we see:

These are the strongest rules out of the 170 that were found. Clearly, whole milk is a key item in combination with others.

Noting the following technical terms from the table of rules above:

  • Support: The fraction (%) of which our item set occurs in our dataset.
  • Confidence: The probability that a rule is correct for a new transaction with items in the If column.
  • Lift: The ratio by which by the confidence of a rule exceeds the expected confidence.
    Note: if the lift is 1 it indicates that the items on the “If” and “Then” sides are not linked to each other.A lift greater than one means a positive relationship.

The graph below shows the relationship rules visually, i.e. in the top right-hand corner we have yogurt and curd together (the “If”) then linking to whole milk (“Then”).

Technical Process Visualisation

  1. 1. “Read CSV” – import the data in csv (from Excel) format
  2. 2. “Numerical to Binomial” – convert the data, which was a series of 1’s and 0’s for whether each item from the range is in a transaction (1) or not (0) to equivalent “true” and “false”. This change of format is required for the next step.
  3. 3. “FP-Growth” – pass the data through this algorithm module to generate the frequent item data sets (FP = frequent pattern and it uses a tree data structure)
  4. 4. “Create Association Rules” – use the output from step 3 to generate the rules shown above in this Fact Sheet to generate the “If, Then” rules and technical calculations (lift and the others) to describe the patterns found

RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and industrial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the data mining process including results visualization, validation, and optimization.

RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Those workflows are called “Process” in RapidMiner and they consist of multiple “Operators”. Each operator is performing a single task within the process and the output of each operator forms the input of the next one.

In 2014, Gartner Research placed RapidMiner in the leader quadrant of its Magic Quadrant for Advanced Analytics. The report described RapidMiner’s strengths as a “platform that supports an extensive breadth and depth of functionality, and with that, it comes quite close to the market Leaders. RapidMiner has received over 3 million total downloads and has over 200,000 users including eBay, Intel, PepsiCo and Kraft Foods as paying customers. [Sourced from Wikipedia, September 2015]

Graphical rule visualisation used: KK Layout format

Thanks Bill for being a guest blogger. You can read more about RapidMiner here. If you would like to write a guest blog for OptimalBI please contact me, Jack Prichard at jack.prichard@optimalbi.com

Like this:

LikeLoading...

The Shopping Basket Analysis tool helps you find in your data. An association might tell you which items are frequently purchased at the same time. In data mining, this technique is a well-known method known as market basket analysis, used to analyze the purchasing behavior of customers in very large data sets. Marketers might use the information to make recommend related products to customers and to promote related products by placing them in close proximity on web pages, in catalogs, or on the shelf.

To use shopping basket analysis, the items that you want to analyze must be related by a transaction ID. For example, if you are analyzing all the orders received through a Web site, each order would have an order ID or transaction ID that is associated with one or more purchased items.

When the wizard finishes analyzing the data, it creates two new worksheets, Shopping Basket Item Groups and Shopping Basket Rules.

The Shopping Basket Item Groups worksheet contains a list of the items that frequently appear together in transactions. These common groupings are called itemsets. The worksheet also contains statistics, such as support and lift, to help you understand the significance of the itemset. If price information is available, the worksheet also creates a sum of the value of all the related items, to give an indication of the total value of the transactions.

You can filter and sort on the columns in the report. For example, you might want to view only those itemsets with 2 or more products, or order the itemsets by Average Basket Value.

The Shopping Basket Rules worksheet uses the statistics derived from analysis to create rules about how items are related. For example, a rule might be that if customers purchase Product A, they are highly likely to purchase Product B. The rules can be used to create recommendations. Each rule has supporting statistics that help you evaluate the potential strength of the rule, so that you can make a recommendation only if the rule exceeds a certain probability threshold.

  1. Open an Excel table that contains appropriate data. In the sample workbook, click the Associate worksheet.

  2. Click Shopping Basket Analysis.

  3. In the Shopping Basket Analysis dialog box, choose the column that contains the transaction ID, and then choose the column that contains the items or products you want to analyze.

  4. Optionally, you can add a column that contains product values.

  5. ClickAdvancedto open the Advanced Parameters Setting dialog box. Increase the value for Minimum support to reduce the number of products that are grouped as itemsets. Increase the Minimum rule probability to filter out very common itemsets.

Requirements

To use the Shopping Basket Analysis tool, your data must be stored in an Excel table and must contain the following columns:

  • Column that contains a unique ID representing the transaction. The ID can be numeric or text, so long as the value in each row is unique.

  • Column that contains the item or product you are analyzing.

  • An optional numeric column that represents the price or value of each item. This column is used to aggregate the value of the itemsets each product is found in, and can help you understand the total value of certain transactions.

The individual items that you are analyzing must be grouped by some identifier that represents the case, transaction, or occasion. Therefore, you would choose this transaction ID column as the identifier, and not the customer ID number or product ID number.

When the tool examines the products within each transaction, it creates an itemset for each combination of items that it finds. For example, if a customer bought three items on a single visit, there are 7 possible itemsets: each product considered on its own, each product grouped with one other product, and the combination of all three products.


You can filter out the itemsets that contain single items, but the tool needs to analyze these to generate meaningful statistics for the data set.

Support for each itemset is calculated as the number of customers who buy an itemset. In the example just stated, if there is just one customer who purchased 3 items, with 7 possible itemsets, each of the 7 itemsets has a support value of 1. As the number of customers grows, and the number of possible combinations grows, it can take much longer to process the report. However, some itemsets might have very small support. Therefore, you might decide to reduce the time it takes to generate the report by limiting the number of items in each itemset to 3 or less. Generally, larger itemsets have much lower support, so the trade-off is acceptable.

As your data set grows larger, the number of possible item groupings and rules can become overwhelming. However, you can control the number of results that are output by the tool, to focus on only the most valuable itemsets and rules. You set these options in the Shopping Basket Advanced Parameters Dialog Box.

Minimum Support

Minimum support means the number of transactions that must contain a particular itemset for the itemset to be considered significant. For example, you might not be interested in an itemset unless it had been purchased in at least 10 different transactions. There are two ways to control the threshold for itemset significance, and both use the Minimum support parameter.

As an absolute value: Enter a number that represents the count of transactions that contain the target items. For example, if you enter 10, any set of items that appears in at least 10 shopping baskets is included in the results.

As a percentage: Enter a number that represents a percentage of the entire collection of itemsets. For example, if you specify 10, all the itemsets are counted and the targeted itemset must appear constitute at least 10 percent of that total number of itemsets. If you have a very large dataset, using percentages instead of a count can help you focus on the most important item groupings.


Remember, the number of itemsets is different from the number of transactions in your data. Each transaction can contain multiple itemsets; however, most itemsets repeat several times across the dataset.

Rule Probability and Rule Importance

The probability of a rule describes how likely the result of a rule is to occur. Rule probability is calculated by using the frequency of the itemsets that support a rule. If an itemset occurs very rarely, it will have a low probability.

However, rules that have high probability may not always be useful. They might indicate itemsets that are purchased frequently and therefore need no additional promotion. The importance is designed to measure the usefulness of a rule. Sometimes a rule might have very high probability but low importance, because the prediction does not provide new information. For example, if every itemset contains a specific state of an attribute, a rule that predicts that state is trivial, even though the probability is very high.

You should experiment with these settings to see different results and determine which setting yields the most interesting rules.

The Shopping Basket Analysis tool creates two complementary reports. The first report, titled Significant Groups of Items Identified during Analysis, provides a list of all the itemsets that were found. You can use the new table tools in Microsoft Excel to sort, filter, and explore the data.

The second report, titled Shopping Basket Rules, tells you what kind of inferences can be made based on the itemsets listed in the first report. Whereas the list of itemsets is more useful for exploring and understanding your data, the list of rules is more useful for making predictions and recommendations.

Shopping Basket Item Groups Report

This report contains a list of all the possible combinations of items that were found in your data set. For example, if your transaction data contains orders, for each order the Shopping Basket Analysis tool calculates how many times the individual item was ordered, and then calculates all combinations of that item with other items.

The report lists the itemsets that were found in order of their lift. Lift is a score that tells you the importance of the itemset.

Column in reportWhat it tells you
Group of ItemsLists the itemsets, or combination of items.
Group SizeA count of the number of items in the itemset. You can filter on this field to see only pairs of items, single items, etc.
SupportA count of the number of cases where this combination occurred. You can sort on this column to see the itemsets that are most common.
Average ValueA sum of the value of the items in just this itemsets, divided by the support. You can sort and filter on this column to target products in different price ranges.
Average Basket ValueA sum of the values of all items in orders that contain this itemset, divided by support. This statistic is interesting when paired with the average value of the itemset.
LiftA score that represents how interesting this itemset is in the entire data set. Lift is calculated by getting the probability of the two items occurring together, and the dividing by the probability of two items occurring independently. As a result, if there is a strong correlation between the items, lift score will be higher.

Shopping Basket Rules Report

This report contains a set of rules that were built by analyzing the itemsets that were found. For example, if your transaction data revealed that Products A and B were frequently purchased together, the Shopping Basket Analysis tool will create a rule that predicts A given the existence of B, or B given the existence of A.

Each rule is associated with a probability, derived from the supporting data. These probabilities are useful when making recommendations. For example, you might only want to see rules that have at least a 50 percent chance of being accurate, based on existing data.

The report lists the itemsets that were found in order of their lift. Lift is a score that tells you the importance of the itemset.

Column in reportWhat it tells you
Existing ItemsLists the items that are needed to make a recommendation.

In data mining, these items are said to be on the left side of the association rule.
Predicted ItemLists the item to recommend.

In data mining, these items are said to be on the right side of the association rule.
ProbabilityDisplays the probability that this rule is correct.
SupportIndicates the number of cases in existing data that provide evidence for this rule.
Rule ValueIf you provide a value for the items in the shopping basket, this column calculates the worth of the prediction, given the cost of the items.
LiftIndicates the strength of the correlation between the items in the first column and the items in the second column. Also referred to as importance.

A lift of 0 means that there is no correlation. A positive value means that the items in the first column predict the item in the second column. The higher the number, the stronger the correlation.

The Data Mining Client for Excel, which is a separate add-in that provides more advanced data mining functionality, also contains a wizard that performs association analysis. For more information, see Associate Wizard (Data Mining Client for Excel).

For more information about the algorithm used to perform this analysis, see the topic "Microsoft Association Algorithm" in SQL Server Books Online.

Table Analysis Tools for Excel

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *