SQL Server Gems: Association Rule Mining

To understand association rule mining, let us first review some basic terminology:

Item: An attribute/value pair, e.g. Item/O2 Mini
Itemset: A combination of items in a single transaction

Minimum support: The number of transactions an itemset must appear in before it is considered significant (frequent).

Association Rule: X -> Y
-> indicates that Y is predicted by X

Given a set of transactions T, a rule X -> Y has:

Support, s
Percentage of transactions in T that contain X U Y (i.e. both X and Y)

Confidence, c
Percentage of transactions in T containing X that also contain Y

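Now, let us consider the following example, using the four transactions listed in the table below:
Total transactions: 4

Consider the rule {O2 Mini} -> {Wireless Card}
Support: 75%

Note: Count how many transactions contain the itemset {O2 Mini, Wireless Card}. Three of the four (100, 300 and 400) do, so the support is 3/4 = 75%.

For the same rule {O2 Mini} -> {Wireless Card}
Confidence: 75%

Note: All four transactions contain O2 Mini, and three of them also contain Wireless Card, so the confidence is 3/4 = 75%.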

TransactionID   Customer Name   Product Name
-------------------------------------------------
100             Kris            O2 Mini
100             Kris            Wireless Card
100             Kris            Leather Pouch
200             Jenny           O2 Mini
200             Jenny           Leather Pouch
300             Jamie           O2 Mini
300             Jamie           Wireless Card
300             Jamie           JABRA Headset
300             Jamie           Leather Pouch
400             Eric            O2 Mini
400             Eric            Wireless Card
400             Eric            Leather Pouch
400             Eric            JABRA Headset
-------------------------------------------------
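The two measures can be checked against the table with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are made up for illustration and are not part of SQL Server, and the product names are written with consistent capitalisation.

```python
# Transactions from the table above, keyed by TransactionID.
transactions = {
    100: {"O2 Mini", "Wireless Card", "Leather Pouch"},
    200: {"O2 Mini", "Leather Pouch"},
    300: {"O2 Mini", "Wireless Card", "JABRA Headset", "Leather Pouch"},
    400: {"O2 Mini", "Wireless Card", "Leather Pouch", "JABRA Headset"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Fraction of the transactions containing lhs that also contain rhs."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


rule_support = support({"O2 Mini", "Wireless Card"}, transactions)
rule_confidence = confidence({"O2 Mini"}, {"Wireless Card"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
# prints: support = 75%, confidence = 75%
```

Note that confidence is computed as support(X U Y) / support(X), which is the same as counting, among the transactions that contain X, how many also contain Y.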

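The association rule mining algorithm used in SQL Server 2005 is a variant of the Apriori algorithm. The association rule mining problem was first posed in 1993 by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, and the Apriori algorithm itself was published in 1994 by Rakesh Agrawal and Ramakrishnan Srikant.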

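To reduce redundant counting, the algorithm makes use of the Apriori property: every subset of a frequent itemset is also a frequent itemset. Put the other way round, an itemset that has an infrequent subset cannot be frequent, so it can be discarded without being counted. This property is crucial to reducing the number of candidate frequent itemsets generated.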
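As a rough illustration of how the property prunes candidates, the sketch below checks whether a 3-item candidate has any 2-item subset that is not frequent. The function name `has_infrequent_subset` and the list of frequent pairs are hypothetical, chosen only to show the check; they are not derived from the table and say nothing about how SQL Server implements it internally.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_smaller):
    """Return True if any (k-1)-item subset of candidate is not frequent."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_smaller
        for subset in combinations(candidate, k - 1)
    )


# Hypothetical frequent 2-itemsets found by an earlier pass (illustrative only).
frequent_pairs = {
    frozenset({"O2 Mini", "Wireless Card"}),
    frozenset({"O2 Mini", "Leather Pouch"}),
    frozenset({"Wireless Card", "Leather Pouch"}),
    frozenset({"O2 Mini", "JABRA Headset"}),
}

candidates = [
    frozenset({"O2 Mini", "Wireless Card", "Leather Pouch"}),
    frozenset({"O2 Mini", "Wireless Card", "JABRA Headset"}),
]

# Only candidates whose 2-item subsets are all frequent need to be counted.
kept = [c for c in candidates if not has_infrequent_subset(c, frequent_pairs)]
print(kept)
```

The second candidate is dropped because {Wireless Card, JABRA Headset} is not in the frequent pairs, so the transactions never have to be scanned for it.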

The generic flow of most association rule mining algorithms is as follows:
- Scan the transactions in a database
- Identify items that have a high probability of being grouped together in a transaction
- Group the items into itemsets
- Generate rules using the itemsets
- When specific items appear, make use of the generated rules to predict the presence of another item in the same transaction
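The first three steps of this flow (scan, identify, group) can be sketched as a level-wise search in the style of Apriori. This is a simplified illustration, not the SQL Server implementation; the function name `frequent_itemsets` and the 75% threshold are assumptions made for the example.

```python
from itertools import combinations

# The four transactions from the example table, as sets of product names.
transactions = [
    {"O2 Mini", "Wireless Card", "Leather Pouch"},
    {"O2 Mini", "Leather Pouch"},
    {"O2 Mini", "Wireless Card", "JABRA Headset", "Leather Pouch"},
    {"O2 Mini", "Wireless Card", "Leather Pouch", "JABRA Headset"},
]


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for every frequent itemset."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # 1-item candidates
    result = {}
    while current:
        # Scan the transactions and keep candidates that meet the threshold.
        level = {}
        for candidate in current:
            count = sum(1 for t in transactions if candidate <= t)
            if count / n >= min_support:
                level[candidate] = count / n
        result.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, keeping only
        # those whose k-item subsets are all frequent (Apriori property).
        frequent = list(level)
        k = len(frequent[0]) + 1 if frequent else 0
        next_candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                next_candidates.add(union)
        current = list(next_candidates)
    return result


freq = frequent_itemsets(transactions, min_support=0.75)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
```

With a 75% threshold, JABRA Headset (50%) is dropped at the first level, so no larger itemset containing it is ever counted.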

To generate rules from a frequent itemset X with support s, as identified in the frequent itemset generation step, we first divide X into a left-hand side (LHS) and a right-hand side (RHS) to get the rule

LHS -> RHS

Example: {O2 Mini, Wireless Card} -> {Leather Pouch}, Confidence: 100% (all three transactions that contain both O2 Mini and Wireless Card also contain Leather Pouch)

In the SQL Server 2005 viewer, confidence is also referred to as the probability.
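The rule generation step can be sketched in Python as follows: split a frequent itemset into every non-empty LHS/RHS pair and keep the rules whose confidence meets a threshold. The function name `rules_from_itemset` and the 90% minimum confidence are assumptions made for this illustration only.

```python
from itertools import combinations

# The four transactions from the example table, as sets of product names.
transactions = [
    {"O2 Mini", "Wireless Card", "Leather Pouch"},
    {"O2 Mini", "Leather Pouch"},
    {"O2 Mini", "Wireless Card", "JABRA Headset", "Leather Pouch"},
    {"O2 Mini", "Wireless Card", "Leather Pouch", "JABRA Headset"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, min_confidence):
    """Split itemset into LHS -> RHS in every possible way; keep confident rules."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs_items in combinations(sorted(itemset), size):
            lhs = frozenset(lhs_items)
            rhs = itemset - lhs
            # confidence(LHS -> RHS) = support(LHS U RHS) / support(LHS)
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


rules = rules_from_itemset(
    {"O2 Mini", "Wireless Card", "Leather Pouch"}, min_confidence=0.9
)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"confidence {conf:.0%}")
```

Three of the six possible splits pass the threshold, including the example rule {O2 Mini, Wireless Card} -> {Leather Pouch} at 100%. Rules with only O2 Mini or Leather Pouch on the left reach just 75%, because those items appear in every transaction.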

Thursday, January 19, 2006