SQL Server Gems

Sunday, March 12, 2006

Decision Trees (Part 3c) - Entropy and Information Gain

If the attribute to be predicted can take on c possible values, then the entropy of D relative to this c-wise classification is defined as:

$$\mathrm{Entropy}(D) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

where p_i is the proportion of D belonging to class i.
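As a minimal sketch of this definition, the Python function below computes entropy from a list of per-class counts. The name `entropy` and the count-based input are illustrative choices rather than anything from a particular library, and empty classes are skipped under the usual convention that 0 log 0 is taken as 0.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    result = 0.0
    for n in counts:
        if n > 0:  # skip empty classes: 0 * log2(0) is treated as 0
            p = n / total
            result -= p * math.log2(p)
    return result

print(round(entropy([7, 7]), 3))   # evenly split two-class set
print(round(entropy([14, 0]), 3))  # pure set, all examples in one class
print(round(entropy([9, 5]), 3))   # 9 examples in one class, 5 in the other
```

An evenly split two-class set gives the maximum entropy of 1 bit, and a pure set gives 0.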

To measure the effectiveness of an attribute in classifying the training data:

- Measure the expected reduction in entropy caused by partitioning the examples according to that attribute.

- This expected reduction is called the Information Gain of the attribute.

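The Information Gain of an attribute A over a set of data D is defined as:

$$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v)$$

where Values(A) is the set of all possible values for attribute A, and D_v is the subset of D for which attribute A has value v.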
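The definition can be sketched in Python as follows. This is only an illustration under assumptions: the function names, the dictionary-per-row layout, and the tiny four-row dataset with attributes `A` and `B` are all hypothetical and chosen just to show the two extreme cases.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(D, A): entropy of D minus the weighted entropy of each subset D_v."""
    # Group the target labels by the value each row has for `attribute`.
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])

    all_labels = [row[target] for row in rows]
    # Weighted sum over v in Values(A) of |D_v| / |D| * Entropy(D_v).
    remainder = sum(len(labels) / len(rows) * entropy(labels)
                    for labels in subsets.values())
    return entropy(all_labels) - remainder

# Hypothetical toy data: A separates the classes perfectly, B not at all.
rows = [
    {"A": "x", "B": "p", "label": "+"},
    {"A": "x", "B": "q", "label": "+"},
    {"A": "y", "B": "p", "label": "-"},
    {"A": "y", "B": "q", "label": "-"},
]
print(information_gain(rows, "A", "label"))
print(information_gain(rows, "B", "label"))
```

An attribute that splits the data into pure subsets recovers all of the entropy of D as gain, while an attribute whose subsets are as mixed as D itself has a gain of zero.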

Suppose D consists of training examples described by the attribute PropertyType (HDB or Private). Assume D has 14 examples, [9+, 5-], that is, 9 positive and 5 negative. Suppose 6 of the positive and 2 of the negative examples have PropertyType = HDB, and the remainder have PropertyType = Private.

In summary,
D = [9+, 5-]
D_HDB = [6+, 2-]
D_Private = [3+, 3-]

Then,
Gain(D, PropertyType)
= Entropy(D) - (8/14) Entropy(D_HDB) - (6/14) Entropy(D_Private)
= 0.940 - (8/14)(0.811) - (6/14)(1.00)
= 0.048
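
For reference, the three entropy values used in the calculation above follow directly from the entropy definition, applied to the full set [9+, 5-], the HDB subset [6+, 2-], and the Private subset [3+, 3-]. The working below is a sketch of that arithmetic, rounded to three decimal places:

```latex
\begin{aligned}
\mathrm{Entropy}([9+, 5-]) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.410 + 0.531 \approx 0.940 \\
\mathrm{Entropy}([6+, 2-]) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.311 + 0.500 = 0.811 \\
\mathrm{Entropy}([3+, 3-]) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 0.500 + 0.500 = 1.000
\end{aligned}
```

The gain is small because the Private subset is evenly split and the HDB subset is only moderately purer than D itself, so partitioning on PropertyType removes little uncertainty.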
