Decision Tree (Part 3a)
Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance. An example of a decision tree is shown below.

For example, we can use the decision tree above to determine whether to approve a user's credit card application. First, we check the person's outstanding debt, and use the result to determine which other factors (i.e. attributes) to check further down the tree. The process continues until we reach a leaf node (i.e. a terminal node of the tree), which tells us whether to approve or reject the application.
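The root-to-leaf walk described above can be sketched in a few lines of Python. This is only an illustration: the root test on outstanding debt comes from the example, while the debt threshold, the `has_steady_income` attribute, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # final classification, e.g. "approve" or "reject"


@dataclass
class Node:
    attribute: str                    # attribute tested at this node
    branch: Callable[[object], str]   # maps the attribute value to a branch name
    children: Dict[str, Union["Node", Leaf]]


def classify(tree, instance):
    """Walk from the root to a leaf, following one branch per attribute test."""
    node = tree
    while isinstance(node, Node):
        value = instance[node.attribute]
        node = node.children[node.branch(value)]
    return node.label


# Hypothetical tree: only the root attribute is taken from the text above;
# the threshold and the second attribute are made up for illustration.
credit_tree = Node(
    attribute="outstanding_debt",
    branch=lambda debt: "high" if debt > 10_000 else "low",
    children={
        "high": Leaf("reject"),
        "low": Node(
            attribute="has_steady_income",
            branch=lambda steady: "yes" if steady else "no",
            children={"yes": Leaf("approve"), "no": Leaf("reject")},
        ),
    },
)

applicant = {"outstanding_debt": 2_500, "has_steady_income": True}
print(classify(credit_tree, applicant))
```

Each internal node tests exactly one attribute, so the number of checks for any applicant is at most the depth of the tree.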
An interesting question is how the decision tree is generated, and what determines which attribute sits at the top of the tree. If the amount of data is small, we could probably eyeball it and draw the tree manually. However, if the amount of data is large, we need automatic techniques to generate the tree.
In the research community, many decision tree algorithms have been proposed over the years. These include (non-exhaustively) ID3 (the classic), C4.5 / C5.0, and many more. The source code for some of these (including visualization tools) is publicly available.
In fact, decision tree algorithms have been available since SQL Server 2000. However, due to the lack of visualization tools, they were seldom used. In SQL Server 2005, one can now create an Analysis Services project and make use of graphical tools to create, manipulate, and visualize the models. This is a great productivity boost!
In SQL Server 2005, the decision tree algorithm makes use of 3 types of scoring methods:
- Entropy (1)
- Bayesian with K2 Prior (3)
- Bayesian Dirichlet Equivalent with Uniform Prior (4)
Interestingly, method 2 (orthogonal) has been dropped. The scoring method is used to evaluate which attribute to place at the top of the decision tree and which attributes to place at subsequent levels. Of the 3 scoring methods, Entropy is the easiest to understand, and I will use it to illustrate how attributes are chosen in Part 3b.
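As a rough preview of what an entropy-based score measures, here is a minimal sketch of the textbook Shannon entropy and the information gain used by ID3-style algorithms. It is not necessarily how SQL Server computes its Entropy score; the function names and the four sample rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Made-up training data, for illustration only.
rows = [
    {"debt": "high", "income": "yes", "approved": "no"},
    {"debt": "high", "income": "no", "approved": "no"},
    {"debt": "low", "income": "yes", "approved": "yes"},
    {"debt": "low", "income": "no", "approved": "yes"},
]

for attribute in ("debt", "income"):
    print(attribute, information_gain(rows, attribute, "approved"))
```

In this toy data, splitting on `debt` separates the approved and rejected rows perfectly, while splitting on `income` tells us nothing, so an entropy-based score would put `debt` at the top of the tree.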
