SQL Server Gems

Thursday, February 16, 2006

Feature Selection

An interesting discussion on Feature Selection in SQL Server 2005 by Peter Kim.

"....

These are parameters for automatic feature selection by the algorithms. Depending on the algorithm, the feature selection method may differ. For Naive Bayes and Clustering, we use an entropy-based interestingness score, which measures how "interesting" an attribute is. For instance, customer phone numbers would be less interesting than gender. The interestingness score is calculated as I(A) = -(m - E(A))^2, where E(A) = -sum_i p_i ln(p_i) is the entropy of attribute A and m is a magic number.

For Decision Trees, we use the same interestingness score for output attribute feature selection. Then we calculate a split score for each input attribute against the selected output attributes. The input feature selection is based on the calculated split score. This effectively determines which input attributes are worth considering and which are not, given the selected output attributes. "
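
To make the quoted entropy-based score concrete, here is a minimal Python sketch under stated assumptions: the probabilities p_i are estimated from value frequencies, and the magic number m is an arbitrary placeholder, since the post doesn't say how SQL Server chooses it.

import math
from collections import Counter

def entropy(values):
    # E(A) = -sum_i p_i * ln(p_i), with p_i estimated from value frequencies
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def interestingness(values, m=1.0):
    # I(A) = -(m - E(A))^2; m = 1.0 is an assumed placeholder here.
    # Scores closer to 0 (entropy near m) count as more interesting.
    return -(m - entropy(values)) ** 2

# A near-unique attribute such as phone numbers has very high entropy,
# while gender has moderate entropy, so gender scores as more interesting.
phones = ["555-%04d" % i for i in range(100)]
gender = ["M", "F"] * 50
print(interestingness(phones))  # about -13.0: entropy far from m
print(interestingness(gender))  # about -0.09: entropy close to m

Note that the score is always non-positive and peaks at 0 when E(A) = m, so m acts as a target entropy: attributes whose entropy is near m rank highest, while both near-constant and near-unique attributes (like phone numbers) rank low.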
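
The post doesn't name the split score used for Decision Trees, so the following sketch substitutes information gain (the reduction in output entropy when splitting on an input attribute) as a stand-in, reusing the entropy helper above; the point is the ranking scheme, not the exact score SQL Server uses.

def information_gain(inputs, outputs):
    # Stand-in split score: output entropy minus the weighted entropy
    # of the output within each group induced by the input attribute.
    n = len(outputs)
    groups = {}
    for x, y in zip(inputs, outputs):
        groups.setdefault(x, []).append(y)
    split_entropy = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(outputs) - split_entropy

def select_inputs(columns, output, top_k=1):
    # Rank candidate input attributes by split score against the
    # selected output attribute and keep the top_k best of them.
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], output),
                    reverse=True)
    return ranked[:top_k]

output = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
columns = {
    "age_band": ["young", "young", "old", "old", "young", "old", "young", "old"],
    "region":   ["n", "s", "n", "s", "n", "s", "n", "s"],
}
print(select_inputs(columns, output))  # ['age_band'] splits the output cleanly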
