Trees work pretty well

From the Kaggle newsletter:

Kaggle has a new #1 ranked data scientist. Congratulations José Guerrero! He's worked in the health sector in Spain for more than 25 years, and is currently chomping at big databases at the region's main hospital. He has a BSc & MSc in Mathematics, Statistic and Operations Research and did his postgraduate work in Scientific Programming. Perhaps that's helped out on Kaggle ... José says, “My first option with a dataset is almost always tree-based (boosted or bagged). Trees are robust, manage unknown data well, and have ability for interaction modeling.” José mainly uses R and more recently Python with scikit-learn. And what is his view getting there at the top? “I learn in every challenge, and the community interaction is really amazing.”

When I have a big database, or I want to know whether the variables I have can explain variation in the response variable, one of the first things I might do is fit a random forest. I do a brief data exploration and then move on to the actual modeling. See here.
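A rough sketch of that first pass, in Python with scikit-learn, might look like the following. It assumes a numeric feature matrix and a continuous response; the synthetic data, the helper name `quick_forest_check`, and the parameter values are all made up for illustration and are not taken from the newsletter or from any real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def quick_forest_check(X, y, n_trees=200, seed=0):
    """Fit a random forest and return (out-of-bag R^2, feature importances).

    Hypothetical helper for illustration only.
    """
    forest = RandomForestRegressor(
        n_estimators=n_trees,
        oob_score=True,      # score each row using only trees that did not see it
        random_state=seed,
    )
    forest.fit(X, y)
    return forest.oob_score_, forest.feature_importances_


# Synthetic stand-in for a real table: 5 candidate variables, 2 of them pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# The response depends on column 0 directly and on an interaction of columns 1 and 2.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=500)

oob_r2, importances = quick_forest_check(X, y)
print(f"out-of-bag R^2: {oob_r2:.2f}")
for i, imp in enumerate(importances):
    print(f"variable {i}: importance {imp:.3f}")
```

An out-of-bag R^2 well above zero suggests the variables carry some signal about the response, and the importances give a rough first ranking. Both are only a starting point before more careful modeling.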

