
Python scikit-learn users coming to R often ask "where is the one-hot encoder?" (as it isn't discussed as much in R as it is in scikit-learn) and even supply a number of (low quality) one-off packages "porting one-hot encoding to R."

Many R users are not familiar with the issue, as encoding is hidden in model training, and how to encode new data is stored as part of the model. The main place an R user needs a proper encoder (that is, an encoder that stores its encoding plan in a conveniently re-usable form, which many of the "one-off ported from Python" packages actually fail to do) is when using a machine learning implementation that isn't completely R-centric. One such system is xgboost, which requires (as is typical of machine learning in scikit-learn) data to already be encoded as a numeric matrix instead of a heterogeneous structure such as a data.frame. This requires explicit conversion on the part of the R user, and many R users get it wrong (fail to store the encoding plan somewhere). For example, stats::model.matrix() re-derives its encoding from whatever data it is handed, so a test frame whose x column happens to contain only the levels 'b' and 'c' yields a different column layout than the training data did:

dTest <- data.frame(x = c('b', 'c'))
stats::model.matrix(~x, dTest)
#   (Intercept) xc

A model trained on data that also contained the level 'a' would expect columns xb and xc, so this re-derived encoding is incompatible with that model. This mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data).

To make this concrete, let's work a simple example: let's try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms); we are just plugging the obvious variable into xgboost. As we said, xgboost requires a numeric matrix for its input, so unlike many R modeling methods we must manage the data encoding ourselves (instead of leaving that to R, which often hides the encoding plan in the trained model).

vtreat::designTreatmentsZ() has a number of useful properties; in particular, it does not look at the outcome values, so it does not require extra care in cross-validation. We usually forget to teach vtreat::designTreatmentsZ(), as it is often dominated by the more powerful y-aware methods vtreat supplies (though not for this simple example). Also note: differences observed in performance that are below the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).
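To see how these pieces fit together, here is a minimal sketch of the workflow (using hypothetical toy data rather than the Titanic frame, and assuming the classic xgboost(data =, label =, nrounds =) interface): design a y-free treatment plan once, then re-apply the stored plan to any data that needs scoring.

library(vtreat)
library(xgboost)

# Hypothetical toy stand-in data (the article's actual example uses Titanic).
dTrain <- data.frame(x = c('a', 'b', 'b', 'c', 'a', 'c'),
                     y = c(1, 0, 1, 0, 1, 0))
dTest <- data.frame(x = c('b', 'c'))

# Design a y-free treatment plan: designTreatmentsZ() never looks at the
# outcome, so it needs no special cross-validation care.
plan <- vtreat::designTreatmentsZ(dTrain, varlist = 'x', verbose = FALSE)
vars <- plan$scoreFrame$varName

# prepare() re-applies the stored plan, so train and test always get the
# same numeric columns, even though dTest is missing the level 'a'.
trainM <- as.matrix(vtreat::prepare(plan, dTrain)[, vars, drop = FALSE])
testM  <- as.matrix(vtreat::prepare(plan, dTest)[, vars, drop = FALSE])

# xgboost accepts only numeric matrices, which the treated frames supply.
model <- xgboost::xgboost(data = trainM, label = dTrain$y,
                          nrounds = 5, objective = "binary:logistic",
                          verbose = 0)
preds <- predict(model, testM)

Because the plan is designed once and merely re-applied, training, test, and future application data all receive identical numeric columns, which is exactly the guarantee the one-off "ported" encoders tend to lose.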
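For contrast, here is a small sketch (again with hypothetical toy data) of why most R users never hit this problem in day-to-day work: R's built-in modeling functions carry their encoding plan inside the fitted model.

# stats::lm() stores the factor encoding in the fitted model, so predict()
# encodes new data consistently even though 'a' is absent from the test set.
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'), y = c(1, 2, 1, 2))
m <- stats::lm(y ~ x, data = dTrain)
predict(m, newdata = data.frame(x = c('b', 'c')))

It is only when we leave this managed path (as with xgboost above) that the encoding bookkeeping becomes our job.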