XGBoost applied to the Iris dataset

Here is a very simple example of applying XGBoost to the Iris dataset. The original Iris dataset is hosted at the UCI data repository, but we will use the ready-to-use copy that ships with sklearn.

Run Jupyter with Docker and install XGBoost

Let’s run our data science Docker container, as described here: Run Jupyter notebooks with Docker. Run the command:

docker run --rm -p 8888:8888 --name myds1 jupyter/scipy-notebook:latest

This Docker image does not contain XGBoost, so let’s install it manually. After the container has started, run the following command to install XGBoost:

docker exec -it myds1 pip install xgboost

Load the Iris dataset and split it into train and test sets

Import the necessary modules, load the data and split the dataset:

import xgboost
from sklearn import datasets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed from scikit-learn

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
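As a quick check on the split, the sketch below prints the resulting shapes and the per-class counts in each part. Iris has 150 samples with 4 features, so `test_size=0.2` leaves 120 rows for training and 30 for testing. The variable names `train_counts` and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Count how many samples of each of the 3 classes landed in each part.
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

The split is random, so the classes are not necessarily balanced in the test set. Passing `stratify=y` to `train_test_split` would keep the class proportions equal in both parts, if that matters for your experiment.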

Train XGBoost, predict data and compare

XGBoost's native training API needs the NumPy arrays to be loaded into its special DMatrix format:

dtrain = xgboost.DMatrix(X_train, label=y_train)
dtest = xgboost.DMatrix(X_test, label=y_test)

Then let’s set up XGBoost params:

param = {
    'max_depth': 3,                 # the maximum depth of each tree
    'eta': 0.3,                     # the learning rate (step size shrinkage) for each iteration
    'verbosity': 0,                 # logging mode - quiet (replaces the deprecated 'silent' option)
    'objective': 'multi:softmax',   # multiclass classification using the softmax objective
    'num_class': 3                  # the number of classes that exist in this dataset
}
num_round = 20  # the number of training iterations

More details on these parameters can be found here: XGBoost Parameters.

Now train the model. If you want to see what the trained model looks like, dump it to a text file and take a look at it.

bst = xgboost.train(param, dtrain, num_round)

Then predict on the test data and calculate the accuracy score:

preds = bst.predict(dtest)

from sklearn import metrics
acc = metrics.accuracy_score(y_test, preds)

That is it!