Quick start guide
Quick Example
Let's build a classifier for the classic iris dataset. If you don't have RDatasets installed, add it with Pkg.add("RDatasets").
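This guide also uses the ScikitLearn, PyPlot, JLD and PyCallJLD packages further down; one way to install everything up front (a one-time setup, assuming the registered package names) is:
julia> using Pkg
julia> Pkg.add(["RDatasets", "ScikitLearn", "PyPlot", "JLD", "PyCallJLD"])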
julia> using RDatasets: dataset
julia> iris = dataset("datasets", "iris");
julia> first(iris, 5)
5×5 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Cat… │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
ScikitLearn.jl expects arrays, but DataFrames can also be used; see the corresponding section of the manual.
julia> X = convert(Array, iris[!, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]) # iris[!, cols] selects the columns without copying; convert then materializes a new Matrix
150×4 Array{Float64,2}:
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5.0 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
⋮
6.9 3.1 5.1 2.3
5.8 2.7 5.1 1.9
6.8 3.2 5.9 2.3
6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8
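With recent versions of DataFrames.jl, the same conversion can also be written with the Matrix constructor (equivalent here, though the form you need may depend on your DataFrames version):
julia> X = Matrix(iris[!, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]);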
julia> y = convert(Array, iris[!, :Species])
150-element Array{String,1}:
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
⋮
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
Next, we load the LogisticRegression model from scikit-learn. This requires the Python scikit-learn library to be installed; see the Installation section.
julia> using ScikitLearn
julia> @sk_import linear_model: LogisticRegression
PyObject <class 'sklearn.linear_model._logistic.LogisticRegression'>
Every model's constructor accepts hyperparameters (such as the regularization strength, whether to fit the intercept, the penalty type, etc.) as keyword arguments. Check out ?LogisticRegression for details.
julia> model = LogisticRegression(fit_intercept=true, max_iter = 200)
PyObject LogisticRegression(max_iter=200)
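Other scikit-learn hyperparameters are passed the same way; for instance, a hypothetical variant (not used below) with a different regularization strength:
julia> model2 = LogisticRegression(penalty="l2", C=0.5);  # same constructor, different keyword arguments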
Then we train the model and evaluate its accuracy on the training set:
julia> fit!(model, X, y);
julia> accuracy = score(model, X, y)
0.9733333333333334
julia> println("accuracy: $accuracy")
accuracy: 0.9733333333333334
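The fitted model can also be used on new observations via predict; the first three iris rows are all setosa, so these predictions should come back accordingly:
julia> predict(model, X[1:3, :])  # expected: ["setosa", "setosa", "setosa"]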
Cross-validation
This will train five models on five train/test splits of X and y, and return the test-set accuracy of each:
julia> using ScikitLearn.CrossValidation: cross_val_score
julia> cross_val_score(LogisticRegression(max_iter=130), X, y; cv=5) # 5-fold
5-element Array{Float64,1}:
0.9666666666666667
1.0
0.9333333333333333
0.9666666666666667
1.0
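To collapse the five scores into a single estimate, bind the result and summarize it with Statistics:
julia> using Statistics
julia> scores = cross_val_score(LogisticRegression(max_iter=130), X, y; cv=5);
julia> mean(scores), std(scores)  # e.g. roughly (0.973, 0.028) for the scores above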
See this tutorial for more information.
Hyperparameter tuning
LogisticRegression has a regularization-strength parameter C (smaller is stronger). We can use a grid search to find the optimal value of C. GridSearchCV will try all values of C in 0.1:0.1:2.0 and return the one with the highest cross-validation performance.
julia> using ScikitLearn.GridSearch: GridSearchCV
julia> gridsearch = GridSearchCV(LogisticRegression(max_iter=200), Dict(:C => 0.1:0.1:2.0))
GridSearchCV
estimator: PyCall.PyObject
param_grid: Dict{Symbol,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}
scoring: Nothing nothing
loss_func: Nothing nothing
score_func: Nothing nothing
fit_params: Dict{Any,Any}
n_jobs: Int64 1
iid: Bool true
refit: Bool true
cv: Nothing nothing
verbose: Int64 0
error_score: String "raise"
scorer_: Nothing nothing
best_params_: Nothing nothing
best_score_: Nothing nothing
grid_scores_: Nothing nothing
best_estimator_: Nothing nothing
julia> fit!(gridsearch, X, y);
julia> println("Best parameters: $(gridsearch.best_params_)")
Best parameters: Dict{Symbol,Any}(:C => 0.6)
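Since refit=true (the default, visible in the printout above), the model retrained on the full data with the best C is stored in the best_estimator_ field and can be used like any other model:
julia> score(gridsearch.best_estimator_, X, y)  # training-set accuracy of the refitted model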
Finally, we plot the cross-validation accuracy against C:
julia> using PyPlot, Statistics
julia> plot([cv_res.parameters[:C] for cv_res in gridsearch.grid_scores_],
[mean(cv_res.cv_validation_scores) for cv_res in gridsearch.grid_scores_])
1-element Array{PyCall.PyObject,1}:
PyObject <matplotlib.lines.Line2D object at 0x7f440fce0850>
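As usual with PyPlot, the figure can be annotated before display; labeling the axes makes the curve easier to read:
julia> xlabel("C"); ylabel("mean cross-validation accuracy");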
Saving the model to disk
Both Python and Julia models can be saved to disk:
julia> import JLD, PyCallJLD
julia> JLD.save("my_model.jld", "model", model)
julia> model = JLD.load("my_model.jld", "model") # Load it back
┌ Warning: type PyCallJLD.PyObjectSerialization not present in workspace; reconstructing
└ @ JLD ~/.julia/packages/JLD/uVJmd/src/jld_types.jl:722
JLD.var"##PyCallJLD.PyObjectSerialization#253"(UInt8[0x80, 0x03, 0x63, 0x73, 0x6b, 0x6c, 0x65, 0x61, 0x72, 0x6e … 0x2e, 0x32, 0x32, 0x2e, 0x31, 0x71, 0x43, 0x75, 0x62, 0x2e])