Models
Python models
scikit-learn has on the order of 100 to 200 models (more generally called "estimators"), split into three categories:
- Supervised Learning (linear regression, support vector machines, random forest, neural nets, ...)
- Unsupervised Learning (clustering, PCA, mixture models, manifold learning, ...)
- Dataset Transformation (preprocessing, text feature extraction, one-hot encoding, ...)
All of these estimators work with ScikitLearn.jl. They are imported with @sk_import. For example, here is how to import and fit sklearn.linear_model.LogisticRegression:
julia> using ScikitLearn, Random
julia> Random.seed!(11);  # ensures reproducibility
julia> X = rand(20,3); y = rand([true, false], 20);
julia> @sk_import linear_model: LogisticRegression
PyObject <class 'sklearn.linear_model._logistic.LogisticRegression'>
julia> using ScikitLearn.CrossValidation: train_test_split
julia> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42);
julia> log_reg = fit!(LogisticRegression(penalty="l2"), X_train, y_train);
julia> predict(log_reg, X_test)
5-element Array{Bool,1}:
0
0
0
0
0
Reminder: ?LogisticRegression contains a lot of information about the model parameters.
Installation and importing Python models
Importing the Python models requires Python 3.x with numpy and the scikit-learn library. This is easiest to get through Conda.jl, which is installed automatically as a dependency of ScikitLearn.jl. Calling @sk_import linear_model: LinearRegression should automatically install everything. You can also install scikit-learn manually with Conda.add("scikit-learn"). If you run into other issues, please refer to the PyCall.jl documentation, or post an issue on the ScikitLearn.jl issue tracker.
Julia models
Julia models are hosted in other packages and need to be installed separately with Pkg.add, or Pkg.develop to track the latest development version (sometimes necessary). They all implement the common API and document their hyperparameters in their ?docstrings.
Unfortunately, some packages export a fit! function that conflicts with ScikitLearn's fit!. This can be fixed by adding this line:
using ScikitLearn: fit!, predict
ScikitLearn in-built models
ScikitLearn.Models.LinearRegression() implements linear regression using Julia's backslash operator \ (a least-squares solve), optimized for speed. See ?LinearRegression for fitting options.
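For instance, a minimal sketch reusing the train/test split from above (the Bool labels are converted to Float64, since linear regression expects numeric targets):
julia> lm = fit!(ScikitLearn.Models.LinearRegression(), X_train, Float64.(y_train));
julia> predict(lm, X_test);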
GaussianMixtures.jl
julia> using GaussianMixtures: GMM  # remember to install the package first
julia> gmm = fit!(GMM(n_components=3, kind=:diag), X_train);
[ Info: Initializing GMM, 3 Gaussians diag covariance 3 dimensions using 15 data points
Iters objv objv-change | affected
-------------------------------------------------------------
0 1.462249e+00
1 1.041033e+00 -4.212161e-01 | 2
2 9.589243e-01 -8.210827e-02 | 2
3 9.397430e-01 -1.918135e-02 | 0
4 9.397430e-01 0.000000e+00 | 0
K-means converged with 4 iterations (objv = 0.9397430000827904)
┌ Info: K-means with 15 data points using 4 iterations
└ 1.3 data points per parameter
julia> predict_proba(gmm, X_test)
5×3 Array{Float64,2}:
1.37946e-7 5.58899e-9 1.0
0.986895 1.98749e-10 0.0131053
0.998037 1.00296e-15 0.00196321
2.66238e-11 0.041746 0.958254
0.999984 4.05443e-6 1.16204e-5
Documentation at GaussianMixtures.jl. Example: density estimation
GaussianProcesses.jl
julia> using GaussianProcesses: GPE, MeanZero, SE  # remember to install the package first
julia> gp = fit!(GPE(; mean=MeanZero(), kernel=SE(0.0, 0.0), logNoise=-1e8), X_train, Float64.(y_train))
GP Exact object:
Dim = 3
Number of observations = 15
Mean function:
Type: MeanZero, Params: Float64[]
Kernel:
Type: GaussianProcesses.SEIso{Float64}, Params: [0.0, 0.0]
Input observations =
[0.376913304113047 0.5630896022795546 … 0.31598998347835017 0.5828199336036355; 0.50060556533132 0.4124482236437548 … 0.6750380496244157 0.6147514739028759; 0.5142063690337368 0.4774433498612982 … 0.9823652195180261 0.21010382988916376]
Output observations = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
Variance of observation noise = 0.0
Marginal Log-Likelihood = -749.473
julia> predict(gp, X_test)
5-element Array{Float64,1}:
2.1522493172861914
1.298965158588743
0.8142639915894847
-0.7287701449385509
0.749523596826748
Documentation at GaussianProcesses.jl and in the ?GPE docstring. Example: Gaussian Processes.
Gaussian Processes have a lot of hyperparameters; see get_params(gp) for a list. They can all be tuned, e.g. with grid search, as sketched below.
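A rough sketch with ScikitLearn's own GridSearchCV (the candidate logNoise values are illustrative only, and this assumes GaussianProcesses.jl's ScikitLearnBase support extends to cross-validated scoring):
julia> using ScikitLearn.GridSearch: GridSearchCV
julia> gp_search = GridSearchCV(GPE(; mean=MeanZero(), kernel=SE(0.0, 0.0), logNoise=0.0), Dict(:logNoise => [-4.0, -2.0, 0.0]));
julia> fit!(gp_search, X_train, Float64.(y_train));
julia> gp_search.best_params_  # the best cross-validated setting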
DecisionTree.jl
- DecisionTreeClassifier
- DecisionTreeRegressor
- RandomForestClassifier
- RandomForestRegressor
- AdaBoostStumpClassifier
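For example, a minimal sketch with the toy data from above (importing only the estimator name also sidesteps the fit! name clash mentioned earlier; max_depth is just one of the hyperparameters documented in ?DecisionTreeClassifier):
julia> using DecisionTree: DecisionTreeClassifier  # remember to install the package first
julia> tree = fit!(DecisionTreeClassifier(max_depth=2), X_train, y_train);
julia> predict(tree, X_test);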
Documentation at DecisionTree.jl. Examples: Classifier Comparison, Decision Tree Regression notebooks.
LowRankModels.jl
- SkGLRM: Generalized Low Rank Model
- PCA: Principal Component Analysis
- QPCA: Quadratically Regularized PCA
- RPCA: Robust PCA
- NNMF: Non-negative matrix factorization
- KMeans: The k-means algorithm
These algorithms are all special cases of the Generalized Low Rank Model algorithm, whose main goal is to provide flexible loss functions and regularization for heterogeneous data. The specialized algorithms generally converge faster than the generic solver. Documentation at LowRankModels.jl. Example: KMeans Digit Classifier.
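A hedged sketch of the transformer-style API (the k keyword for the target rank is an assumption here; consult the LowRankModels.jl documentation for the exact signatures):
julia> using LowRankModels  # remember to install the package first
julia> pca = LowRankModels.PCA(k=2);  # assumed keyword: target rank of the decomposition
julia> fit_transform!(pca, X_train);  # fit the model and project the data onto 2 components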
Contributing new models
To make your Julia model compatible with ScikitLearn.jl, you need to implement the scikit-learn interface. See ScikitLearnBase.jl for the full interface specification.
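As a rough sketch of what that involves (MeanRegressor is a made-up toy estimator; the real interface includes further optional methods such as transform and score):
import ScikitLearnBase

# Toy estimator: predicts the (optionally shrunk) mean of the training targets.
mutable struct MeanRegressor
    shrinkage::Float64   # hyperparameter
    mean::Float64        # learned parameter
    MeanRegressor(; shrinkage=0.0) = new(shrinkage, 0.0)
end

# Let ScikitLearn.jl get/set/clone this model's hyperparameters
ScikitLearnBase.@declare_hyperparameters(MeanRegressor, [:shrinkage])
ScikitLearnBase.is_classifier(::MeanRegressor) = false

function ScikitLearnBase.fit!(model::MeanRegressor, X, y)
    model.mean = (1 - model.shrinkage) * sum(y) / length(y)
    return model   # fit! must return the estimator
end

ScikitLearnBase.predict(model::MeanRegressor, X) = fill(model.mean, size(X, 1))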