Provost’s 9 main model categories (s = supervised, u = unsupervised)
clustering / segmentation (u)
classification (s)
regression (s)
similarity matching (s, u)
co-occurrence grouping (u)
profiling (u)
link prediction (s, u)
data reduction (s, u)
causal modeling
linear discriminant
a hyperplanar discriminant for a binary target variable will split the attribute phase space into 2 regions
fitting:
* we can apply entropy measure to the two resulting segments, to check for information gain (weighting each side by the number of instances in it)
* we can check the means of each of the classes along the hyperplane normal, and seek maximum inter-mean separation
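The second fitting idea can be sketched with toy numbers: take the inter-mean difference as the hyperplane normal, and threshold at the midpoint of the projected class means (all data and helper names below are illustrative, not from any particular library):

```python
# Minimal mean-separation linear discriminant sketch (toy 2-D data).
# Direction w = difference of class means; threshold = midpoint of the
# two means projected onto w.

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class_0 = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9]]   # toy instances, class 0
class_1 = [[3.0, 3.1], [2.9, 3.3], [3.2, 2.8]]   # toy instances, class 1

m0, m1 = mean(class_0), mean(class_1)
w = [b - a for a, b in zip(m0, m1)]               # hyperplane normal
threshold = (dot(w, m0) + dot(w, m1)) / 2         # midpoint along the normal

def predict(x):
    return 1 if dot(w, x) > threshold else 0
```

A full linear discriminant analysis would also weight by the within-class covariance; this sketch keeps only the inter-mean separation idea.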
probability estimation tree
a classification tree that may be considered a hybrid between classification and regression models
leaves are annotated with a category value, and a probability
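The leaf annotation can be sketched as a count-based estimate; the Laplace correction used here is one common choice for the probability, not the only one:

```python
# Sketch of a probability estimation tree leaf (hypothetical counts).
# A leaf that received n_pos positives out of n training instances is
# annotated with the majority class and the Laplace-corrected
# probability (n_pos + 1) / (n + 2).

def leaf_annotation(n_pos, n):
    p = (n_pos + 1) / (n + 2)          # Laplace correction
    label = "pos" if p >= 0.5 else "neg"
    return label, p

# e.g. a leaf holding 8 positives of 10 instances:
label, p = leaf_annotation(8, 10)      # ("pos", 0.75)
```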
decision tree (general)
for regression or classification
tunable via maximum depth, minimum instances per leaf, and pruning strictness
support vector machines (linear)
simplest case involves a hyperplanar fitting surface, in combination with L2 regularization, and possibly a hinge loss function
via the kernel trick, more sophisticated fitting surfaces can be used
support vectors consist of a subset of the training instances used to fit the model
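The simplest case can be sketched as hinge loss plus an L2 penalty, fit by subgradient descent on made-up 1-D data (step size, penalty, and epoch count below are arbitrary):

```python
# Linear SVM sketch: hinge loss + L2 regularization, subgradient descent.
# Labels are in {-1, +1}; all data and hyperparameters are illustrative.

def fit_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (w * x + b)
            if margin < 1:              # inside margin: hinge subgradient
                w += lr * (y * x - lam * w)
                b += lr * y
            else:                       # outside margin: only L2 shrinkage
                w -= lr * lam * w
    return w, b

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [-1, -1, -1, 1, 1, 1]
w, b = fit_svm(xs, ys)
```

The instances whose margins end up at or inside 1 are the support vectors; only they contribute hinge subgradients at the solution.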
logistic regression
aka logit model; typically used for modeling binary classification probabilities
in simplest form: p(y=1|x) = 1 / (1 + e^−(w·x + b))
for a special logistic loss function, the loss surface is convex, allowing steepest descent
can be regularized on coefficients of linear kernel, via L1 and/or L2
offers a linear model’s interpretability, with a linear model’s drawbacks (eg collinearities)
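The simplest form can be sketched with one predictor, fit by gradient descent on the convex log loss (the data and learning rate below are illustrative):

```python
import math

# Logistic regression sketch: single predictor, gradient descent on the
# log loss. Labels are in {0, 1}; toy data throughout.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logit(xs, ys, lr=0.5, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y    # gradient of log loss
            w -= lr * err * x
            b -= lr * err
    return w, b

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logit(xs, ys)
p = sigmoid(w * 2.0 + b)                    # estimated P(y=1 | x=2)
```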
hierarchical clustering
under some (cluster) metric, find the two closest clusters, and merge them; iterate
the cluster metric is called the linkage function
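The merge loop can be sketched with single linkage (minimum pairwise distance) as the linkage function, on toy 1-D points:

```python
# Agglomerative clustering sketch, single linkage, 1-D toy points:
# repeatedly find the two closest clusters and merge them.

def single_linkage(c1, c2):
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage value
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)   # merge, then iterate
    return clusters

clusters = agglomerate([0.0, 0.1, 0.2, 5.0, 5.1], k=2)
```

Swapping `single_linkage` for a maximum (complete linkage) or mean distance gives the other standard linkage functions.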
centroid clustering
each cluster is represented by its cluster center, or centroid
k-means method
choose starting centers for k clusters in the predictor phase space, then iterate (can be tuned over different k):
* assign each instance to the cluster whose center it’s closest to
* calculate the centroid of each of the resulting clusters
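The two alternating steps can be sketched on toy 1-D data (empty-cluster handling and the convergence check are omitted for brevity):

```python
# k-means sketch: alternate assignment (nearest center) and update
# (recompute centroids). Data and starting centers are illustrative.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:                       # assignment step
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) for c in clusters]   # update step
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.2, 8.8],
                           centers=[0.0, 5.0])
```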
naive Bayes
for classification
generative; features are treated as giving evidence for or against target-variable values; each instance gets its own pdf
allows instant updating, with new data (Bayesian property)
relies on the class (c) as the prior, with the instance as the conditioning event: p(C=c|E) = p(E|C=c)p(C=c) / p(E)
probability of class C=c, given instance E, where the e_i are individual instance-predictor values or ranges (assumed conditionally independent given c):
p(c|E) = p(e_1|c)…p(e_k|c)p(c) / p(E)
further simplified (with p(E) decomposed), to put in terms of predictor lift: p(c|E) = p(e_1|c)…p(e_k|c)p(c) / p(e_1)…p(e_k)
remove near- and zero-variance predictors; careful of few-unique-value predictors (give weird pdfs)
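A sketch over categorical predictors: the class-conditional probabilities p(e_i|c) are estimated by counting, Laplace-smoothed, and multiplied with the class prior p(c); comparing the numerators across classes decides the prediction (the toy data and smoothing constant are illustrative):

```python
from collections import Counter

# Naive Bayes sketch for categorical predictors (toy weather data).

def train(instances, labels):
    priors = Counter(labels)               # class counts for p(c)
    cond = {}                              # (feature_index, value, class) -> count
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            cond[(i, v, c)] = cond.get((i, v, c), 0) + 1
    return priors, cond

def score(x, c, priors, cond, n):
    p = priors[c] / n                      # prior p(c)
    for i, v in enumerate(x):
        # Laplace-smoothed p(e_i | c): avoids zero-probability products
        p *= (cond.get((i, v, c), 0) + 1) / (priors[c] + 2)
    return p

instances = [("sunny", "hot"), ("sunny", "mild"),
             ("rainy", "mild"), ("rainy", "cool")]
labels = ["play", "play", "stay", "stay"]
priors, cond = train(instances, labels)

def predict(x):
    return max(priors, key=lambda c: score(x, c, priors, cond, len(labels)))
```

Updating with new data only means incrementing counts, which is the instant-updating (Bayesian) property noted above.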
non-parametric regression models
generalized linear models
parsimonious model
a model that accomplishes the desired level of explanation or prediction with as few predictor variables as possible
linear regression (lr)
PCR (principal component regression) (lr)
PLS (partial least squares) (lr)
penalized / regularized linear regression models (lr)
three main flavors
* lasso: applies an L1 norm penalty on OLS regression coefficients; has the potential to fully remove predictors
* ridge: applies an L2 norm penalty on OLS regression coefficients
* elasticnet: combines L1 and L2 penalties
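For a single centered predictor the penalties have closed forms, which makes the shrinkage visible: ridge inflates the denominator of the OLS slope, while the lasso soft-thresholds the numerator and can zero the coefficient entirely (toy numbers throughout):

```python
# 1-D penalized regression sketch on centered data (y = 2x exactly).
# OLS slope = Sxy/Sxx; ridge slope = Sxy/(Sxx + lam);
# lasso slope = soft_threshold(Sxy, lam)/Sxx, which can hit 0.

def slope(xs, ys, lam=0.0):
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)                # lam = 0 recovers OLS

def lasso_slope(xs, ys, lam):
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(sxy) - lam, 0.0)       # soft-thresholding
    return (1 if sxy >= 0 else -1) * shrunk / sxx

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]            # centered predictor
ys = [-4.0, -2.0, 0.0, 2.0, 4.0]            # y = 2x

ols = slope(xs, ys)                         # 2.0
ridge = slope(xs, ys, lam=10.0)             # 1.0, shrunk toward zero
removed = lasso_slope(xs, ys, lam=25.0)     # 0.0, predictor fully removed
```

An elastic-net version would combine the two: soft-threshold the numerator by the L1 penalty, then divide by Sxx plus the L2 penalty.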
neural networks (nrc)
MARS (nr)
SVM (nr)
KNN (nrc)
decision trees (general / CART) (nrc)
conditional inference trees (nr)
regression model trees (nr)
rule based models (nrc)