absolute error (aka Laplacian error): a loss function that can be used to measure error for a regression model [mean(abs(actual - prediction))]
argument: an input value for a function
bagging: an ensemble model, where a bootstrap sample of the training data is used to construct each member of the ensemble [the term "bag" is a contraction of "bootstrap aggregation"]
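As an illustration, a minimal sketch of bootstrap aggregation for a regression model, written in Python/NumPy (the glossary's bracketed snippets are R). The data, the seed, and the choice of a simple linear fit for each ensemble member are all assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up training data with a known relationship: y = 2x + 1 + noise
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)

def bagged_predict(x_train, y_train, x_new, n_bags=25):
    """Average the predictions of models fit on bootstrap samples."""
    preds = []
    for _ in range(n_bags):
        # bootstrap sample: draw n observations with replacement
        idx = rng.integers(0, len(x_train), len(x_train))
        slope, intercept = np.polyfit(x_train[idx], y_train[idx], 1)
        preds.append(slope * x_new + intercept)
    return np.mean(preds, axis=0)

print(bagged_predict(x, y, np.array([5.0])))  # should be close to 2*5 + 1 = 11
```

Averaging over bootstrap fits mainly reduces the variance term of test error; each member sees a slightly different sample, and their errors partially cancel.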
Bayes classifier: a classifier that uses prior, likelihood, and evidence values to estimate a posterior probability, where the predicted class is the class that has the largest posterior probability
Bayes decision boundary: the set of input values that partitions the input space into two or more distinct regions (where the maximum posterior probability is equal for two or more classes)
Bayes error rate: the error rate for a Bayes classifier
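A minimal Python sketch of the prior/likelihood/evidence arithmetic behind the three entries above. The class priors and Gaussian likelihood parameters are assumed values for the demo, not from any real data set:

```python
import math

# hypothetical two-class problem (assumed priors and Gaussian likelihoods)
priors = {"A": 0.6, "B": 0.4}
means = {"A": 0.0, "B": 2.0}
sd = 1.0

def likelihood(x, mean):
    """Gaussian density Pr(X = x | class)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(x):
    """Posterior for each class: prior * likelihood / evidence."""
    unnorm = {k: priors[k] * likelihood(x, means[k]) for k in priors}
    evidence = sum(unnorm.values())  # Pr(X = x)
    return {k: v / evidence for k, v in unnorm.items()}

def bayes_classify(x):
    """Predict the class with the largest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)

print(bayes_classify(0.0), bayes_classify(3.0))
```

The Bayes decision boundary is the input value where the two posteriors are equal; inputs on either side of it are assigned to different classes.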
bias: for Y = f(X) + epsilon, the difference between f(x) and the expected prediction of the model (the error introduced by approximating f with a simpler model)
bias-variance trade-off: modifying the complexity of a model to minimize overall test error (where overall test error is composed of both bias and variance terms); decreasing bias increases variance, while decreasing variance increases bias
binary: a qualitative variable with two possible output values (e.g. positive or negative, "1" or "0", "1" or "-1")
boosting: an ensemble model, where each model added to the ensemble reduces the error produced by existing ensemble members (improving the performance of the overall ensemble)
boxplot: a one-dimensional graph that uses a box, whiskers, and outlier points to characterize the distribution of values for a quantitative variable [the lower bound of the box is the 25th percentile; the middle of the box is the 50th percentile (median); the upper bound of the box is the 75th percentile; the length of the whisker is a multiple of the interquartile range (75th - 25th percentile)]
categorical: another name for qualitative
class: a label for a group
classification: predicting a qualitative output value
cluster analysis: constructing a model to map input observations to groups (e.g. customer segmentation)
conditional probability: a posterior probability (a probability conditioned on an "evidence" expression); e.g. Pr(Y = j | X = x) [read as "the probability that the value of variable Y is equal to j given that the value of variable X is equal to x"]
contour plot: a two-dimensional graph of three-dimensional data, where a (contour) line indicates that the connected points have the same value for the third dimension
cross validation: a form of resampling used for model selection
data frame: a data set organized into rows (observations) and columns (variables); may contain both quantitative and qualitative variables
degrees of freedom: a quantity that summarizes the flexibility of a model
dependent variable: another name for an output variable
endogenous variable: another name for an output variable
error rate: the proportion of classification model predictions that are incorrect
error term: in Y = f(X) + epsilon, this is the epsilon (irreducible error)
exogenous variable: another name for an input variable
expected test Mean Squared Error (MSE): another name for test MSE
expected value: the average value for a variable [mean(x)]
feature: another name for an input variable
fit: another name for constructing (training) a model
flexible: able to model non-linear input-to-output mappings
function: a mapping from input values to output values
generalized additive model: a model where the output is the sum of component functions, one (possibly non-linear) function per input variable
heatmap: a two-dimensional graph of three-dimensional data, where points with the same color have the same value for the third dimension (e.g. blue for smaller values, red for larger values)
histogram: a bar chart for a quantitative variable, where the width of the bar identifies an interval for values and the height of the bar identifies the quantity of observations within the interval
independent variable: another name for an input variable
indicator variable: a variable that takes on the value 1 if an expression is true and 0 if the expression is false
input variable: a variable that is passed to a model to produce output
irreducible error: error that cannot be reduced by improving a predictive model
k-nearest neighbors: a model that makes a prediction for an observation using the "k" observations in the training data which are closest to it (e.g. the majority class for classification, the average output value for regression)
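A minimal Python/NumPy sketch of k-nearest-neighbors classification; the toy data and the choice of Euclidean distance are assumptions for the demo:

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Predict the majority class among the k nearest training observations."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# made-up data: two well-separated clusters
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(X, y, np.array([0.8])))  # → 0
```

Smaller k gives a more flexible (higher-variance) fit; larger k gives a smoother (higher-bias) fit, which ties this entry back to the bias-variance trade-off.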
least absolute shrinkage and selection operator (lasso): a form of regularization using an l1 penalty on model coefficients
least squares: a simple algorithm for constructing a linear regression model [solve(t(X) %*% X, diag(ncol(X))) %*% t(X) %*% y]
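The bracketed R snippet above solves the normal equations. The same computation in Python/NumPy, on made-up data with a known linear relationship (y = 1 + 2x, no noise, so the recovered coefficients are exact):

```python
import numpy as np

# toy data with a known linear relationship: y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
y = 1.0 + 2.0 * x

# normal equations: beta = (X'X)^-1 X'y, same as the R snippet
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [1. 2.]
```

In practice `np.linalg.lstsq` (or R's `lm`) is preferred over forming X'X explicitly, for numerical stability.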
linear model: a vector of coefficients, used to map an input vector to an output value via an inner product operation
logistic regression: a generalized linear model used for classification, where the linear model predicts the log odds of class membership (which is then mapped to the probability of class membership, using the logistic function)
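A minimal Python sketch of the log-odds-to-probability mapping described above; the linear model coefficients (-1 and 2) are hypothetical, not fitted to any data:

```python
import math

def logistic(log_odds):
    """Map log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def prob_positive(x):
    """A hypothetical fitted linear model for the log odds: -1 + 2x."""
    log_odds = -1.0 + 2.0 * x
    return logistic(log_odds)

print(prob_positive(0.5))  # log odds = 0, so probability = 0.5
```

The decision boundary of such a model is linear in the inputs: it is the set of x where the log odds equal zero (probability 0.5).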
machine learning: using data to create a model that maps one or more input values to one or more output values
matrix: an array with two indices
mean squared error (aka Gaussian error): a loss function that can be used to measure error for a regression model [mean((actual - prediction)^2)]
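The two bracketed loss formulas in this glossary (squared error here, absolute error under "absolute error") side by side in Python/NumPy, on made-up actual and predicted values:

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0])
prediction = np.array([1.5, 2.0, 2.0])

mse = np.mean((actual - prediction) ** 2)   # squared (Gaussian) error
mae = np.mean(np.abs(actual - prediction))  # absolute (Laplacian) error
print(mse, mae)
```

Squaring penalizes large errors more heavily than the absolute value does, so MSE is more sensitive to outliers than MAE.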
noise: another name for irreducible error
non-parametric: a model where the size of the model is variable; i.e. the "size" of the model can grow with the size of the training data
output variable: a variable that is produced by a model
overfitting: fitting "noise" in the training data; i.e. decreasing training set error while increasing testing set error
parametric: a model where the size of the model is fixed; i.e. the "size" of the model does not grow with the size of the training data
predictor: another name for an input variable
qualitative: a categorical value that identifies a quality (e.g. gender)
quantitative: a numeric value that measures quantity (e.g. height)
reducible error: error that can be reduced by improving a predictive model (e.g. adding useful input variables)
regression: predicting a quantitative output value
response: another name for an output variable
scalar: a single numeric value
scatterplot: a two-dimensional graph that plots points for coordinates observed in a data set
scatterplot matrix: a matrix that consists of the pairwise scatterplots for a set of quantitative variables
semi-supervised learning: using both labeled and unlabeled observations to construct a model (output values are provided for only a subset of the training data)
smoothing spline: a spline regression model that supports non-linear regression, by penalizing model complexity
supervised: a learning algorithm is provided both input and output values for training the model (e.g. regression, classification)
support vector machine: a type of classification or regression model, where prototypical observations (support vectors) from the training data are used to make predictions
systematic: in Y = f(X) + epsilon, this is the f()
target: another name for an output variable
tensor: an array with more than two indices
test data: data that is used to evaluate a model, but was not used to construct the model
test Mean Squared Error (MSE): MSE for test data
testing error: the error rate for the test data for a classification model
thin-plate spline: a form of smoothing spline that supports non-linear regression
training: the process of constructing a model
training data: data that is used to construct a model
training error: the error rate for the training data for a classification model
training Mean Squared Error (MSE): MSE for the training data
unsupervised: a learning algorithm is provided only input values (not output values) for training the model (e.g. dimensionality reduction, clustering)
variable: a symbol that represents an attribute with more than one possible value
variance: the expected value of the squared deviation from the mean for a variable [mean((x - mean(x))^2)]
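The bracketed formula checked numerically in Python/NumPy (the values of x are made up for the demo):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# matches the bracketed formula: mean((x - mean(x))^2)
variance = np.mean((x - np.mean(x)) ** 2)
print(variance)  # → 1.25
```

Note this is the population variance; the sample variance divides by n - 1 instead of n (R's `var`, or `np.var(x, ddof=1)` in NumPy).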
vector: an array with one index
workspace: the set of variables currently defined for your R environment