absolute error (aka Laplacian error) | a loss function that can be used to measure error for a regression model [mean(abs(actual - prediction))] |
argument | an input value for a function |
bagging | an ensemble model, where a bootstrap sample of the training data is used to construct each member of the ensemble and the members' predictions are aggregated (e.g. averaged or combined by majority vote) [the term "bagging" is a contraction of "bootstrap aggregating"] |
Bayes classifier | a classifier that uses prior, likelihood, and evidence values to estimate a posterior probability, where the predicted class is the class that has the largest posterior probability |
Bayes decision boundary | the set of input values where the maximum posterior probability is shared by two or more classes; it partitions the input space into two or more distinct regions, one for each predicted class |
Bayes error rate | the error rate for a Bayes classifier |
bias | for Y = f(X) + epsilon, the difference between the expected (average) prediction of a model at x and f(x); i.e. the error introduced by approximating f() with a simpler model |
bias-variance trade-off | modifying the complexity of a model to minimize overall test error (where expected test error is composed of squared bias, variance, and irreducible error terms); increasing model flexibility typically decreases bias but increases variance, and vice versa |
binary | a qualitative variable with two possible output values (e.g. positive or negative, "1" or "0", "1" or "-1") |
boosting | an ensemble model, where each model added to the ensemble reduces the error produced by existing ensemble members (improving the performance of the overall ensemble) |
boxplot | a one-dimensional graph that uses a box, whiskers, and outlier points to characterize the distribution of values for a quantitative variable [the lower bound of the box is the 25th percentile; the line inside the box is the 50th percentile (median); the upper bound of the box is the 75th percentile; each whisker extends to the most extreme observation within a multiple (commonly 1.5) of the interquartile range (75th - 25th percentile) from the box; observations beyond the whiskers are drawn as outlier points] |
categorical | another name for qualitative |
class | a label for a group |
classification | predicting a qualitative output value |
cluster analysis | constructing a model to map input observations to groups (e.g. customer segmentation) |
conditional probability | a probability conditioned on an "evidence" expression, such as a posterior probability; e.g. Pr(Y = j \| X = x) [read as "the probability that the value of variable Y is equal to j given that the value of variable X is equal to x"] |
contour plot | a two-dimensional graph of three-dimensional data, where each (contour) line connects points that have the same value for the third dimension |
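cross validation | a form of resampling where the training data is repeatedly split into a subset used to fit the model and a held-out subset used to estimate test error; used for model assessment and model selection |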
data frame | a data set organized into rows (observations) and columns (variables); may contain both quantitative and qualitative variables |
degrees of freedom | a quantity that summarizes the flexibility of a model |
dependent variable | another name for an output variable |
endogenous variable | another name for an output variable |
error rate | the proportion of classification model predictions that are incorrect |
error term | in Y = f(X) + epsilon, this is the epsilon (irreducible error) |
exogenous variable | another name for an input variable |
expected test Mean Squared Error (MSE) | another name for test MSE |
expected value | the probability-weighted average value for a variable [estimated from a sample by mean(x)] |
feature | another name for an input variable |
fit | another name for constructing (training) a model |
flexible | able to model non-linear input-to-output mappings |
function | a mapping from input values to output values |
generalized additive model | a model where the output is the sum of (possibly non-linear) functions, one for each input variable |
heatmap | a two-dimensional graph of three-dimensional data, where points with the same color have the same value for the third dimension (e.g. blue for smaller values, red for larger values) |
histogram | a bar chart for a quantitative variable, where the width of the bar identifies an interval for values and the height of the bar identifies the quantity of observations within the interval |
independent variable | another name for an input variable |
indicator variable | a variable that takes on the value 1 if an expression is true and 0 if the expression is false |
input variable | a variable that is passed to a model to produce output |
irreducible error | error that cannot be reduced by improving a predictive model |
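k-nearest neighbors | the "k" observations in the training data that are closest to an observation from the test data; their output values are combined (majority vote for classification, average for regression) to make a prediction |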
least absolute shrinkage and selection operator (lasso) | a form of regularization using an l1 penalty on model coefficients |
least squares | a method for constructing a linear regression model by minimizing the sum of squared residuals [solve(t(X) %*% X, t(X) %*% y)] |
linear model | a vector of coefficients, used to map an input vector to an output value via an inner product operation |
logistic regression | a generalized linear model used for classification, where the linear model predicts the log odds of class membership (which is then mapped to the probability of class membership, using the logistic function) |
machine learning | using data to create a model that maps one or more input values to one or more output values |
matrix | an array with two indices |
mean squared error (aka Gaussian error) | a loss function that can be used to measure error for a regression model [mean((actual - prediction)^2)] |
noise | another name for irreducible error |
non-parametric | a model where the size of the model is variable; i.e. the "size" of the model can grow with the size of the training data |
output variable | a variable that is produced by a model |
overfitting | fitting "noise" in the training data; i.e. decreasing training set error while increasing testing set error |
parametric | a model where the size of the model is fixed; i.e. the "size" of the model does not grow with the size of the training data |
predictor | another name for an input variable |
qualitative | a categorical value that identifies a quality (e.g. gender) |
quantitative | a numeric value that measures quantity (e.g. height) |
reducible error | error that can be reduced by improving a predictive model (e.g. adding useful input variables) |
regression | predicting a quantitative output value |
response | another name for an output variable |
scalar | a single numeric value |
scatterplot | a two-dimensional graph that plots points for coordinates observed in a data set |
scatterplot matrix | a matrix that consists of the pairwise scatterplots for a set of quantitative variables |
semi-supervised learning | using both labeled and unlabeled observations to construct a model (output values are provided for only a subset of the training data) |
smoothing spline | a spline regression model that supports non-linear regression; it is fit by minimizing squared error plus a roughness penalty, which limits model complexity |
supervised | a learning setting where the algorithm is provided both input and output values for training the model (e.g. regression, classification) |
support vector machine | a type of classification or regression model, where prototypical observations (support vectors) from the training data are used to make predictions |
systematic | in Y = f(X) + epsilon, this is the f() |
target | another name for an output variable |
tensor | an array with more than two indices |
test data | data that is used to evaluate a model, but was not used to construct the model |
test Mean Squared Error (MSE) | MSE for test data |
testing error | the error rate for the test data for a classification model |
thin-plate spline | a generalization of the smoothing spline to two or more input variables; supports non-linear regression |
training | the process of constructing a model |
training data | data that is used to construct a model |
training error | the error rate for the training data for a classification model |
training Mean Squared Error (MSE) | MSE for the training data |
unsupervised | a learning setting where the algorithm is provided only input values (not output values) for training the model (e.g. dimensionality reduction, clustering) |
variable | a symbol that represents an attribute with more than one possible value |
variance | the expected value of the squared deviation from the mean for a variable [mean((x - mean(x))^2)] |
vector | an array with one index |
workspace | the set of variables currently defined for your R environment |
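
The bracketed R expressions in the glossary (absolute error, mean squared error, variance, least squares) can be tried out on a small data set. The sketch below is only an illustration: the values and the variable names (`actual`, `prediction`, `x`, `y`, `X`, `beta`) are made up for this example, and the least-squares step assumes that `t(X) %*% X` is invertible. It solves the normal equations with the two-argument form of `solve()`, which gives the same coefficients as forming the inverse explicitly.

```r
# Toy values, chosen only for illustration (not from any real data set)
actual     <- c(3, 5, 7, 9)
prediction <- c(2, 5, 8, 12)

# Absolute error: average magnitude of the prediction errors
mae <- mean(abs(actual - prediction))

# Mean squared error: average squared prediction error
mse <- mean((actual - prediction)^2)

# Variance as defined in the glossary (divides by n);
# note that R's built-in var() divides by n - 1 instead
x <- c(1, 2, 3, 4)
pop_var <- mean((x - mean(x))^2)

# Least squares via the normal equations
X <- cbind(1, x)      # design matrix: intercept column plus one input variable
y <- 2 + 3 * x        # noise-free output, so the fit recovers the coefficients
beta <- solve(t(X) %*% X, t(X) %*% y)

print(c(mae = mae, mse = mse, pop_var = pop_var))
print(beta)
```

Because `y` is generated without an error term, `beta` recovers the intercept 2 and slope 3 up to floating-point rounding. With real data the error term is nonzero and the estimates would only approximate the underlying coefficients.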