What is a Linear Model?

The first step into machine learning and statistics usually begins with linear models. This approach has been a fundamental building block of data science for decades, and it lets us express seemingly complex relationships in data with a simple linear equation. Its goal is clear: given the input vector $X^T = (X_1, X_2, \ldots, X_p)$, predict the output value $Y$:

\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j

$\hat{\beta}_0$ is the bias (intercept). If we include the constant 1 in $X$ and $\hat{\beta}_0$ in the coefficient vector $\hat{\beta}$, the formula can be written in vector form:

\hat{Y} = X^{\top} \hat{\beta}

Understanding the Dot Product

Let’s break down the formula step by step to better understand it:

\hat{Y}=\hat{\beta}_0 + X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \cdots + X_p\hat{\beta}_p

Here, each $X_j$ is a single feature value and is a scalar:

X_j \in \mathbb{R}

Similarly, each coefficient is also a scalar:

\hat{\beta}_j \in \mathbb{R}

Therefore, the product of two real numbers yields a single real number:

X_j \hat{\beta}_j \in \mathbb{R}

For example:

X_1=3,\quad \hat{\beta}_1=4

then:

X_1\hat{\beta}_1 = 12

This 12 is a scalar: not a vector or a matrix, just a single number.

The same applies to all terms:

X_2\hat{\beta}_2,\; X_3\hat{\beta}_3,\; \ldots,\; X_p\hat{\beta}_p

each produces a single scalar on its own.

As a result, their sum is also a single number:

\hat{Y}\in \mathbb{R}

Therefore, the output of a linear model is a single scalar prediction. Each individual product is a scalar, and the sum of these products over all entries is called the dot product (inner product) of $X$ and $\hat{\beta}$.
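To make the step-by-step breakdown concrete, here is a minimal sketch in plain Python. The function name `predict` and the numeric values (apart from $X_1 = 3$ and $\hat{\beta}_1 = 4$ from the example above) are assumptions chosen only for illustration.

```python
def predict(x, beta, intercept):
    """Return intercept + sum_j x[j] * beta[j] for a single observation."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    # Each product x_j * beta_j is a scalar; adding them up gives one scalar.
    return intercept + sum(x_j * b_j for x_j, b_j in zip(x, beta))


# Hypothetical values: X_1 = 3 and beta_1 = 4 as in the text, the rest made up.
x = [3.0, 2.0]
beta = [4.0, 0.5]
intercept = 1.0

y_hat = predict(x, beta, intercept)
print(y_hat)  # 14.0, i.e. 1 + 3*4 + 2*0.5
```

The result is a single `float`, which mirrors the statement that $\hat{Y} \in \mathbb{R}$.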

Including the Bias

Instead of writing the bias separately, we can extend the feature vector:

X = \begin{bmatrix} 1 \\ X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}, \quad \hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_p \end{bmatrix}

The first entry of $X$ is 1 so that it multiplies $\hat{\beta}_0$ in the dot product. Since $1 \cdot \hat{\beta}_0 = \hat{\beta}_0$, the intercept becomes part of the same formula.
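The following sketch checks that prepending a 1 to the features gives the same prediction as adding the intercept separately. The helper names `add_bias_feature` and `dot`, and all numbers, are hypothetical and used only for illustration.

```python
def dot(u, v):
    """Sum of element-wise products of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def add_bias_feature(x):
    """Prepend the constant 1 so the intercept acts like any other coefficient."""
    return [1.0] + list(x)


# Hypothetical feature values and coefficients.
x = [0.5, -2.0]
beta_0 = 1.5
beta = [2.0, 0.25]

x_aug = add_bias_feature(x)   # [1.0, 0.5, -2.0]
beta_aug = [beta_0] + beta    # [1.5, 2.0, 0.25]

separate = beta_0 + dot(x, beta)   # intercept written separately
combined = dot(x_aug, beta_aug)    # intercept folded into the vectors
print(separate, combined)  # 2.0 2.0
```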

Why Is the Transpose Needed?

Under the rules of matrix multiplication, we cannot directly multiply two column vectors.

X = \begin{bmatrix} 1 \\ X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = (p + 1)\times1, \quad \hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = (p + 1)\times1

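In matrix multiplication, the inner dimensions must match: the number of columns of the first factor must equal the number of rows of the second. Therefore, we need to write one of the vectors as a row, which is exactly what the transpose does.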

((p + 1)\times1) \times ((p + 1)\times1) \rightarrow (1\times(p + 1)) \times ((p + 1)\times1)

= 1\times1

So now:

X^\top = \begin{bmatrix} 1 & X_1 & X_2 & \ldots & X_p \end{bmatrix} = 1\times(p + 1), \quad \hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = (p + 1)\times1
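The dimension rule can be seen in a small sketch that stores matrices as lists of rows. The helpers `shape`, `transpose`, and `matmul` are written here for illustration only (in practice a library such as NumPy handles this), and the numbers are hypothetical.

```python
def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]))


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product; raises ValueError if the inner dimensions differ."""
    (n, k), (k2, m) = shape(a), shape(b)
    if k != k2:
        raise ValueError(f"inner dimensions differ: {n}x{k} times {k2}x{m}")
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


# Two hypothetical 3x1 column vectors (p = 2, plus the leading 1).
x = [[1.0], [2.0], [0.5]]
beta = [[3.0], [1.0], [4.0]]

try:
    matmul(x, beta)            # (3x1) times (3x1): inner dimensions 1 and 3
    multiplied_directly = True
except ValueError:
    multiplied_directly = False

result = matmul(transpose(x), beta)   # (1x3) times (3x1)
print(multiplied_directly)            # False
print(shape(result), result)          # (1, 1) [[7.0]]
```

The product of the row and the column is a $1\times1$ matrix, which is treated as the scalar prediction.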

Example Solution

Suppose there are two features, with the constant 1 placed first for the intercept:

X= \begin{bmatrix} 1\\ 3\\ 5 \end{bmatrix} ,\quad \hat{\beta}= \begin{bmatrix} 2\\ 4\\ 6 \end{bmatrix}

Then:

X^\top \hat{\beta} = \begin{bmatrix} 1 & 3 & 5 \end{bmatrix} \begin{bmatrix} 2\\ 4\\ 6 \end{bmatrix}

Result:

=1\cdot2 + 3\cdot4 + 5\cdot6

=44
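The same arithmetic can be reproduced in a few lines of Python, using the vectors from this example. The variable names are illustrative.

```python
x = [1, 3, 5]       # leading 1, then the two feature values
beta = [2, 4, 6]    # intercept, then the two coefficients

# One scalar product per position, then a single sum.
terms = [x_j * b_j for x_j, b_j in zip(x, beta)]
y_hat = sum(terms)

print(terms)  # [2, 12, 30]
print(y_hat)  # 44
```

Note that the first term, $1 \cdot 2$, is just the intercept.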

Hyperplane

Geometrically, a linear model is a hyperplane. Consider two features ( $X_1, X_2$ ) and one output ( $Y$ ). The predictions produced by our model ( $\hat{Y}$ ) form a flat surface, i.e., a plane (or a hyperplane in higher dimensions), extending along the $X_1$ and $X_2$ axes.

In the visualization below, the blue plane represents the hyperplane defined by our linear model, the points represent the actual data, and the red dashed lines represent the distance from each data point to the model’s plane:

Each point is projected onto the hyperplane vertically, that is, parallel to the $Y$ axis rather than perpendicular to the plane. The difference between the actual value ( $Y_i$ ) and the model’s prediction on the plane ( $\hat{Y}_i$ ) is called the residual: $e_i = Y_i - \hat{Y}_i$ .
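As a small illustration of the residual formula, the sketch below evaluates a plane at a few points and subtracts the prediction from the observed value. The coefficients and data points are made up for this example and are not fitted to anything.

```python
def predict(x1, x2, beta):
    """Height of the plane above the point (x1, x2)."""
    b0, b1, b2 = beta
    return b0 + b1 * x1 + b2 * x2


# Hypothetical plane: Y_hat = 1 + 2*X_1 - 1*X_2
beta = (1.0, 2.0, -1.0)

# Hypothetical observations as (x1, x2, y).
data = [
    (1.0, 2.0, 1.5),    # lies above the plane
    (0.0, 1.0, -0.5),   # lies below the plane
    (2.0, 0.0, 5.0),    # lies exactly on the plane
]

# e_i = Y_i - Y_hat_i, measured along the Y axis.
residuals = [y - predict(x1, x2, beta) for x1, x2, y in data]
print(residuals)  # [0.5, -0.5, 0.0]
```

A positive residual means the point lies above the plane, a negative one means it lies below.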

Note: How this plane is placed in space and how the residuals are minimized are the subject of optimization methods. The linear model is simply the mathematical definition of this plane.

Coefficients and the Gradient Vector

When we think of the linear model’s equation as a function, $f(X) = X^\top \beta$ , the $\beta$ coefficients determine the model’s slope in space. Mathematically, the gradient of a function indicates the direction of steepest ascent at a given point.

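For a linear model, this gradient is simply the coefficient vector. The derivative is taken with respect to the features $X_1, \ldots, X_p$ , so here $\beta$ stands for the slope coefficients $(\beta_1, \ldots, \beta_p)$ ; the intercept $\beta_0$ only shifts the plane and does not appear in the gradient:

\nabla f(X) = \beta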

So the $\beta$ vector indicates the direction in which the model’s output ( $\hat{Y}$ ) increases fastest in the input space.

In the visualization above, the green arrow represents the $\nabla f$ (i.e., $\beta$ ) vector, and the orange dot represents the current position $X$ . Since the slope of the plane is constant in a linear model, this direction is the same everywhere. Training algorithms such as gradient descent rely on the same notion of a gradient, typically applied to a loss function with respect to the coefficients, to move the model toward the best fit.
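The claim that the gradient equals the slope coefficients at every point can be checked numerically with central finite differences. This is a rough sketch: the function names, the step size `h`, the coefficients, and the two test points are all assumptions made for illustration.

```python
def f(x, beta_0, beta):
    """Linear model: intercept plus the dot product of features and slopes."""
    return beta_0 + sum(x_j * b_j for x_j, b_j in zip(x, beta))


def numerical_gradient(func, x, h=1e-5):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for j in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad


# Hypothetical coefficients.
beta_0 = 2.0
beta = [4.0, 6.0]


def model(x):
    return f(x, beta_0, beta)


# Two arbitrary positions in the input space.
grad_a = numerical_gradient(model, [0.0, 0.0])
grad_b = numerical_gradient(model, [10.0, -3.0])

# Both are approximately [4.0, 6.0], up to floating-point error.
print(grad_a)
print(grad_b)
```

Both approximations agree with `beta` regardless of the position, and the intercept `beta_0` has no effect on them.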