Linear regression is a supervised machine learning algorithm used to model linear relationships from labeled datasets. The model created allows us to make predictions on a different dataset. To use it, there must be a linear relationship between input and output, meaning the output must change at a constant rate as the input changes. This way, we can separate the data with a straight line.
In the first graph here, we can separate our dataset because it is linear, but we cannot separate the parabola in the second example.
Variables and Terminology
In statistics literature, input values are called predictors or more commonly, independent variables. Outputs are used as dependent variables.
For example: Let’s say we have a dataset that records how many hours students studied and their exam scores.
Student Study and Exam Score Data
From the data, we can see that as study hours increase, the score also rises. In this example, the exam score prediction is made based on study hours.
- Independent Variable: Study hours because we can control and observe it.
- Dependent Variable: Exam score because it depends on how many hours were studied.
We use the independent variable (input) to predict the dependent variable (output). If there is one dependent and one independent variable as in this example, it is called simple linear regression.
In our first example, is a simple regression example. Because our independent variable is only x.
In the second example, in the formula , x and z are independent variables, so there are 2 independent variables in total.
Therefore, the first example is simple, and the second is multiple linear regression.
If there are multiple independent variables (input) and we are trying to predict multiple dependent (output) variables, it is called Multivariate Linear Regression. For example, let’s say we have the following dataset:
Student Study and Exam Score Data
Here we are trying to predict math and physics exam results based on study hours and the number of practice tests taken.
This gives us two different equations, and this is called multivariate linear regression.
How It Works
In Linear Regression, relationships are modeled using linear predictor functions from known data, as we have shown in previous examples. The model answers the question “What is the expected output under these conditions?” by looking at the inputs we have, rather than the probability distribution of all variables.
Linear regression is historically the first type of regression analysis that has been rigorously studied in statistics and widely used in practical applications. The main reason for this is that models that are linearly dependent on their unknown parameters are much easier to mathematically solve and fit compared to non-linear models.
Its Place in Machine Learning
Linear regression is not just a classic statistical method, but also a fundamental machine learning algorithm. More specifically, it falls under the category of supervised learning.
As in our student score examples, we provide the model with both the inputs (study hours) and the correct answers, i.e., labels (exam scores). The algorithm learns from this labeled dataset and maps the data points to the most optimized linear function that can be used to make predictions on new, previously unseen data.
Key Use Cases
The practical applications of linear regression are generally divided into two main categories:
- Prediction and Forecasting: If our goal is to predict a future or unavailable value by minimizing error, linear regression is an excellent tool. A predictive model is trained with existing data. Then, for a new situation where we only have input values (e.g., study hours), the output (exam score) is predicted.
- Explaining Relationships and Understanding Variance: Sometimes the purpose is not just to make predictions, but to numerically measure the strength of the relationship between variables. For example, it is used to find how much change in independent variables leads to change in the dependent variable, to prove that some variables have no relevance to the result, or to identify which variables contain redundant/duplicate information with each other.