Linear Regression

What is Linear Regression?

Welcome to the beginning of your machine learning journey! Linear regression is one of the simplest ways to identify patterns in data and make meaningful predictions from them. For example, is there a relationship between the amount of time you study and your test scores?

Here we have performed linear regression on a set of Time Studied vs Test Score data and found a clear relationship: the more you study, the higher the test score!

In this lesson, you will learn how to perform linear regression yourself and analyze data like a true data scientist!

Understanding Data: Points and Patterns

(If you already know how to graph data in the coordinate plane, you can skip this section.)

What is data?

Data is information, usually in the form of numbers. We can gather data experimentally and analyze it. In this lesson, each piece of data pairs an independent variable (X) with a dependent variable (Y). “Variable” means subject to change: the values we collect differ from one observation to the next. In data that is useful for regression, Y depends on X; in other words, a change in X tends to come with a change in Y. If changes in X have no effect on Y, then there is no pattern for us to analyze.

 

In the example above, Time Studied is the independent variable and Test Score is the dependent variable (Time Studied = X, Test Score = Y). They are variables because the numbers vary from person to person: one person studied for 6 hours and scored 72%, whereas another studied for 1 hour and scored 50%. The amount of time studied affects the test score, so test score depends on time studied, not the other way around. That is why Time Studied is independent and Test Score is dependent in this example.

 

Now that we know the dependent and independent variables of our data, we can gather data points. This is generally done experimentally in real studies. A single data point is written like so:

(x, y)

x: value of the independent variable
y: value of the dependent variable

To gather data for analysis or predictions, we need a lot of data points. We organize large numbers of data points into data tables. Here is a portion of the data table for the Time Studied vs Test Score data:

Time Studied (hours)    Test Score (%)
15                      95
14                      92
12                      90
11                      87
10                      85

Each row is a data point. For example, (10, 85) means 10 hours studied with a resulting score of 85%.
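One way to hold this table in code is as a list of (x, y) pairs. The sketch below uses Python, which this lesson does not itself prescribe, and the variable names are only illustrative.

```python
# Each tuple is one data point: (hours studied, test score in %).
# The values are the five rows of the table above.
data_points = [(15, 95), (14, 92), (12, 90), (11, 87), (10, 85)]

# Separate the independent (x) and dependent (y) values.
time_studied = [x for x, y in data_points]
test_scores = [y for x, y in data_points]

for x, y in data_points:
    print(f"Studied {x} hours -> scored {y}%")
```

Keeping each pair together in one tuple mirrors the rows of the table, while the two separate lists are handy later when x and y are processed on their own.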

Now that we have our table, we can plot this data on a coordinate plane. Here is a fun activity to teach you about data plotting!

< PLOTTING MINIGAME >

< MORE REAL WORLD EXAMPLES >
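For readers who would rather see plotting as code, here is a minimal sketch that draws the five table rows as a plain-text scatter plot, with Time Studied increasing to the right and Test Score increasing upward. The function `text_scatter` and its grid size are made up for this example; in practice a plotting library would normally be used instead.

```python
def text_scatter(points, width=30, height=10):
    """Return a rough plain-text scatter plot of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Start with an empty grid; row 0 is the top of the plot.
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Scale each value to a column / row index.
        # Assumes the x values (and the y values) are not all identical.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # larger y goes higher up
    return "\n".join("".join(line).rstrip() for line in grid)

points = [(15, 95), (14, 92), (12, 90), (11, 87), (10, 85)]
plot = text_scatter(points)
print(plot)
```

The stars climb from the lower left to the upper right, which is the same upward trend the table shows.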

Drawing the Line of Best Fit

Congratulations! You are now ready to perform linear regression on selected data. The central idea behind linear regression is drawing a line of best fit through our plotted data: a straight line that passes as close as possible to the points. Think of it as drawing a line through the graph yourself!

Why can’t you just connect each data point on the graph? Connecting the dots only reproduces the points we already have. To understand the data and make predictions later on, a computer needs to find a general trend, such as a linear increase or decrease, or one of the more complicated trends you will learn about later. So the computer needs to fit a mathematical model to the data, usually in the form of a function that maps an input x to an output y. Here are some examples of functions that can be used for regression:
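As a rough illustration, each kind of model is just a function that takes x and returns a predicted y. The three shapes below are common choices, and the coefficient values are made up for the example rather than fitted to any data.

```python
import math

def linear(x, m=2.0, b=1.0):
    """Straight line: y = m*x + b."""
    return m * x + b

def quadratic(x, a=1.0, b=0.0, c=0.0):
    """Parabola: y = a*x^2 + b*x + c."""
    return a * x**2 + b * x + c

def logistic(x, k=1.0, x0=0.0):
    """S-shaped curve between 0 and 1: y = 1 / (1 + e^(-k*(x - x0)))."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

# Evaluate each model at a few inputs to compare their shapes.
for x in [0, 1, 2]:
    print(x, linear(x), quadratic(x), round(logistic(x), 3))
```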

You can fit any kind of curve you want, but you should choose the kind of curve that best suits the specific dataset. The simplest curve to fit is a straight line, although a straight line is often too simple to model real-world situations well. There are other forms of regression for different situations, such as logistic regression and polynomial regression, which you will learn about later!
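The lesson has not yet said how “best” is measured. Here is a minimal sketch, assuming the common least-squares criterion (the line that minimizes the sum of squared vertical distances to the points) and using only the five rows shown in the data table earlier. The function name `fit_line` is made up for this example.

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx  # assumes the x values are not all identical
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

points = [(15, 95), (14, 92), (12, 90), (11, 87), (10, 85)]
slope, intercept = fit_line(points)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
# prints "score = 1.88 * hours + 66.44"
```

Because only five rows of the table are used, this line describes those rows and not necessarily the full data set.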

This minigame will help teach you about fitting lines to data:

< Fitting minigame: Line it up! >

The Math Behind It - Level 1

Let’s start with the math behind a simple line. Here is the equation of a line:

y = mx + b

m: slope
b: y-intercept

We already know x and y, the independent and dependent variables! But what are m and b?

m is the slope of the line: the change in y for every increase of 1 in x. In other words, when we increase x by 1, how much does y change? That is m! The larger m is in magnitude, the steeper the line. What if m is less than 0? Then y decreases as x increases, so the line slopes downward from left to right!

b is the y-intercept of the line: the value of y where the line crosses the y-axis! Every point on the y-axis has x = 0, so the y-intercept is the point (0, b).
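Both facts can be checked with a few lines of code. The values of m and b below are arbitrary, chosen only for illustration.

```python
def line(x, m, b):
    """Evaluate y = m*x + b."""
    return m * x + b

m, b = 3, 50  # arbitrary example values

# Increasing x by 1 changes y by exactly m.
print(line(5, m, b) - line(4, m, b))  # prints 3

# At x = 0 the line is at height b, so it crosses the y-axis at (0, b).
print(line(0, m, b))  # prints 50

# A negative slope means y falls as x rises.
print(line(1, -2, b), line(2, -2, b))  # prints 48 46
```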

Play around with the sliders in the following interactive to fully understand m and b:

The Math Behind It - Level 2