
Chapter 6: Getting Started With Linear Regression In R

We live in an information-driven world where data reigns supreme. Unsurprisingly, we must assess relevant data to make critical business decisions. One of the more popular data analysis techniques is regression. The discipline of machine learning is expanding, and with it, a popular algorithm: linear regression. This post will teach you about linear regression and how it works in R.

Why Linear Regression?

Before delving into linear regression, let's motivate the need for it with an analogy.

Assume we were asked to forecast the number of skiers at a resort based on snowfall in the vicinity. The simplest method would be to draw a basic graph with snowfall quantities and skier counts on the ‘X’ and ‘Y’ axes. Then, based on the graph, we may deduce that as the amount of snowfall grew, so did the number of skiers.

As a result, the graph makes the link between skiers and snowfall clear. The number of skiers grows according to the amount of snowfall. Therefore, we can make better judgments on ski area operations based on the information provided by the graph.
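The relationship in this analogy can be sketched in a few lines of R. Note that both the snowfall amounts and skier counts below are made-up values, used purely for illustration:

```r
# Hypothetical snowfall (cm) and skier counts, invented for illustration
snowfall <- c(10, 20, 30, 40, 50, 60)
skiers   <- c(120, 210, 305, 390, 510, 580)

plot(snowfall, skiers)        # X-axis: snowfall, Y-axis: skiers
fit <- lm(skiers ~ snowfall)  # fit a simple linear regression
abline(fit)                   # draw the fitted line through the points
coef(fit)                     # intercept and slope
```

A positive slope in `coef(fit)` confirms what the graph suggests: more snowfall, more skiers.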

To grasp linear regression, we must first understand “regression.” Regression is a technique for determining the relationship between a dependent variable (Y) and one or more independent variables (X). The independent variables are the predictors or explanatory variables in this context, while the dependent variable is the response or target variable.

A linear regression equation looks like this:

y = B0 + B1x1 + B2x2 + B3x3 + …

where B0 is the intercept (the value of y when all x values are 0),

B1, B2, and B3 are the slopes (coefficients), and

x1, x2, and x3 are the independent variables.

Snowfall is an independent variable in this situation, whereas the number of skiers is a dependent variable. So, given that regression identifies correlations between dependent and independent variables, what precisely is linear regression?
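To make the equation concrete, here is a small numeric sketch in R. All coefficient and input values are invented for illustration:

```r
# Hypothetical coefficients and inputs for y = B0 + B1*x1 + B2*x2
B0 <- 50   # intercept: value of y when all x are 0
B1 <- 2    # slope for x1
B2 <- 0.5  # slope for x2
x1 <- 30
x2 <- 100

y <- B0 + B1 * x1 + B2 * x2
y  # 50 + 60 + 50 = 160
```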

What is Linear Regression?

Linear regression is a statistical technique that models the relationship between two or more continuous variables. It fits a predictive model to observed data in order to reveal patterns. To fit the model, analysts typically use the “least squares” method; other approaches exist, but least squares is the most common.

The graph below demonstrates the link between the heights and weights of a sample of people. The red line is the regression line, which reveals that a person’s height is positively correlated with weight.

Now that we know what linear regression is, let’s look at how it works and how we can use the linear regression formula to get the regression line.

How Does Linear Regression Work?

We can better understand how linear regression works by considering a dataset with two fields, Area and Rent, used to forecast the rent of a property based on its area. The dataset is as follows:

  1. We create a graph using the supplied data, with Area on the X-axis and Rent on the Y-axis. Take note of the roughly linear pattern with a small dip.
  2. We then calculate the mean of Area and the mean of Rent.
  3. The mean point is plotted on the graph.
  4. We draw a best-fit line that passes through the mean.
  5. But there is an issue: multiple lines can be drawn through the mean.
  6. To solve this, we adjust the line until the best-fit line has the smallest squared distance from the data points.
  7. Summing the squares of the residuals gives the least-squares distance.
  8. A residual is the difference between Y-actual and Y-pred (the actual and predicted values of Y).
  9. These formulas give the values of m and c for the best-fit line, y = mx + c.
  10. This helps us determine the relevant values.
  11. We can then calculate the values of m and c.
  12. We can now calculate the value of Y-pred.
  13. We calculate that the least-squares value for the resulting line is 3.02.
  14. Finally, we plot Y-pred, which gives the best-fit line.
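The slope and intercept computations described in the steps above can be sketched in R. For a best-fit line y = mx + c, the least-squares formulas are m = sum((x − mean(x)) * (y − mean(y))) / sum((x − mean(x))^2) and c = mean(y) − m * mean(x). The Area and Rent values below are invented for illustration:

```r
# Hypothetical Area (sq ft) and Rent values, for illustration
area <- c(500, 750, 1000, 1250, 1500)
rent <- c(8000, 11000, 15000, 18000, 21000)

x_bar <- mean(area)
y_bar <- mean(rent)

m  <- sum((area - x_bar) * (rent - y_bar)) / sum((area - x_bar)^2)  # slope
c0 <- y_bar - m * x_bar                                             # intercept

y_pred     <- m * area + c0   # predictions from the best-fit line
resid_vals <- rent - y_pred   # Y-actual minus Y-pred

# Sanity check: lm() computes the same coefficients
coef(lm(rent ~ area))
```

The same m and c fall out of `lm(rent ~ area)`, which is the function used in the demonstration that follows.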

This demonstrates the linear regression technique in action. Let us now go on to our use case.

Use Case: Predicting Revenue with Linear Regression

Using a linear regression model in R, we will predict revenue from paid, organic, and social traffic.

We will now examine a real-world scenario in which we will estimate income using regression analysis in R. The following is an example dataset with which we will be working:

In this demonstration, we will use the following three attributes to forecast revenue:

  • Paid traffic is traffic obtained through advertising.
  • Organic traffic is non-paid traffic from search engines.
  • Social traffic refers to traffic originating from various social networking sites.

We shall employ multiple linear regression, with the formula:

Revenue = B0 + B1(Paid) + B2(Organic) + B3(Social)

Before we begin, let’s have a look at how the software works:

  1. Read the inputs from a CSV file.
  2. Import the necessary libraries.
  3. Divide the dataset into two parts: train and test.
  4. Run the regression on paid, organic, and social traffic.
  5. Validate the model.

So let’s get started with our step-by-step linear regression demonstration! First, launch RStudio.

We enter the following code in R:

# Import the dataset

sales <- read.csv("Mention your download path") # Replace the placeholder with the path to your CSV file

head(sales) #Displays the top 6 rows of a dataset

summary(sales) #Gives certain statistical information about the data. The output will look like the below:

dim(sales) # Displays the dimensions of the dataset

We are now going to plot the variables.

plot(sales) # Plot the variables to see their trends

Let’s look at how the variables are connected. We’ll solely use the numeric column values for this.

library(corrplot) # Library to find the correlation between the variables

num.cols<-sapply(sales, is.numeric)

num.cols

cor.data<-cor(sales[,num.cols])

cor.data

corrplot(cor.data, method = "color")

As the following correlation matrix shows, the variables are highly correlated with one another and with the sales variable.

Let’s divide the data into training and testing sets immediately.

# Split the data into training and testing

set.seed(2)

library(caTools) #caTools has the split function 

split <- sample.split(sales$Revenue, SplitRatio = 0.7) # sample.split expects a vector, here the target column; it returns TRUE for rows assigned to training. With a ratio of 0.7, 70% of the sales data is used for training the model and 30% for testing it

split

train <- subset(sales, split == TRUE) # Creating the training set

test <- subset(sales, split == FALSE) # Creating the testing set from the remaining rows

head(train)

head(test)

View(train)

View(test)

Now that we have the test and training variables, let’s build the model:

Model <- lm(Revenue ~ ., data = train) # Creates the model. lm stands for linear model; Revenue is the target variable we want to predict, and "." uses all remaining columns as predictors.

summary(Model) 

# Prediction

pred <- predict(Model, test) #The test data was kept for this purpose

pred #This displays the predicted values 

res<-residuals(Model) # Find the residuals

res<-as.data.frame(res) # Convert the residual into a dataframe

res # Prints the residuals

# compare the predicted vs actual values

results<-cbind(pred,test$Revenue)

results

colnames(results) <- c("predicted", "real")

results<-as.data.frame(results)

head(results)

# Let’s now, compare the predicted vs actual values

plot(test$Revenue, type = "l", lty = 1, col = "red")

The graph produced by the preceding command displays the actual revenue from the test set.

Let us now overlay the predicted revenue on the same plot with the following command:

lines(pred, type = “l”, col = “blue”) #The output looks like below

Let’s plot the forecast completely using the following command:

plot(pred, type = "l", lty = 1, col = "blue") # This graph shows the predicted revenue on its own

We can observe from the above plots that the curves of actual and predicted revenue are quite similar. Let’s compute the accuracy to confirm the comparison.

# Calculating the accuracy

rmse <- sqrt(mean((pred - test$Revenue)^2)) # Root Mean Square Error is the standard deviation of the residuals; note we compare against the test set, not the full dataset

rmse

The output looks like below:

As you can see, the accuracy of this model is excellent. This concludes the demonstration.
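Besides RMSE, R-squared is a common accuracy check for a regression model. The sketch below computes both on small made-up actual/predicted vectors so that it runs on its own; with the demonstration’s objects, you would substitute `test$Revenue` and `pred`:

```r
# Hypothetical actual and predicted revenue values, for illustration
actual    <- c(100, 150, 200, 250, 300)
predicted <- c(110, 140, 210, 240, 310)

rmse <- sqrt(mean((predicted - actual)^2))  # root mean square error

ss_res <- sum((actual - predicted)^2)       # residual sum of squares
ss_tot <- sum((actual - mean(actual))^2)    # total sum of squares
r2 <- 1 - ss_res / ss_tot                   # R-squared: closer to 1 is better

rmse  # 10
r2    # 0.98
```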

Conclusion

You now understand why linear regression is needed, what a linear regression model is, and how the linear regression algorithm works. You also saw a real-world example in which we used RStudio to predict revenue from our dataset. You learned about the relevant commands and packages in R and how to plot a graph. Although this is a fantastic starting point, there is still much more to learn about linear regression.

GoLogica Technologies Private Limited  © 2019. All rights reserved.