Predicting values from an observed dataset requires a mathematical model that fits the data well. Statistical measures such as R-squared indicate how suitable a model is by showing how far the data deviate from the estimated model.
R-squared is a goodness-of-fit measure for regression models. It expresses how strongly the variables are related under the model on a scale of 0 to 100%.
After fitting a linear regression model to the observed data points, we need to check how well the model fits the data and how reliably it predicts values. Several methods for testing goodness of fit exist; this article discusses the R-squared method and how it relates to estimating model fit. The following sections cover correlation and regression analysis and then examine the R-squared method, building up the concept in a structured way.
Correlation is a statistical measure that expresses the extent to which two variables are linearly related, in both direction and strength.
Correlation is measured by a unit-free quantity called the coefficient of correlation (r), which quantifies the strength of the relationship and varies between -1 and 1.
A value of -1 represents a perfect negative correlation, where one variable tends to decrease as the other increases, whereas 1 represents a perfect positive correlation, where both variables increase together. A value near 0 represents a weak linear relationship, and exactly 0 means no linear relationship.
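The three cases above can be checked numerically. The following is a minimal sketch that computes the coefficient of correlation from its usual definition; the function name `pearson_r` and the small datasets are made up for illustration and are not part of the original article.

```python
import numpy as np

def pearson_r(x, y):
    """Coefficient of correlation (r) between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Illustrative, made-up data
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises with x: r is 1
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises: r is -1
r_zero = pearson_r(x, [2, 1, 0, 1, 2])   # symmetric V shape: no linear trend, r is 0

print(r_pos, r_neg, r_zero)
```

The last case shows that r only captures linear relationships: the V-shaped data are clearly related, yet r is zero.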
Regression is a method of analysis that fits an equation to data so that the equation can be used for prediction. It uses the available dataset to estimate the most likely value of the variable being predicted.
Regression analysis requires forming a regression line. A scatter plot of the two quantitative variables is drawn, and regression then finds the line that best fits the data. This line, known as the regression line, represents the pattern in the data. It predicts how the Y value changes as the X value increases.
The equation of the regression line is given by:
Y = a + bx
Here Y is the predicted value of y, x is the value of the independent variable, a is the y-intercept, and b is the slope.
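To calculate the regression constants (a and b):
- Make a scatter plot.
- Calculate the mean and the standard deviation (S) of x and of y.
- Calculate the correlation coefficient (r).
b = r * (Sy / Sx)
a = (mean of y) - b * (mean of x)
The slope b is calculated first because the intercept a depends on it.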
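The calculation above can be sketched in Python. This is an illustrative example, not a definitive implementation: the helper name `regression_constants` and the data values are assumptions, and NumPy's `polyfit` is used only as an independent cross-check of the result.

```python
import numpy as np

def regression_constants(x, y):
    """Return (a, b) for the line Y = a + b*x using b = r * (Sy / Sx)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
    s_x = x.std(ddof=1)           # sample standard deviation of x
    s_y = y.std(ddof=1)           # sample standard deviation of y
    b = r * (s_y / s_x)           # slope
    a = y.mean() - b * x.mean()   # intercept, from the means of y and x
    return a, b

# Illustrative, made-up observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_constants(x, y)
slope, intercept = np.polyfit(x, y, 1)  # least-squares line for comparison

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"polyfit: intercept = {intercept:.4f}, slope = {slope:.4f}")
```

Both routes give the same line, which shows that the correlation-based formulas are another way of writing the least-squares solution for a single independent variable.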
Fitting of Data
Linear regression produces an equation, and the differences between the observed values and the fitted values measure how well that equation describes the data. The purpose of linear regression is to find the least-squares regression line, the line that minimizes the sum of these squared differences.
For a well-fitting model, the differences between the predicted and observed values must be small and unbiased, i.e., the fitted values must not be systematically too high or too low.
The best way to analyze the fit is to evaluate residual plots. A residual plot shows the residuals (observed minus fitted values) against the fitted values or the independent variable. If the plot shows a systematic pattern, the model needs to be revisited. Once the residual plots look acceptable, we can use statistical measures like R-squared, which measures goodness of fit.
The R-squared method can be used once we have an unbiased model of the available data.
R-squared and a good fit
Once the best-fit regression line is formed, R-squared measures the scatter of the data points around that line. Also known as the coefficient of determination, it indicates how closely the data points fit the regression line.
In a linear model, R-squared is the percentage of the variation in the dependent variable that the model explains. It ranges from 0 to 100%.
A value of 0% stands for a non-explanatory model: it explains none of the variation in the response variable around its mean. In that case the regression model predicts no better than simply using the mean of the dependent variable.
A value of 100% stands for a model that explains all of the variation in the response variable around its mean. For the same data, a better-fitting regression model has a larger R-squared value.
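The 0% and 100% cases can be reproduced with the common definition of R-squared as one minus the ratio of residual variation to total variation around the mean. The sketch below assumes that definition; the function name `r_squared` and the data are made up for illustration.

```python
import numpy as np

def r_squared(y_observed, y_predicted):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_observed - y_predicted) ** 2)        # variation left unexplained
    ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)  # variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Illustrative, made-up observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Predicting the mean everywhere explains nothing
r2_mean = r_squared(y, np.full_like(y, y.mean()))

# Predictions that match every observation explain everything
r2_perfect = r_squared(y, y)

# A least-squares line falls somewhere in between
slope, intercept = np.polyfit(x, y, 1)
r2_line = r_squared(y, intercept + slope * x)

print(f"{r2_mean:.0%}  {r2_perfect:.0%}  {r2_line:.0%}")
```

For this small dataset the fitted line explains 60% of the variation in y, while the mean-only and perfect predictions give the two extremes.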
How is the R-squared value represented?
R-squared can be visualized with a scatter plot and the regression line. To form the graph, plot the observations with the independent variable on the x-axis and the dependent variable on the y-axis.
For the R-squared value of a regression model to be 100%, all the observation points in the scatter plot must lie exactly on the regression line, which rarely happens in practice. Comparing two plots, when the R-squared value is low the points are loosely scattered around the regression line, whereas when it is high the data points lie close to the line.
The more closely the points cluster around the regression line, the higher the R-squared value.
The Difference between R-Squared and Adjusted R-Squared
For a simple linear regression model, we have only one independent variable, and we calculate the R-squared value of the model directly. In a multiple regression model, where there is more than one independent variable, R-squared is still defined, but it is not reliable for comparing models because it never decreases when a predictor is added.
So, we need adjusted R-squared for multiple regression. Adjusted R-squared compares the descriptive power of regression models that contain different numbers of predictors. With every added predictor, the R-squared value increases or at least stays the same, so models with more independent variables appear to fit better. Adjusted R-squared compensates for the number of variables included: it decreases if a new term doesn't improve the model enough and increases if the new variable genuinely improves the model.
R-squared can be misleadingly high for an over-fitted model even when the model has lost its ability to predict well. Adjusted R-squared is less prone to this problem.
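The article does not give a formula for the adjustment, so the sketch below assumes the commonly used one: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p is the number of predictors. The function name and the numbers are hypothetical.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjusted R-squared using the common (n - 1) / (n - p - 1) correction."""
    if n_observations - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1.0 - (1.0 - r2) * penalty

# Hypothetical case: two models reach the same R-squared of 0.80 on 30 observations
adj_small = adjusted_r_squared(0.80, n_observations=30, n_predictors=2)
adj_large = adjusted_r_squared(0.80, n_observations=30, n_predictors=6)

print(f"2 predictors: {adj_small:.3f}")
print(f"6 predictors: {adj_large:.3f}")
```

With the same plain R-squared, the model that needs six predictors gets the lower adjusted value, which is the compensation described above.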
R-squared is a useful measure of fit, but it has some limitations. It reflects only the strength of the estimated relationship between the dependent and independent variables, so judging from the value alone whether the fit is good or bad isn't reliable. R-squared cannot tell us whether the coefficient estimates and predictions are biased, so we must examine the residual plots. Also, a high or low R-squared value does not by itself identify a good model: a good model can have a low R-squared value, and a biased model can have a high one.
To get a better-estimated model, we need to examine the model with different parameters and various plots.
Low R-squared Values Importance
Models with low R-squared values can still be perfectly good. Some fields of study have a large amount of inherently unexplainable variation, so low R-squared values are expected there.
If the independent variables are statistically significant, important conclusions can still be drawn despite a low R-squared value.
High R-squared Values Issues
A high R-squared value can also hide problems. R-squared can be inflated by practices such as data mining and by overfitted models.
A model with a high R-squared value is expected to be a good model, but if the regression line systematically over-predicts or under-predicts the data along the curve, the model is biased. This can be observed in a residuals vs. fits plot. For an unbiased model, the residuals must be randomly scattered around zero. If the residuals show a non-random pattern, the model is a bad fit even with a high R-squared value.
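A small numerical experiment illustrates this. In the sketch below, a straight line is fitted to data that actually follow a curve; the synthetic data (y = x squared, without noise) are an assumption chosen only to make the pattern easy to see.

```python
import numpy as np

# Synthetic, made-up data: the true relationship is curved
x = np.linspace(0.0, 10.0, 50)
y = x ** 2

# Fit a straight line by least squares
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# R-squared of the straight-line fit
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"R-squared: {r2:.3f}")
print("residual at start :", round(float(residuals[0]), 2))
print("residual at middle:", round(float(residuals[25]), 2))
print("residual at end   :", round(float(residuals[-1]), 2))
```

The R-squared value comes out above 90%, yet the residuals are positive at both ends and negative in the middle. Plotting `residuals` against `fitted` would show this as a U shape rather than a random scatter around zero, which marks the straight-line model as biased.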
Regression is a useful method for predicting values from observed data. R-squared is an intuitive measure of how well the linear model fits the observed dataset: it tells us how closely the data are scattered around the regression line. Still, it doesn't tell the whole story, because over-fitted or biased models can produce misleading R-squared values.
We need to examine residual plots, such as the residuals vs. fits plot, to verify that the model is applicable and that the estimated model is the best fit. Multiple regression calls for adjusted R-squared, because the plain R-squared value keeps increasing as variables are added. R-squared only measures the strength of the relationship between the independent and response variables and cannot provide a full picture of the model.
The biased graph generally occurs when we miss examining the extreme points of data during building models. Building machine learning concepts is based on knowledge of probability and data analysis.