## Abstract

Linear regression is used to predict a trend in data, or to predict the value of one (dependent) variable from the value of another (independent) variable, by fitting a straight line through the data. Dallal (2000) examined how significant the linear regression equation is, how to use it to draw the best-fitting line through a scatter plot, and how important the best-fitting line is.

## Article review

Linear regression is used to predict a trend in data, or to predict the value of one (dependent) variable from the value of another (independent) variable, by fitting a straight line through the data. It represents the link between the independent (carrier) variable and the dependent (response) variable, which are plotted on the X and Y axes respectively, as a straight line. This line is the one that best represents, or predicts, the value of the response variable given the observed value of the carrier variable (Frey, 2006). This essay reviews the article "Introduction to simple linear regression" by Dallal (2000).

## Problem statement

Dallal (2000) assumed a relationship between body mass (the independent or carrier variable) and muscle strength (the dependent or response variable): the greater the body mass, the greater the muscle strength. However, this relationship is not without exceptions, which are reflected in the scatter plot of the data. Therefore, the author posed the question of how to draw the straight line that most accurately portrays the data, or predicts the value of the response variable.

## Research purpose statement

In the given example, a straight line appears to fit most cases well. However, a standardized procedure for fitting the straight line is necessary to provide better communication and common ground for analysts working on the same data. Further, from the example regression equation given (Strength = -13.971 + 3.016 LBM, where LBM is lean body mass), one can draw two conclusions. First, an individual's predicted muscle strength equals LBM multiplied by 3.016, minus 13.971. Second, the predicted difference in muscle strength between two individuals is 3.016 multiplied by the difference in their LBM.
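The two readings of the equation can be checked with a short Python sketch. The coefficients are the ones quoted above. The function names and the LBM values of 50 and 55 are assumptions made only for illustration and do not come from the article.

```python
# Coefficients quoted in this review from Dallal (2000).
INTERCEPT = -13.971
SLOPE = 3.016


def predict_strength(lbm):
    """Predicted muscle strength for a given lean body mass (LBM)."""
    return INTERCEPT + SLOPE * lbm


def predicted_difference(lbm_a, lbm_b):
    """Predicted difference in strength between two individuals."""
    return SLOPE * (lbm_a - lbm_b)


# Hypothetical LBM values, chosen only to illustrate the arithmetic.
print(round(predict_strength(50.0), 3))            # 136.829
print(round(predicted_difference(55.0, 50.0), 3))  # 15.08
```

The intercept cancels when two predictions are subtracted, which is why the second reading depends on the slope alone.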

## Research questions

**Why do we need to fit a regression equation to a set of data?**

The previous example shows two reasons for fitting a regression equation to a set of data: 1) to describe the data, and 2) to predict a dependent (response) variable from an independent (carrier) one.

**What is the underlying principle of calculating a straight line?**

If the data points in a scatter plot lie close to a line, the line represents, matches, or gives a good fit to the data. If they do not, then the line that has most of the points closer to it than any other line is the one that gives a good fit. Further, if the line is used to predict values, these values should be close to the observed ones. In other words, the residuals (observed values minus predicted values) should be small.
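As a rough illustration of this principle, the following Python sketch compares the residuals of two candidate lines. The data points and both candidate lines are made up for the example and are not taken from Dallal (2000).

```python
def residuals(xs, ys, intercept, slope):
    """Observed minus predicted value for each data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    """Single number summarizing how far the points lie from the line."""
    return sum(r * r for r in residuals(xs, ys, intercept, slope))


# Made-up data that lie close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

close_line = sum_squared_residuals(xs, ys, intercept=0.0, slope=2.0)
far_line = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.0)

# The line passing near the points leaves much smaller residuals.
print(close_line < far_line)
```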

**How is the linear regression (least-squares) equation used to illustrate the best-fitting line?**

The criterion, as the name implies, is that the sum of squared residuals (observed minus predicted values) is minimal for the best-fitting line. This applies to a line fitted to a set of sample data, with the aim of generalizing to the population from which the sample was taken. For the population, there is a slightly different linear regression equation. It states that the output (dependent) variable on the Y-axis can be predicted from the input (independent) variable on the X-axis after adding a random error (εᵢ).
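A minimal sketch of the least-squares criterion, using the standard closed-form solution for a single predictor (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The data are made up for illustration, and the helper names are assumptions rather than anything defined in the article.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def ssr(xs, ys, intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
best = ssr(xs, ys, intercept, slope)

# Any other line, such as one with a slightly shifted slope, does worse.
print(best <= ssr(xs, ys, intercept, slope + 0.05))
```

The comparison at the end is only a spot check against one nearby line. The closed-form solution guarantees the minimum over all straight lines.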

**Is the sample regression equation an accurate estimate of the population regression equation?**

This holds only with a reservation, which concerns the confidence bands about the regression line. They can be understood in the same way as the standard error of the mean (the standard deviation of the sampling distribution of the mean), with one exception: the uncertainty in the estimated mean of the dependent variable grows as the value of the independent variable moves farther from its mean.
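The widening of the bands can be sketched with the textbook formula for the standard error of the fitted mean at a point x0: s * sqrt(1/n + (x0 - mean of x)^2 / Sxx), where s is the residual standard deviation. The data and function names below are assumptions for illustration. The article's own notation may differ.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def se_fitted_mean(xs, ys, x0):
    """Standard error of the estimated mean response at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    intercept, slope = fit_line(xs, ys)
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(ssr / (n - 2))  # residual standard deviation
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)


# Made-up sample data; the mean of xs is 3.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# The standard error is smallest at the mean of x and grows away from it.
for x0 in (3.0, 5.0, 8.0):
    print(x0, round(se_fitted_mean(xs, ys, x0), 4))
```

A confidence band is obtained by multiplying this standard error by the appropriate t quantile, so the band inherits the same widening.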

## Sources of data

The data Dallal (2000) used, as stated in the second part of his article (linked to the main article), are cross-sectional. This type of data can be used when the sampling method is unweighted and/or unstratified. It can also be used if the researcher is concerned only with small probabilities. Longitudinal data give more statistical power. However, in repeated cross-sectional analysis, the new subjects added at each analysis compensate for the inherently lower statistical power (Yee and Niemeier, 1996).

## Data collection strategies and methods

A good data collection strategy should have two objectives. The first is having motivated respondents, whose motivation is affected by the time required, their trust in statistics, the difficulty of the questionnaire, and the benefit they receive. The second is having high-quality data, which depends on tailoring to the sampled individuals, the sampling method, and good data collection instruments (Statistics Norway, 2007).

Methods of data collection are many, and the selection of a particular method depends on the available resources, reliability, the resources for analysis and reporting, and the skills and knowledge of the analyst. Some of these methods are case studies, behavior observation checklists, attitude and opinion surveys, and questionnaires distributed by mail, e-mail, or phone. Other methods include time series (evaluating one variable over a period such as a week) and individual or group interviews (The Ohio State University, 2005).

## Conclusions

Dallal (2000) concluded that simple linear regression lets us predict a dependent variable from an independent one, so we need not start from the beginning each time we add information. The regression line is important because it makes the estimation of a dependent variable more accurate and allows the response variable to be estimated for individuals whose carrier-variable values are not included in the data. The author also noted two ways of predicting a variable: from within the range of values of the independent variable in the given sample (interpolation) or from outside this range (extrapolation). The author recommended the first as the safer one, while noting concerns about how to demonstrate the linearity of the relationship between the two variables.
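The distinction between interpolation and extrapolation can be made concrete with a small Python sketch. The function name and the sample values are hypothetical and serve only to show the range check.

```python
def prediction_kind(x_new, observed_xs):
    """Classify a prediction by whether x_new lies inside the observed range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical observed values of the independent variable.
observed = [35.0, 42.0, 48.0, 55.0, 61.0]

print(prediction_kind(50.0, observed))  # interpolation
print(prediction_kind(75.0, observed))  # extrapolation
```

A check like this does not establish that the relationship is linear inside the range. It only flags predictions for which the sample offers no direct support.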

## References

Dallal, G. (2000). *Introduction to simple linear regression*. Web.

Frey, B. (2006). *Statistics Hacks*. Sebastopol, CA: O’Reilly Media Inc.

Statistics Norway (2007). *Strategy for data collection*. Web.

The Ohio State University (2005). *Bulletin Extension – Step Four: Methods of Data Collection*. Web.

Yee, J. L., and Niemeier, D. (1996). Advantages and Disadvantages: Longitudinal vs. Repeated Cross-Section Survey-A Discussion Paper. *Project Battelle*, *94*, 16-22.