Impact of Factors on Student Performance: A Regression Study Sample Paper-Get Cheap Research and Thesis Help Online

Section 1: Introduction

Provide a brief background of the study. What is the study about? What is the significance of the study? Are there any benefits that may come out of the findings? If there are any background studies, cite them (Max ½ page)

Section 2: Data & Methodology

Describe your data and regression equation. State the hypothesis and define the variables. What kind of relationship do you expect to find? (Max ½ page)

Section 3: Analysis and Results

Present a summary of your final results. Be sure to show a summary of the results of any transformations you have performed. Also, provide a brief explanation of the results. Are the results consistent with your expectations prior to the study? (Max ½ – 1 page)

Section 4: Conclusions and Study Limitations

Briefly summarize your conclusions. Are there any limitations or possible improvements? (Max ½ page)


Sample Research Paper on Regression Analysis


A Regression Study on the Impact of Factors on Student Performance[1]



  1. Introduction

The study seeks to determine whether a relationship exists between student performance and a set of socioeconomic factors.


  2. Data and Methodology

Empirical data for this study was obtained from 224 districts in the State of Massachusetts. The response variable (Y) is student performance, measured by the mean score on the Massachusetts Comprehensive Assessment System (MCAS) administered in May 1998 to 10th graders. The following five-variable multiple regression model (one response variable and four explanatory variables) is specified:


Y =  β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε



X1 = Student-to-teacher ratio

X2 = Average teacher’s salary

X3 = Median household income

X4 = Percentage of single family household


Pursuant to the purpose of this study, the regression null hypothesis is as follows:


H0: β1 = β2 = β3 = β4 = 0 [There is no regression relationship]

H1: At least one βj ≠ 0, for j = 1, 2, 3, 4


The dataset for the study is presented in Appendix 1. A priori, X1 and X4 are expected to have a negative impact on student performance. A high student-teacher ratio (X1) suggests less individualized attention in the classroom. In the same vein, a high percentage of single family households (X4) might limit the comprehensive family care, from both parents, that many believe is essential to the wholesome growth and development of a child.


The impact of teacher’s salary (X2) and household income (X3) is expected to be positive. A high teacher’s salary may serve as an incentive for more effective teaching, in that the teacher is better motivated. It is also arguable that households with more disposable income are more likely to be able to provide their children with better or additional learning support, which may include private tutoring, electronic learning devices, and a quality of life in which the negative impact of poverty on a child’s emotional wellbeing is eliminated.


  3. Analysis and Results

Regression results are summarized in Table 1, below. The F-statistic of 88.37 (p-value < 0.0001) indicates that the regression, as a whole, is statistically significant. The coefficient of determination (R2) suggests that about 62 percent of the variation in student performance is explained by the four explanatory variables combined. In the absence of any significant residual problem, the prediction model is specified as follows:


ŷ = 231.89 – 0.496X1 – 0.023X2 + 0.293X3 – 0.878X4


Table 1: Summary of Regression Results


R2        F        p-value
0.6174    88.37    0.0000

Variable     Coefficient   Std. Error   t-Statistic   p-value
Intercept       231.8944       3.3696       68.8201    0.0000
STR (X1)         -0.4961       0.1308       -3.7944    0.0002
TSAL (X2)        -0.0233       0.0746       -0.3129    0.7547
INC (X3)          0.2929       0.0344        8.5093    0.0000
SGL (X4)         -0.8780       0.1737       -5.0556    0.0000
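As a consistency check on Table 1, the overall F-statistic can be recovered from the coefficient of determination, the number of districts (n = 224), and the number of explanatory variables (k = 4). The short Python sketch below does this. The function name f_from_r2 is illustrative and not part of the original study, and any small gap from the reported 88.37 is presumably due to R2 being rounded to four decimals.

```python
def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F-statistic for a regression that includes an intercept.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Values taken from the text and from Table 1.
N_DISTRICTS = 224
N_REGRESSORS = 4
f_stat = f_from_r2(r2=0.6174, n=N_DISTRICTS, k=N_REGRESSORS)

df_resid = N_DISTRICTS - N_REGRESSORS - 1
print(f"F = {f_stat:.2f} on ({N_REGRESSORS}, {df_resid}) degrees of freedom")
```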


The test of significance for the independent variables, measured by the t statistics, shows that with the exception of teacher’s salary (X2), all the explanatory variables are statistically significant. In other words, they contribute meaningful information to the prediction of student performance. The signs of the coefficients are consistent with prior expectations, again with the exception of X2, which is in any case not significant. Taking the coefficient for X4 as an example, the average student score decreases by about 0.88 points for every one-percentage-point increase in single family households, holding the other variables constant.
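Each t statistic in Table 1 is simply the coefficient divided by its standard error, so the significance pattern described above can be reproduced directly from the table. The sketch below assumes a two-sided test at the 5 percent level with a critical value of roughly 1.97 for 219 degrees of freedom. Both the significance level and the critical value are assumptions made here for illustration, since the paper reports only p-values.

```python
# Coefficients and standard errors copied from Table 1.
table1 = {
    "Intercept": (231.8944, 3.3696),
    "STR (X1)": (-0.4961, 0.1308),
    "TSAL (X2)": (-0.0233, 0.0746),
    "INC (X3)": (0.2929, 0.0344),
    "SGL (X4)": (-0.8780, 0.1737),
}

# Assumed two-sided 5% critical value for about 219 degrees of freedom.
T_CRITICAL = 1.97

# t statistic = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in table1.items()}
significant = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name:<10} t = {t:8.3f}  ({verdict})")
```

Under these assumptions only TSAL (X2) falls short of the critical value, which matches the p-values reported in the table.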


  • Test for multicollinearity

Often, when collinearity exists, the sign of a regression coefficient is the opposite of what logic suggests. Also, an individual t-value may not be significant even when the overall F is significant. To confirm that the lack of statistical significance and the unexpected sign associated with X2 (teacher’s salary) are not due to multicollinearity, the correlations between the independent variables are examined. The lower half of the correlation matrix is presented in Table 2.


Table 2: Correlation Matrix of Independent Variables


      X1        X2        X3        X4
X1    1
X2    0.0772    1
X3   -0.1635    0.5503    1
X4    0.2829   -0.2433   -0.6179    1


The correlation coefficient between teacher’s salary (X2) and household income (X3) is 0.55, which may be considered high. Even larger in magnitude is the correlation coefficient between household income (X3) and percentage of single family households (X4), which is about -0.62. This is, however, a casual observation. A more rigorous analysis of multicollinearity is based on the calculation of the variance inflation factor (VIF) for each of the variables, defined as follows:


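VIFj = 1/(1 – Rj2)

where Rj2 is the coefficient of determination obtained by regressing Xj on the remaining independent variables.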


The variance inflation factors for each of the independent variables are summarized in Table 3.


Table 3: Variance Inflation Factors (VIF)


X1 1.12
X2 1.51
X3 2.25
X4 1.73


A high VIF means that the variance (and therefore the standard error) of the regression coefficient is inflated, so that the corresponding t-value is smaller than it would otherwise be. A helpful rule of thumb is that collinearity is a concern if VIF > 5. As Table 3 shows, the highest VIF is 2.25, meaning that Var(b3) is only 2.25 times what it would be if collinearity did not exist. This highest VIF is too small to cause concern. With this finding, one can conclude that collinearity is probably not a problem in the model.
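The VIF values can be cross-checked against the correlations in Table 2, because the VIF of each regressor equals the corresponding diagonal element of the inverse of the regressors' correlation matrix. The sketch below is a minimal illustration using only the Python standard library. The invert helper is written here for the example and is not taken from the original study.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Move the row with the largest pivot candidate into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Full symmetric correlation matrix built from Table 2 (order: X1, X2, X3, X4).
corr = [
    [1.0, 0.0772, -0.1635, 0.2829],
    [0.0772, 1.0, 0.5503, -0.2433],
    [-0.1635, 0.5503, 1.0, -0.6179],
    [0.2829, -0.2433, -0.6179, 1.0],
]

inverse = invert(corr)
vif = [inverse[j][j] for j in range(len(corr))]

for name, value in zip(["X1", "X2", "X3", "X4"], vif):
    print(f"{name}: VIF = {value:.2f}")
```

The values obtained this way agree with Table 3 to within rounding, and all of them fall well below the rule-of-thumb threshold of 5.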


  • Test for Heteroskedasticity

Heteroskedasticity, which is the problem of unequal error variance, is typically encountered with cross-sectional data. When heteroskedasticity is present, the least squares estimates of the regression coefficients (i.e., the bi) remain unbiased but are no longer efficient. In addition, the conventional standard errors are biased, often downward, so that confidence intervals can be too narrow and give a false sense of precision.


A casual and often effective means to see if this problem exists is the residual plot. If heteroskedasticity exists, the plot tends to spread out as the value of X (or ŷ) increases. This means that as the values of X become larger, there is increasing uncertainty associated with the response of Y. Figure 1 shows the residual plot.


Figure 1: Residual Scatter Plot


While the scatter of the residuals appears wider in the mid-section of the graph than at the beginning, there is no definitive indication that heteroskedasticity is present. Nevertheless, a popular and often effective variance-stabilizing transformation is the logarithmic transformation of Y alone, as follows:


Ln(Y) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε


The regression results of this transformed model are summarized in Table 4. These results do not show a marked improvement over the original model. The coefficients have the same manner of statistical significance as before.  Also, the value of the coefficient of determination is virtually unchanged.


Table 4: Summary Results of the Logarithmic Transformation


R2        F        p-value
0.6188    88.88    0.0000

Variable     Coefficient   Std. Error   t-Statistic   p-value
Intercept         5.4485       0.0145      375.9715    0.0000
STR (X1)         -0.0021       0.0006       -3.7968    0.0002
TSAL (X2)        -0.0001       0.0003       -0.3626    0.7172
INC (X3)          0.0012       0.0001        8.3710    0.0000
SGL (X4)         -0.0040       0.0007       -5.3021    0.0000


With this outcome, the final prediction model is based on the original (non-transformed) model.
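Although the log model is not retained, its coefficients are on a different scale from those of the original model and are easier to compare once converted. In a model with Ln(Y) as the response, a coefficient b implies a proportional change in Y of exp(b) – 1 per unit change in the regressor. The sketch below applies this conversion to the Table 4 estimates. It is only an illustration of how the two sets of results line up, and the variable names are chosen here for the example.

```python
import math

# Coefficients of the log-transformed model, copied from Table 4.
log_model = {
    "STR (X1)": -0.0021,
    "TSAL (X2)": -0.0001,
    "INC (X3)": 0.0012,
    "SGL (X4)": -0.0040,
}
LOG_INTERCEPT = 5.4485

# Percentage change in the mean score per one-unit change in each regressor.
pct_effects = {name: 100.0 * (math.exp(b) - 1.0) for name, b in log_model.items()}

# Back-transformed intercept: the score implied when every regressor is zero.
# This ignores any retransformation adjustment for the error term.
baseline_score = math.exp(LOG_INTERCEPT)

for name, pct in pct_effects.items():
    print(f"{name:<10} {pct:+.2f}% per unit")
print(f"Back-transformed intercept: {baseline_score:.1f}")
```

The back-transformed intercept is close to the intercept of the original model in Table 1, which is consistent with the observation that the transformation changes little.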


  4. Conclusions, Limitations, and Future Improvements

This study examined the relationship between student performance and a set of socioeconomic factors across Massachusetts school districts. Student performance is measured by the mean score on the Massachusetts Comprehensive Assessment System (MCAS) administered to 10th graders.


Results show that student-teacher ratio, teacher’s salary, household income, and percentage of single family households, as a whole, affect student performance. Altogether, they account for about 62 percent of the variation in the mean assessment score. More specifically, there is evidence that students’ performance rises with household income, on average. On the flip side, there is also evidence that performance is hampered by a high student-teacher ratio as well as by a high incidence of single family households.


Due to the reasonably high coefficient of determination, it may be possible to predict student performance using the parameter estimates of this model. For example, if STR (X1) = 18, TSAL (X2) = 50, INC (X3) = 60, and SGL (X4) = 5, the predicted mean student score would be determined as follows:


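ŷ = 231.89 – 0.496(18) – 0.023(50) + 0.293(60) – 0.878(5) ≈ 234.98

(The value 234.98 is obtained with the four-decimal coefficients in Table 1; the rounded coefficients shown here give about 235.00.)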
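The prediction above can be reproduced with a few lines of Python. The sketch below evaluates the fitted equation twice, once with the four-decimal coefficients from Table 1 and once with the three-decimal coefficients written in the equation, to show how much rounding matters. The function name predict_score and its argument names are illustrative only.

```python
def predict_score(coefs, str_ratio, tsal, inc, sgl):
    """Evaluate the fitted equation y-hat = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4."""
    b0, b1, b2, b3, b4 = coefs
    return b0 + b1 * str_ratio + b2 * tsal + b3 * inc + b4 * sgl


# Coefficients as reported in Table 1 (four decimals).
TABLE1_COEFS = (231.8944, -0.4961, -0.0233, 0.2929, -0.8780)
# The same coefficients rounded as in the prediction equation.
ROUNDED_COEFS = (231.89, -0.496, -0.023, 0.293, -0.878)

# Example inputs from the text.
inputs = dict(str_ratio=18, tsal=50, inc=60, sgl=5)

pred_full = predict_score(TABLE1_COEFS, **inputs)
pred_rounded = predict_score(ROUNDED_COEFS, **inputs)

print(f"Table 1 coefficients: {pred_full:.2f}")
print(f"Rounded coefficients: {pred_rounded:.2f}")
```

The two results differ by only about 0.02 points, so the rounding has no practical effect on the prediction.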


An important improvement to this study would be an investigation of how the study time spent by students outside of school and the level of education of the students’ parents might affect performance. Because of the widely publicized disparity in performance across ethnic lines, it may also be helpful to determine if, after accounting for all the above factors, there is still a difference in performance between the various racial groups. This latter inquiry can be pursued as a comparative study.










[1] Adapted from Pat Obi, Professor, Purdue University.