Impact of Factors on Student Performance: A Regression Study
Section 1: Introduction
Section 2: Data & Methodology
Describe your data and regression equation. State the hypothesis and define the variables. What kind of relationship do you expect to find? (Max ½ page)
Section 3: Analysis and Results
Present summary of your final results. Be sure to show a summary of results of any transformations you have performed. Also, provide a brief explanation of the results. Are the results consistent with your expectations prior to the study? (Max ½ – 1 page)
Section 4: Conclusions and Study Limitations
Briefly summarize your conclusions. Are there any limitations or possible improvements? (Max ½ page)
Sample Research Paper on Regression Analysis
A Regression Study on the Impact of Factors on Student Performance[1]
- Introduction
The study seeks to determine if a relationship exists between student performance and a set of
socioeconomic factors.
- Data and Methodology
Empirical data for this study were obtained from 224 districts in the State of Massachusetts. The response variable (Y) is student performance, measured by the mean score on the Massachusetts Comprehensive Assessment System (MCAS) administered in May 1998 to 10^{th} graders. The following five-variable multiple regression model (one response variable and four explanatory variables) is specified:
Y = β_{0} + β_{1}X_{1} + β_{2}X_{2} + β_{3}X_{3} + β_{4}X_{4} + ε
Where:
X_{1} = Student-to-teacher ratio
X_{2} = Average teacher’s salary
X_{3} = Median household income
X_{4} = Percentage of single family households
Pursuant to the purpose of this study, the regression null hypothesis is as follows:
H_{0}: β_{1} = β_{2} = β_{3} = β_{4} = 0 [There is no regression relationship]
The dataset for the study is presented in Appendix 1. A priori, X_{1} and X_{4} are expected to have a negative impact on student performance. A high student-teacher ratio (X_{1}) suggests less individualized attention in the classroom. In the same vein, a high percentage of single family households (X_{4}) might limit the kind of comprehensive family care, from both parents, that many believe is essential to the wholesome growth and development of a child.
The impact of teacher’s salary (X_{2}) and household income (X_{3}) is expected to be positive. A high teacher’s salary may serve as an incentive for more effective teaching, in that the teacher is better motivated. It is also arguable that households with more disposable income are better able to provide their children with better and/or additional learning support, which may include private tutoring, electronic learning devices, and a quality of life in which the negative impact of poverty on the emotional wellbeing of a child is eliminated.
- Analysis and Results
Regression results are summarized in Table 1, below. The F-statistic of 88.37 (p-value < 0.0001) indicates that the regression, as a whole, is statistically significant. The coefficient of determination (R^{2}) suggests that about 62 percent of the variation in student performance is explained by the four explanatory variables combined. In the absence of any significant residual problem, the prediction model is specified as follows:
ŷ = 231.89 – 0.496X_{1} – 0.023X_{2} + 0.293X_{3} – 0.878X_{4}
Table 1: Summary of Regression Results
R^{2} | F | p-value |
0.6174 | 88.37 | 0.0000 |
Variable | Coefficient | Std. Error | t-Statistic | p-value |
Intercept | 231.8944 | 3.3696 | 68.8201 | 0.0000 |
STR (X_{1}) | -0.4961 | 0.1308 | -3.7944 | 0.0002 |
TSAL (X_{2}) | -0.0233 | 0.0746 | -0.3129 | 0.7547 |
INC (X_{3}) | 0.2929 | 0.0344 | 8.5093 | 0.0000 |
SGL (X_{4}) | -0.8780 | 0.1737 | -5.0556 | 0.0000 |
The test of significance for the independent variables, measured by the t-statistics, shows that with the exception of teacher’s salary (X_{2}), all the explanatory variables are statistically significant. In other words, they contribute meaningful information to the prediction of student performance. The signs of the coefficients are consistent with prior expectations, with the exception of X_{2}, which is not statistically significant. Taking the coefficient for X_{4} as an example, the average student score decreases by about 0.88 points for every one-percentage-point increase in single family households, holding the other variables constant.
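The summary statistics in Table 1 can be cross-checked from the reported figures alone. The sketch below is illustrative and is not the software used in the study: it recomputes the overall F-statistic from R^{2}, the sample size (224 districts), and the number of explanatory variables (4), and recomputes each t-statistic as the coefficient divided by its standard error. The variable names are chosen for this example only.

```python
# Values copied from Table 1; n and k come from the Data and Methodology section.
n = 224           # number of districts
k = 4             # number of explanatory variables
r_squared = 0.6174

# Overall F-statistic implied by R^2, with k and n - k - 1 degrees of freedom.
f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# (coefficient, standard error) pairs from Table 1.
table1 = {
    "Intercept": (231.8944, 3.3696),
    "STR": (-0.4961, 0.1308),
    "TSAL": (-0.0233, 0.0746),
    "INC": (0.2929, 0.0344),
    "SGL": (-0.8780, 0.1737),
}

# Each t-statistic is the coefficient divided by its standard error.
t_stats = {name: coef / se for name, (coef, se) in table1.items()}

print(f"F = {f_stat:.2f}")
for name, t in t_stats.items():
    print(f"{name:9s} t = {t:8.4f}")
```

Because the table entries are rounded to four decimals, the recomputed values should agree with the reported F and t-statistics only up to small rounding differences.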
- Test for multicollinearity
Often, when collinearity exists, the sign of a regression coefficient is the opposite of what logic suggests. Also, a t-value may not be significant even when the F-statistic is significant. To check whether the lack of statistical significance and the illogical sign associated with X_{2} (teacher’s salary) are due to multicollinearity, the correlations between the independent variables are examined. The correlation half matrix is presented in Table 2.
Table 2: Correlation Matrix of Independent Variables
Variable | X_{1} | X_{2} | X_{3} | X_{4} |
X_{1} | 1 | | | |
X_{2} | 0.0772 | 1 | | |
X_{3} | -0.1635 | 0.5503 | 1 | |
X_{4} | 0.2829 | -0.2433 | -0.6179 | 1 |
The correlation coefficient between teacher’s salary (X_{2}) and household income (X_{3}) is 0.55, which may be considered high. Even larger in magnitude is the correlation coefficient between household income (X_{3}) and percentage of single family households (X_{4}), which is about -0.62. This is, however, a casual observation. A more rigorous analysis of multicollinearity is based on the calculation of the variance inflation factor (VIF) for each of the variables, defined as follows:
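VIF_{j} = 1/(1 – R_{j}^{2}), where R_{j}^{2} is the coefficient of determination from regressing X_{j} on the other independent variables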
The variance inflation factors for each of the independent variables are summarized in Table 3.
Table 3: Variance Inflation Factors (VIF)
X_{1} | 1.12 |
X_{2} | 1.51 |
X_{3} | 2.25 |
X_{4} | 1.73 |
A high VIF means that the variance (and therefore the standard error) of the regression coefficient is inflated, so that the corresponding t-value is smaller than it would otherwise be. A helpful rule of thumb is that collinearity is a problem if VIF > 5. As Table 3 shows, the highest VIF is 2.25, meaning that Var(b_{3}) is only 2.25 times what it would be if collinearity did not exist. This highest VIF is too small to cause concern. With this finding, one can conclude that collinearity is probably not a problem in the model.
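The variance inflation factors in Table 3 can be reproduced from the correlations in Table 2 without the raw data, because the VIF of each variable equals the corresponding diagonal entry of the inverse of the correlation matrix of the independent variables. The following is a minimal sketch under that identity; the `invert` helper is a hypothetical routine written for this example, not part of the original study.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Full symmetric correlation matrix built from the half matrix in Table 2.
corr = [
    [1.0,     0.0772, -0.1635,  0.2829],
    [0.0772,  1.0,     0.5503, -0.2433],
    [-0.1635, 0.5503,  1.0,    -0.6179],
    [0.2829, -0.2433, -0.6179,  1.0],
]

inverse = invert(corr)
# The j-th diagonal entry of the inverse equals 1 / (1 - R_j^2), i.e. VIF_j.
vif = [inverse[j][j] for j in range(len(corr))]

for j, value in enumerate(vif, start=1):
    print(f"VIF for X{j}: {value:.2f}")
```

With the four-decimal correlations from Table 2, the computed values should match Table 3 to within rounding.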
- Test for Heteroskedasticity
Heteroskedasticity, which is the problem of unequal error variance, is typically encountered with cross-sectional data. When heteroskedasticity is present, the regression coefficients (i.e. the b_{i}) remain unbiased but are no longer efficient. In addition, the usual standard errors are biased, often downward, so that confidence intervals may be too narrow, giving a false sense of precision.
A casual and often effective means to see if this problem exists is the residual plot. If heteroskedasticity exists, the plot tends to spread out as the value of X (or ŷ) increases. This means that as the values of X become larger, there is increasing uncertainty associated with the response of Y. Figure 1 shows the residual plot.
Figure 1: Residual Scatter Plot
While the scatter of the residuals appears wider in the mid-section of the graph than at the beginning, there is no definitive indication that heteroskedasticity is present. Nevertheless, a popular and often effective variance-stabilizing transformation is the logarithmic transformation of only the value of Y, as follows:
ln(Y) = β_{0} + β_{1}X_{1} + β_{2}X_{2} + β_{3}X_{3} + β_{4}X_{4} + ε
The regression results of this transformed model are summarized in Table 4. These results do not show a marked improvement over the original model. The coefficients show the same pattern of statistical significance as before. Also, the value of the coefficient of determination is virtually unchanged.
Table 4: Summary Results of the Logarithmic Transformation
R^{2} | F | p-value |
0.6188 | 88.88 | 0.0000 |
Variable | Coefficient | Std. Error | t-Statistic | p-value |
Intercept | 5.4485 | 0.0145 | 375.9715 | 0.0000 |
STR (X_{1}) | -0.0021 | 0.0006 | -3.7968 | 0.0002 |
TSAL (X_{2}) | -0.0001 | 0.0003 | -0.3626 | 0.7172 |
INC (X_{3}) | 0.0012 | 0.0001 | 8.3710 | 0.0000 |
SGL (X_{4}) | -0.0040 | 0.0007 | -5.3021 | 0.0000 |
With this outcome, the final prediction model is based on the original (non-transformed) model.
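Although the transformed model is not retained, its coefficients read differently from those of the original model: with ln(Y) as the response, a one-unit increase in an explanatory variable multiplies Y by exp(b), which corresponds to a percent change of 100 × (exp(b) – 1). The sketch below applies this conversion to the slope coefficients in Table 4. It is an illustrative aid only, and the dictionary names are chosen for this example.

```python
import math

# Slope coefficients from Table 4 (the response variable is ln(Y)).
log_model = {"STR": -0.0021, "TSAL": -0.0001, "INC": 0.0012, "SGL": -0.0040}

# In a log-level model, a one-unit rise in X multiplies Y by exp(b),
# so the percent change in Y is 100 * (exp(b) - 1).
percent_effect = {name: 100 * (math.exp(b) - 1) for name, b in log_model.items()}

for name, pct in percent_effect.items():
    print(f"{name}: {pct:+.3f}% change in mean score per one-unit increase")
```

For example, the coefficient of -0.0040 on SGL corresponds to a decrease of roughly 0.4 percent in the mean score for each one-percentage-point increase in single family households.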
- Conclusions, Limitations, and Future Improvements
This study examined the relationship between student performance in grammar school and a set of socioeconomic factors. Student performance is measured by the mean score on the Massachusetts Comprehensive Assessment System (MCAS) administered to 10^{th} graders.
Results show that student-teacher ratio, teacher’s salary, household income, and percentage of single family households, as a whole, affect student performance. Altogether, they account for about 62 percent of the variation in the mean student assessment score. More specifically, there is evidence that students’ performance rises with household income, on average. On the flip side, there is also evidence that performance is hampered by a high student-teacher ratio as well as by a high incidence of single family households.
Due to the reasonably high coefficient of determination, it may be possible to predict student performance using the parameter estimates of this model. For example, if STR (X_{1}) = 18, TSAL (X_{2}) = 50, INC (X_{3}) = 60, and SGL (X_{4}) = 5, the predicted mean student score would be determined as follows:
ŷ = 231.89 – 0.496(18) – 0.023(50) + 0.293(60) – 0.878(5) ≈ 234.98 (computed with the unrounded coefficients in Table 1)
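The prediction above can be reproduced with a few lines of code. The sketch below uses the four-decimal coefficients from Table 1. The function name `predict_score` and its parameter names are hypothetical, introduced only for this illustration.

```python
# Four-decimal coefficients from Table 1 (original, non-transformed model).
COEF = {"Intercept": 231.8944, "STR": -0.4961, "TSAL": -0.0233,
        "INC": 0.2929, "SGL": -0.8780}

def predict_score(str_ratio, tsal, inc, sgl):
    """Return the predicted mean MCAS score for the given explanatory values."""
    return (COEF["Intercept"]
            + COEF["STR"] * str_ratio
            + COEF["TSAL"] * tsal
            + COEF["INC"] * inc
            + COEF["SGL"] * sgl)

# Inputs from the worked example: STR = 18, TSAL = 50, INC = 60, SGL = 5.
y_hat = predict_score(str_ratio=18, tsal=50, inc=60, sgl=5)
print(round(y_hat, 2))  # 234.98
```

Using the three-decimal coefficients shown in the prediction equation instead gives a value of about 235.00, so the small difference from 234.98 is a rounding effect.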
An important improvement to this study would be to investigate how the study time spent by students outside of school and the level of education of the students’ parents might affect performance. Because of the widely publicized disparity in performance across ethnic lines, it may also be helpful to determine if, after accounting for all the above factors, there is still a difference in performance between the various racial groups. This latter inquiry can be pursued as a comparative study.
[1] Adapted from Pat Obi, Professor, Purdue University.