TOTAL MONTHLY ENERGY AND PEAK HOUR DEMAND (Using Regression Model to Show that a Statistical Relationship Exists)
INTRODUCTION
Car dealers across North America use the "Blue Book" to help them determine the value of used cars that their customers trade in when purchasing new cars. The book, which is published monthly, lists the trade-in values for each car model according to its condition and optional features. The values are determined on the basis of the average price paid at recent used-car auctions, the source of supply for many used-car dealers. However, the Blue Book does not indicate the value determined by the odometer reading, despite the fact that a critical factor for used-car buyers is how far the car has been driven. To examine this issue, a used-car dealer randomly selected 100 three-year-old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player, and air conditioning. The dealer recorded the price (in $1,000s) and the number of miles (in thousands) on the odometer. Some of these data are listed below. The dealer wants to find a regression model that determines how the odometer reading (independent variable) is related to the auction selling price (dependent variable) of used Ford Tauruses.
Objectives:
- The Sample Regression Model
- The Standard Error of Estimate
- Test to Determine Whether There Is Enough Evidence to Infer that There Is a Linear Relationship between the Peak Hour Demand and the Total Monthly Energy
- Measuring the Strength of the Linear Relationship between Total Monthly Energy and Peak Hour Demand
- Are the Peak Hour Demand and the Total Monthly Energy Linearly Related? Testing the Coefficient of Correlation
- Predicting the Peak Hour Demand and Estimating the Mean Peak Hour Demand
DATA AND PRESENTATION
| Customer | Monthly Energy (kwh) | Peak Hour (kw) | Customer | Monthly Energy (kwh) | Peak Hour (kw) |
|---|---|---|---|---|---|
| 1 | 679 | 0.79 | 29 | 1381 | 3.48 |
| 2 | 292 | 0.44 | 30 | 1428 | 7.58 |
| 3 | 1012 | 0.56 | 31 | 1255 | 2.63 |
| 4 | 493 | 0.79 | 32 | 1777 | 4.99 |
| 5 | 582 | 2.70 | 33 | 370 | 0.59 |
| 6 | 1156 | 3.64 | 34 | 2316 | 8.19 |
| 7 | 997 | 4.73 | 35 | 1130 | 4.79 |
| 8 | 2189 | 9.50 | 36 | 463 | 0.51 |
| 9 | 1097 | 5.34 | 37 | 770 | 1.74 |
| 10 | 2078 | 6.85 | 38 | 724 | 4.10 |
| 11 | 1818 | 5.84 | 39 | 808 | 3.94 |
| 12 | 1700 | 5.21 | 40 | 790 | 0.96 |
| 13 | 747 | 3.25 | 41 | 783 | 3.29 |
| 14 | 2030 | 4.43 | 42 | 406 | 0.44 |
| 15 | 1643 | 3.16 | 43 | 1242 | 3.24 |
| 16 | 414 | 0.50 | 44 | 658 | 2.14 |
| 17 | 354 | 0.17 | 45 | 1746 | 5.71 |
| 18 | 1276 | 1.88 | 46 | 895 | 4.12 |
| 19 | 745 | 0.77 | 47 | 1114 | 1.90 |
| 20 | 795 | 3.70 | 48 | 413 | 0.51 |
| 21 | 540 | 0.56 | 49 | 1787 | 8.33 |
| 22 | 874 | 1.56 | 50 | 3560 | 14.94 |
| 23 | 1543 | 5.28 | | | |
| 24 | 1029 | 0.64 | | | |
| 25 | 710 | 4.00 | | | |
| 26 | 1434 | 0.31 | | | |
| 27 | 837 | 4.20 | | | |
| 28 | 1748 | 4.88 | | | |
IDENTIFY
The problem objective is to analyze the relationship between two interval variables. Because we believe that the total monthly energy affects the peak hour demand, we identify the former as the independent variable (Total Monthly Energy), which we label x, and the latter as the dependent variable (Peak Hour Demand), which we label y.
The total monthly energy and the peak hour demand are random variables. We hypothesize that for each possible total monthly energy value, there is a theoretical population of peak hour demands that are normally distributed with a mean that is a linear function of the total monthly energy and a variance that is constant.
THEORY APPLIED
Simple Linear Regression
Regression analysis enables us to develop a model to predict the values of a numerical variable based on the values of other variables. In regression analysis, the variable we wish to predict is called the dependent variable. The variables used to make the prediction are called independent variables. In addition to predicting values of the dependent variable, regression analysis also allows us to identify the type of mathematical relationship that exists between a dependent and an independent variable, to quantify the effect that changes in the independent variable have on the dependent variable, and to identify unusual observations. Regression analysis lets you use data to explain and predict.
In simple linear regression, a single numerical independent variable, X, is used to predict the numerical dependent variable, Y. The model describes the influence of the independent variable (X) on the dependent variable (Y) through an equation. The simple linear regression model is

Yᵢ = β₀ + β₁Xᵢ + εᵢ

Where:
Yᵢ = dependent variable (sometimes referred to as the response variable) for observation i
β₀ = Y intercept for the population
β₁ = regression coefficient or slope for the population
Xᵢ = independent variable (sometimes referred to as the explanatory variable) for observation i
εᵢ = random error term in Y for observation i
The slope of the line, β₁, represents the expected change in Y per unit change in X. It represents the mean amount that Y changes (either positively or negatively) for a one-unit change in X. The Y intercept, β₀, represents the mean value of Y when X equals 0. The last component of the model, εᵢ, is the vertical distance of the actual value of Yᵢ above or below the predicted value of Yᵢ on the line.
Linear Regression Equation
The predicted value of Y equals the Y intercept plus the slope times the value of X:

Ŷᵢ = b₀ + b₁Xᵢ

Where:
Ŷᵢ = estimated (or predicted) Y value for observation i
b₀ = estimate of the regression intercept
b₁ = estimate of the regression slope
Xᵢ = value of X for observation i
The estimates b₀ and b₁ are obtained by the least-squares method, which minimizes the sum of squared differences between the observed and predicted values of Y.
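As an illustration, the least-squares estimates can be computed directly from data. The following Python sketch is our own illustrative code (not part of the original report); it uses b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and b₀ = Ȳ − b₁X̄:

```python
# Least-squares estimates for simple linear regression:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1 * xbar
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # estimated slope
    b0 = ybar - b1 * xbar   # estimated intercept
    return b0, b1

# Tiny illustrative dataset where y = 1 + 2x exactly
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # -> 1.0 2.0
```

Because the toy data lie exactly on a line, the fitted intercept and slope recover it exactly.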
Measures of Variation
When using the least-squares method to determine the regression coefficients for a set of data, you need to compute three important measures of variation. The first measure, the total sum of squares (SST), is a measure of the variation of the Yᵢ values around their mean Ȳ. In a regression analysis, the total variation or total sum of squares is subdivided into explained variation and unexplained variation. The explained variation, or regression sum of squares (SSR), is due to the relationship between X and Y, and the unexplained variation, or error sum of squares (SSE), is due to factors other than the relationship between X and Y.
Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares: SST = Σ(Yᵢ − Ȳ)²
Regression Sum of Squares: SSR = Σ(Ŷᵢ − Ȳ)²
Error Sum of Squares: SSE = Σ(Yᵢ − Ŷᵢ)²

Where:
Ȳ = mean value of the dependent variable
Yᵢ = observed values of the dependent variable
Ŷᵢ = predicted value of Y for the given Xᵢ value
Coefficient of Determination, r2
By themselves, SSR, SSE, and SST provide little information. However, the ratio of the regression sum of squares (SSR) to the total sum of squares (SST) measures the proportion of variation in Y that is explained by the independent variable X in the regression model. This ratio is called the coefficient of determination: the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called r-squared and is denoted r²:

r² = SSR / SST

A larger value of r² indicates a stronger linear relationship between the two variables, because the use of the regression model has reduced the variability in predicting the dependent variable.
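To make the decomposition concrete, the following Python sketch (a toy dataset of our own, not from this study) fits a least-squares line and verifies that SST = SSR + SSE and r² = SSR/SST:

```python
# Variation decomposition for a least-squares line: SST = SSR + SSE,
# and r^2 = SSR / SST (proportion of variation in y explained by x).
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]            # fitted values on the line
sst = sum((yi - ybar) ** 2 for yi in y)      # total sum of squares
ssr = sum((fi - ybar) ** 2 for fi in yhat)   # regression sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # error sum of squares
r2 = ssr / sst
print(round(sst, 6), round(ssr, 6), round(sse, 6), round(r2, 6))
# -> 10.0 9.8 0.2 0.98
```

Note that SST = SSR + SSE holds because the fitted values come from a least-squares line with an intercept; it does not hold for arbitrary predictions.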
Standard Error of the Estimate
Although the least-squares method results in the line that fits the data with the minimum amount of error, unless all the observed data points fall on a straight line, the prediction line is not a perfect predictor. Just as all data values cannot be expected to be exactly equal to their mean, neither can they be expected to fall exactly on the prediction line. An important statistic, called the standard error of the estimate, measures the variability of the actual Y values from the predicted Y values in the same way that the standard deviation measures the variability of each value around the sample mean. In other words, the standard error of the estimate is the standard deviation around the prediction line, whereas the standard deviation is the standard deviation around the sample mean.

S = √(SSE / (n − 2)) = √(Σ(Yᵢ − Ŷᵢ)² / (n − 2))

where
Yᵢ = actual value of Y for a given Xᵢ
Ŷᵢ = predicted value of Y for a given Xᵢ
SSE = error sum of squares
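The formula above can be sketched in Python; the observed and fitted values below are a small hypothetical example of ours, not data from this study:

```python
import math

# Standard error of the estimate: s = sqrt(SSE / (n - 2)),
# the standard deviation of the y values around the prediction line.
def std_error_of_estimate(y, yhat):
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    return math.sqrt(sse / (len(y) - 2))

# Hypothetical observed vs fitted values from a least-squares line
s = std_error_of_estimate([2, 3, 5, 6], [1.9, 3.3, 4.7, 6.1])
print(round(s, 4))  # -> 0.3162
```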
Inferences About the Slope: t Test
– t test for a population slope
o Is there a linear relationship between X and Y?
– Null and alternative hypotheses
o H0: β1 = 0 (no linear relationship)
o H1: β1 ≠ 0 (linear relationship does exist)
If we reject the null hypothesis, we conclude that there is evidence of a linear relationship.
– Test statistic

t = (b₁ − β₁) / S_b₁, with n − 2 degrees of freedom

Where:
b₁ = regression slope coefficient
β₁ = hypothesized slope
S_b₁ = standard error of the slope
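A minimal sketch of the test statistic, with illustrative numbers that are not from this study:

```python
import math

# t statistic for the slope: t = (b1 - beta1) / s_b1, where
# s_b1 = s / sqrt(SSX) is the standard error of the slope, s is the
# standard error of the estimate, and SSX = sum((x - xbar)^2).
# Degrees of freedom = n - 2.
def slope_t_stat(b1, s, ssx, beta1=0.0):
    s_b1 = s / math.sqrt(ssx)
    return (b1 - beta1) / s_b1

# Illustrative numbers: b1 = 2.0, s = 0.5, SSX = 25 -> s_b1 = 0.1
t = slope_t_stat(2.0, 0.5, 25.0)
print(round(t, 6))  # -> 20.0
```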
Confidence Interval Estimate for the Slope
The confidence interval estimate for the slope can be constructed by taking the sample slope, b₁, and adding and subtracting the critical t value multiplied by the standard error of the slope:

b₁ ± t_{α/2} S_b₁

Where:
t_{α/2} = critical value of the t distribution with n − 2 degrees of freedom
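A corresponding sketch for the interval, with the critical value supplied directly (e.g., read from a t table); the numbers are illustrative, not from this study:

```python
# Confidence interval for the slope: b1 ± t_{alpha/2} * s_b1, where the
# critical value comes from the t distribution with n - 2 degrees of freedom.
def slope_confidence_interval(b1, s_b1, t_crit):
    half_width = t_crit * s_b1
    return b1 - half_width, b1 + half_width

# Illustrative numbers: b1 = 2.0, s_b1 = 0.1, t_{0.025, 48} = 2.0106
lo, hi = slope_confidence_interval(2.0, 0.1, 2.0106)
print(lo, hi)  # about (1.79894, 2.20106)
```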
ANALYSIS
Scatter Diagram
The Sample Regression Line
ŷ = b₀ + b₁x
ŷ = 0.00385x − 0.8828
The slope coefficient b₁ is 0.00385, which means that for each additional 1 kwh of monthly energy, the peak hour demand increases on average by 0.00385 kw.
The intercept is b₀ = −0.8828. Technically, the intercept is the point at which the regression line and the y-axis intersect. This means that when x = 0, the predicted peak hour demand is −0.8828 kw. However, in this case, the intercept is probably meaningless. Because our sample did not include any customers with zero kwh of monthly energy, we have no basis for interpreting b₀. As previously stated, we cannot determine the value of ŷ for a value of x that is far outside the range of the sample values of x. In this case, the smallest and largest values of x are 292 and 3560, respectively. Because x = 0 is not in this interval, we cannot safely interpret the value of ŷ when x = 0.
It is important to bear in mind that the interpretation of the coefficients pertains only to the sample, which consists of 50 observations. To infer information about the population, we need statistical inference techniques, which are described subsequently.
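As a check, the sample regression line can be recomputed from the 50 observations in the data table. The Python sketch below (the variable names are ours) closely reproduces the reported coefficients; small differences can arise from rounding in the published output:

```python
# Recompute the sample regression line from the 50 (monthly energy, peak hour)
# observations listed in the data table.
energy = [679, 292, 1012, 493, 582, 1156, 997, 2189, 1097, 2078,
          1818, 1700, 747, 2030, 1643, 414, 354, 1276, 745, 795,
          540, 874, 1543, 1029, 710, 1434, 837, 1748, 1381, 1428,
          1255, 1777, 370, 2316, 1130, 463, 770, 724, 808, 790,
          783, 406, 1242, 658, 1746, 895, 1114, 413, 1787, 3560]
peak = [0.79, 0.44, 0.56, 0.79, 2.70, 3.64, 4.73, 9.50, 5.34, 6.85,
        5.84, 5.21, 3.25, 4.43, 3.16, 0.50, 0.17, 1.88, 0.77, 3.70,
        0.56, 1.56, 5.28, 0.64, 4.00, 0.31, 4.20, 4.88, 3.48, 7.58,
        2.63, 4.99, 0.59, 8.19, 4.79, 0.51, 1.74, 4.10, 3.94, 0.96,
        3.29, 0.44, 3.24, 2.14, 5.71, 4.12, 1.90, 0.51, 8.33, 14.94]
n = len(energy)
xbar = sum(energy) / n                      # 1132.56, matching the printout
ybar = sum(peak) / n
sxx = sum((x - xbar) ** 2 for x in energy)  # about 20,104,980
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(energy, peak))
b1 = sxy / sxx                              # about 0.00385
b0 = ybar - b1 * xbar                       # about -0.88
print(f"yhat = {b1:.5f}x {b0:+.4f}")
```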
The Standard Error of Estimate
s_ɛ = 1.55886
The smallest value that s_ɛ can assume is 0, which occurs when SSE = 0, that is, when all the points fall on the regression line. Thus, when s_ɛ is small, the fit is excellent, and the linear model is likely to be an effective analytical and forecasting tool. If s_ɛ is large, the model is a poor one, and the statistics practitioner should improve or discard it.
We judge the value of s_ɛ by comparing it to the values of the dependent variable y, or more specifically to the sample mean ȳ. In this example, because s_ɛ = 1.55886 and ȳ = 3.4658, it does appear that the standard error of estimate is small. However, because there is no predefined upper limit on s_ɛ, it is often difficult to assess the model in this way. In general, the standard error of estimate cannot be used as an absolute measure of the model's validity.
Nonetheless, s_ɛ is useful in comparing models. If the statistics practitioner has several models from which to choose, the one with the smallest value of s_ɛ should generally be the one used. s_ɛ is also an important statistic in other procedures associated with regression analysis.
Test to Determine Whether There Is Enough Evidence to Infer that There Is a Linear Relationship between the Peak Hour Demand and the Total Monthly Energy
Testing the Slope Coefficient, Using a 5% Significance Level
We test the hypotheses

H₀: β₁ = 0
H₁: β₁ ≠ 0

If the null hypothesis is true, no linear relationship exists. If the alternative hypothesis is true, some linear relationship exists.
Degrees of freedom = 48
t critical value = 2.0106
b₁ = 0.00385; SSX = 20104980.32; s_b₁ = 0.000347

t = (0.00385 − 0) / 0.000347 = 11.095

The rejection region is t < −2.0106 or t > 2.0106.
The value of the test statistic is t = 11.095, with a p-value of approximately 0. There is overwhelming evidence to infer that a linear relationship exists. What this means is that the total monthly energy may affect the peak hour demand.
As was the case when we interpreted the y-intercept, the conclusion we draw is valid only over the range of the values of the independent variable. That is, we can infer that there is a relationship between total monthly energy and peak hour demand for customers whose total monthly energy lies between 292 and 3560 kwh (the minimum and maximum values of x in the sample). Because we have no observations outside this range, we do not know how, or even whether, the two variables are related outside the range.
Notice that the printout includes a test for β₀. However, as we pointed out before, interpreting the value of the y-intercept can lead to erroneous, if not ridiculous, conclusions. Consequently, we generally ignore the test of β₀.
We can also acquire information about the relationship by estimating the slope coefficient. In this case, the 95% confidence interval estimate (using the t distribution with 48 degrees of freedom) is

b₁ ± t_{α/2} s_b₁ = 0.00385 ± 2.0106 (0.000347)
= 0.00385 ± 0.000697

We estimate that the slope coefficient lies between 0.003153 and 0.004547.
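The slope inference above can be checked numerically from the reported summary statistics. In this sketch (our own check, not part of the original output), s_b₁ is recomputed from s_ɛ and SSX; small differences from the rounded printout values 0.000347 and 11.095 are expected:

```python
import math

# Check the slope inference using the reported summary values:
# s_b1 = s / sqrt(SSX), t = b1 / s_b1, CI = b1 ± t_crit * s_b1
s = 1.558861          # reported standard error of the estimate
ssx = 20104980.32     # reported sum of squared deviations of x
b1 = 0.00385          # reported slope estimate
t_crit = 2.0106       # t_{0.025} with 48 degrees of freedom

s_b1 = s / math.sqrt(ssx)                        # about 0.00035
t = b1 / s_b1                                    # about 11.1
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)    # about (0.00315, 0.00455)
print(s_b1, t, ci)
```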
Measuring the strength of the linear relationship between total monthly energy and peak hour demand
Coefficient of determination
R² = 0.7187
We found that R² is equal to 0.7187. This statistic means that 71.87% of the variation in the peak hour demand is explained by the variation in the total monthly energy. The remaining 28.13% is unexplained. Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R², the better the model fits the data. From the t-test of β₁, we already know that there is evidence of a linear relationship. The coefficient of determination merely supplies us with a measure of the strength of that relationship. Furthermore, when we improve the model, the value of R² increases.
Are the Peak Hour Demand and the Total Monthly Energy Linearly Related? Testing the Coefficient of Correlation
The hypotheses to be tested are

H₀: ρ = 0
H₁: ρ ≠ 0

Sample coefficient of correlation: r = 0.8477
The value of the test statistic is

t = r √((n − 2) / (1 − r²)) = 11.095
The t-test of β₁ and the t-test of ρ produced identical results. Both tests (testing the slope coefficient and testing the coefficient of correlation) are conducted to determine whether there is evidence of a linear relationship. The decision about which test to use is based on the type of experiment and the information we seek from the statistical analysis. If we're interested in discovering the relationship between two variables, or if we've conducted an experiment where we controlled the values of the independent variable, the t-test of β₁ should be applied. If we are interested only in determining whether two random variables that are bivariate normally distributed are linearly related, the t-test of ρ should be applied.
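The correlation test statistic can be computed directly with t = r √((n − 2)/(1 − r²)). A quick check of ours, using the rounded r = 0.8477 (so the result differs slightly from the reported 11.095):

```python
import math

# t test for the coefficient of correlation:
# t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
r = 0.8477   # reported sample correlation (rounded)
n = 50
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # about 11.07 with the rounded r
```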
Predicting the peak hour demand and estimating the mean peak hour demand
To estimate the mean peak hour demand when the total monthly energy is 1500 kwh, we need to calculate the confidence interval estimator of the expected value. The point prediction is

ŷ = 0.00385x − 0.8828 = 0.00385(1500) − 0.8828 = 4.8922
The point prediction does not provide any information about how closely the value will match the true peak hour demand. To discover that information, we use an interval. In fact, we can use one of two intervals which is the prediction interval of a particular value of y or the confidence interval estimator of the expected value of y.
| Confidence Interval Estimate | |
|---|---|
| Data | |
| X Value | 1500 |
| Confidence Level | 95% |
| Intermediate Calculations | |
| Sample Size | 50 |
| Degrees of Freedom | 48 |
| t Value | 2.010635 |
| Sample Mean | 1132.56 |
| Sum of Squared Differences | 20104980 |
| Standard Error of the Estimate | 1.558861 |
| h Statistic | 0.026715 |
| Predicted Y (YHat) | 4.892789 |
| For Average Y | |
| Interval Half Width | 0.512296 |
| Confidence Interval Lower Limit | 4.380493 |
| Confidence Interval Upper Limit | 5.405085 |
| For Individual Response Y | |
| Interval Half Width | 3.175891 |
| Prediction Interval Lower Limit | 1.716898 |
| Prediction Interval Upper Limit | 8.06868 |
The lower and upper limits of the confidence interval estimate of the expected value are 4.380493 and 5.405085. The lower and upper limits of the prediction interval are 1.716898 and 8.06868.
We predict that the peak hour demand of an individual customer with a total monthly energy of 1500 kwh will fall between 1.716898 and 8.06868 kw. The mean peak hour demand of the population of such customers is estimated to lie between 4.380493 and 5.405085 kw. Because predicting the peak hour demand of one customer is more difficult than estimating the mean peak hour demand of all similar customers, the prediction interval is wider than the interval estimate of the expected value.
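These interval estimates can be reproduced from the reported summary values. A sketch of ours, using the printout's n, x̄, SSX, s_ɛ, and t value:

```python
import math

# Interval estimates at x0 = 1500 kwh, using the reported summary values:
# yhat ± t_crit * s * sqrt(h)       (confidence interval for the mean of y)
# yhat ± t_crit * s * sqrt(1 + h)   (prediction interval for an individual y)
# where h = 1/n + (x0 - xbar)^2 / SSX.
n, xbar, ssx = 50, 1132.56, 20104980.32
s, t_crit = 1.558861, 2.010635
b0, b1 = -0.8828, 0.00385
x0 = 1500

yhat = b0 + b1 * x0
h = 1 / n + (x0 - xbar) ** 2 / ssx
ci_half = t_crit * s * math.sqrt(h)      # half width for the mean of y
pi_half = t_crit * s * math.sqrt(1 + h)  # half width for an individual y
print(round(yhat, 4), round(ci_half, 4), round(pi_half, 4))
# -> 4.8922 0.5123 3.1759
```

As in the printout, the prediction interval's half width is much larger than the confidence interval's, because it must also account for the variation of an individual observation around the line.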
CONCLUSION
- In this case we showed that the odometer reading is linearly related to the auction price. Although it seems reasonable to conclude that decreasing the odometer reading would cause the auction price to rise, the conclusion may not be entirely true. We can infer that there is a relationship between odometer reading and auction price for three-year-old Ford Tauruses whose odometer readings lie between 19.1 (thousand) and 49.2 (thousand) miles (the minimum and maximum values of x in the sample). Because we have no observations outside this range, we do not know how, or even whether, the two variables are related outside the range. It is theoretically possible that the price is determined by the overall condition of the car and that the condition generally worsens when the car is driven longer. Another analysis would be needed to establish the veracity of this conclusion.
- A scatter plot and the line of best fit show the relationship between price and mileage
- If we're interested in discovering the relationship between two variables, or if we've conducted an experiment where we controlled the values of the independent variable, the t-test of β₁ should be applied. If we are interested only in determining whether two random variables that are bivariate normally distributed are linearly related, the t-test of ρ should be applied.
- If the model fits satisfactorily, we can use it to forecast and estimate the values of the dependent variable.
- The point prediction does not provide any information about how closely the value will match the true selling price.
- Observational data can be analyzed in another way. When the data are observational, both variables are random variables. We don't need to specify that one is independent and the other is dependent. We can simply determine whether the two variables are related.
REFERENCE
1. Levine, Stephan, Krehbiel, and Berenson. 2008. Statistics for Managers, 5th edition. Prentice Hall.
2. Keller. 2005. Statistics for Management and Economics, 7th edition. Thomson.
Editor: Felix Yuwono, Natalyna Kosasih, and Inggria Lestari