TOTAL MONTHLY ENERGY AND PEAK HOUR DEMAND (Using a Regression Model to Show that a Statistical Relationship Exists)

INTRODUCTION

An electric utility wants to understand the relationship between how much energy its customers use in a month and how much power they demand at the hour of peak system load. Peak hour demand matters because generating capacity must be sized for the peak load, not the average. To examine this issue, the utility randomly selected 50 customers and, for each one, recorded the total energy used during the month (in kilowatt-hours) and the demand at the peak hour (in kilowatts). These data are listed below. The analyst wants to find the regression model that shows how the total monthly energy (independent variable) is related to the peak hour demand (dependent variable).

Objectives:

  1. The sample regression model
  2. The standard error of estimate
  3. Test to determine whether there is enough evidence to infer that there is a linear relationship between the peak hour demand and the total monthly energy for all customers
  4. Measuring the strength of the linear relationship between total monthly energy and peak hour demand
  5. Are the peak hour demand and the total monthly energy linearly related? Testing the coefficient of correlation
  6. Predicting the peak hour demand and estimating the mean peak hour demand

DATA AND PRESENTATION

Customer  Monthly Energy (kWh)  Peak Hour Demand (kW)  Customer  Monthly Energy (kWh)  Peak Hour Demand (kW)
1 679 0.79 29 1381 3.48
2 292 0.44 30 1428 7.58
3 1012 0.56 31 1255 2.63
4 493 0.79 32 1777 4.99
5 582 2.70 33 370 0.59
6 1156 3.64 34 2316 8.19
7 997 4.73 35 1130 4.79
8 2189 9.50 36 463 0.51
9 1097 5.34 37 770 1.74
10 2078 6.85 38 724 4.10
11 1818 5.84 39 808 3.94
12 1700 5.21 40 790 0.96
13 747 3.25 41 783 3.29
14 2030 4.43 42 406 0.44
15 1643 3.16 43 1242 3.24
16 414 0.50 44 658 2.14
17 354 0.17 45 1746 5.71
18 1276 1.88 46 895 4.12
19 745 0.77 47 1114 1.90
20 795 3.70 48 413 0.51
21 540 0.56 49 1787 8.33
22 874 1.56 50 3560 14.94
23 1543 5.28
24 1029 0.64
25 710 4.00
26 1434 0.31
27 837 4.20
28 1748 4.88

IDENTIFY

The problem objective is to analyze the relationship between two interval variables. Because we believe that the total monthly energy affects the peak hour demand, we identify the former as the independent variable (Total Monthly Energy), which we label x, and the latter as the dependent variable (Peak Hour Demand), which we label y.

The total monthly energy and the peak hour demand are random variables. We hypothesize that for each possible total monthly energy, there is a theoretical population of peak hour demands that is normally distributed with a mean that is a linear function of the total monthly energy and a variance that is constant.

THEORY APPLIED

Simple Linear Regression

Regression analysis enables us to develop a model to predict the values of a numerical variable based on the values of other variables. In regression analysis, the variable we wish to predict is called the dependent variable. The variables used to make the prediction are called independent variables. In addition to predicting values of the dependent variable, regression analysis also allows us to identify the type of mathematical relationship that exists between a dependent and an independent variable, to quantify the effect that changes in the independent variable have on the dependent variable, and to identify unusual observations. Regression analysis lets you use data to explain and predict.

In simple linear regression, a single numerical independent variable, X, is used to predict the numerical dependent variable, Y. The model describes the influence of the independent variable (X) on the dependent variable (Y) through an equation, with the coefficients estimated by the least squares method. The simple linear regression model is:

Yi = β0 + β1Xi + εi

Where:

Yi = dependent variable (sometimes referred to as the response variable) for observation i

β0 = Y intercept for the population

β1 = regression coefficient or slope for the population

Xi = independent variable (sometimes referred to as the explanatory variable) for observation i

εi = random error term in Y for observation i

The slope of the line, β1, represents the expected change in Y per unit change in X: the mean amount that Y changes (either positively or negatively) for a one-unit change in X. The Y intercept, β0, represents the mean value of Y when X equals 0. The last component of the model, εi, is the vertical distance of the actual value of Yi above or below the predicted value of Yi on the line.

Linear Regression Equation

The predicted value of Y equals the Y intercept plus the slope times the value of X:

Ŷi = b0 + b1Xi

Where:

Ŷi = estimated (or predicted) Y value for observation i

b0 = estimate of the regression intercept

b1 = estimate of the regression slope

Xi = value of X for observation i
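The estimates b0 and b1 come from the least-squares formulas b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1X̄. A minimal sketch in Python, using a small hypothetical dataset (the numbers are illustrative only, not from this report):

```python
# Least-squares estimates for simple linear regression:
# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  b0 = ybar - b1*xbar
x = [1.0, 2.0, 3.0]          # hypothetical independent variable
y = [2.0, 3.0, 5.0]          # hypothetical dependent variable

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

b1 = sxy / sxx               # estimated slope
b0 = ybar - b1 * xbar        # estimated intercept

print(b1, b0)                # slope 1.5, intercept ~0.3333
```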

Measures of Variation

When using the least-squares method to determine the regression coefficients for a set of data, you need to compute three important measures of variation. The first measure, the total sum of squares (SST), is a measure of the variation of the Yi values around their mean, Ȳ. In a regression analysis, the total variation, or total sum of squares, is subdivided into explained variation and unexplained variation. The explained variation, or regression sum of squares (SSR), is due to the relationship between X and Y, and the unexplained variation, or error sum of squares (SSE), is due to factors other than the relationship between X and Y.

Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares

SST = Σ(Yi − Ȳ)²

Regression Sum of Squares

SSR = Σ(Ŷi − Ȳ)²

Error Sum of Squares

SSE = Σ(Yi − Ŷi)²

Where:

Ȳ = mean value of the dependent variable

Yi = Observed values of the dependent variable

Ŷi = predicted value of Y for the given Xi value

Coefficient of Determination, r²

By themselves, SSR, SSE, and SST provide little information. However, the ratio of the regression sum of squares (SSR) to the total sum of squares (SST) measures the proportion of variation in Y that is explained by the independent variable X in the regression model. This ratio is called the coefficient of determination. The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

The coefficient of determination is also called r-squared and is denoted r²:

r² = SSR / SST

Equivalently, r² = 1 − SSE / SST

A larger value of r² indicates a stronger linear relationship between the two variables, because the use of the regression model has reduced the variability in predicting the dependent variable.
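These quantities are easy to verify numerically. A short sketch, again with a small hypothetical dataset, that fits the least-squares line, decomposes the variation, and computes r²:

```python
# Decompose total variation: SST = SSR + SSE, and r^2 = SSR / SST
x = [1.0, 2.0, 3.0]          # hypothetical data
y = [2.0, 3.0, 5.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]                      # predicted values

sst = sum((yi - ybar) ** 2 for yi in y)                # total sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # error sum of squares

r2 = ssr / sst
print(sst, ssr + sse, r2)    # SST equals SSR + SSE
```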

Standard Error of the Estimate

Although the least-squares method results in the line that fits the data with the minimum amount of error, unless all the observed data points fall on a straight line, the prediction line is not a perfect predictor. Just as all data values cannot be expected to be exactly equal to their mean, neither can they be expected to fall exactly on the prediction line. An important statistic, called the standard error of the estimate, measures the variability of the actual Y values from the predicted Y values in the same way that the standard deviation measures the variability of each value around the sample mean. In other words, the standard error of the estimate is the standard deviation around the prediction line, whereas the standard deviation is the standard deviation around the sample mean.

sɛ = √(SSE / (n − 2))

where

Yi = actual value of Y for a given Xi

Ŷi = predicted value of Y for a given Xi

SSE = error sum of squares

n = number of observations
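As a numerical sketch (hypothetical toy data again, self-contained), the standard error of the estimate follows directly from SSE:

```python
import math

# Standard error of the estimate: s_e = sqrt(SSE / (n - 2))
x = [1.0, 2.0, 3.0]          # hypothetical data
y = [2.0, 3.0, 5.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))   # standard deviation around the prediction line
print(s_e)
```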

Inferences About the Slope: t Test

– t test for a population slope: Is there a linear relationship between X and Y?

– Null and alternative hypotheses:

H0: β1 = 0 (no linear relationship)

H1: β1 ≠ 0 (a linear relationship exists)

If we reject the null hypothesis, we conclude that there is evidence of a linear relationship.

– Test statistic

t = (b1 − β1) / sb1

sb1 = sɛ / √SSX

Where:

b1 = regression slope coefficient

β1 = hypothesized slope

sb1 = standard error of the slope

Confidence Interval Estimate for the Slope

The confidence interval estimate for the slope can be constructed by taking the sample slope, b1, and adding and subtracting the critical t value multiplied by the standard error of the slope.

b1 ± t(α/2; ν = n − 2) · sb1

Where :

t(α/2; ν = n − 2) = critical value from the t table

b1 = regression coefficient

sb1 = standard error of b1 = sɛ / √SSX
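As a numerical sketch, the t statistic and confidence interval for the slope can be computed from summary statistics; the values below are the ones reported later in this analysis (b1, sɛ, SSX, and the critical t for 48 degrees of freedom):

```python
import math

# t test and 95% confidence interval for the slope,
# using the summary statistics from this report
b1 = 0.00385          # estimated slope
s_e = 1.558861        # standard error of the estimate
ssx = 20104980.32     # sum of squared deviations of x
t_crit = 2.0106       # critical t value, alpha = 0.05, df = 48

s_b1 = s_e / math.sqrt(ssx)      # standard error of the slope
t_stat = (b1 - 0) / s_b1         # test statistic under H0: beta1 = 0

half_width = t_crit * s_b1
ci = (b1 - half_width, b1 + half_width)
print(round(t_stat, 2), round(s_b1, 6), ci)
```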

ANALYSIS

Scatter Diagram

[Scatter diagram: peak hour demand (kW) plotted against total monthly energy (kWh), showing a positive linear trend]

The Sample Regression Line

ŷ = b0 + b1x

ŷ = 0.00385x − 0.8828

The slope coefficient b1 is 0.00385, which means that for each additional kWh of total monthly energy, the peak hour demand increases by an average of 0.00385 kW.

The intercept is b0 = −0.8828. Technically, the intercept is the point at which the regression line and the y-axis intersect, meaning that when x = 0, the predicted peak hour demand is −0.8828 kW. However, in this case, the intercept is probably meaningless. Because our sample did not include any customers with zero kWh of monthly energy, we have no basis for interpreting b0. As previously stated, we cannot determine the value of ŷ for a value of x that is far outside the range of the sample values of x. In this case, the smallest and largest values of x are 292 and 3560, respectively. Because x = 0 is not in this interval, we cannot safely interpret the value of ŷ when x = 0.

It is important to bear in mind that the interpretation of the coefficients pertains only to the sample, which consists of 50 observations. To infer information about the population, we need statistical inference techniques, which are described subsequently.
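As a check on the reported line, the least-squares fit can be reproduced directly from the 50 observations in the data table above (a sketch; small rounding differences from the reported coefficients are expected):

```python
# Fit the least-squares line to the 50 (monthly energy, peak hour demand) pairs
x = [679, 292, 1012, 493, 582, 1156, 997, 2189, 1097, 2078,
     1818, 1700, 747, 2030, 1643, 414, 354, 1276, 745, 795,
     540, 874, 1543, 1029, 710, 1434, 837, 1748, 1381, 1428,
     1255, 1777, 370, 2316, 1130, 463, 770, 724, 808, 790,
     783, 406, 1242, 658, 1746, 895, 1114, 413, 1787, 3560]
y = [0.79, 0.44, 0.56, 0.79, 2.70, 3.64, 4.73, 9.50, 5.34, 6.85,
     5.84, 5.21, 3.25, 4.43, 3.16, 0.50, 0.17, 1.88, 0.77, 3.70,
     0.56, 1.56, 5.28, 0.64, 4.00, 0.31, 4.20, 4.88, 3.48, 7.58,
     2.63, 4.99, 0.59, 8.19, 4.79, 0.51, 1.74, 4.10, 3.94, 0.96,
     3.29, 0.44, 3.24, 2.14, 5.71, 4.12, 1.90, 0.51, 8.33, 14.94]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
print(round(b1, 5), round(b0, 4))   # close to the reported 0.00385 and -0.8828
```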

The Standard Error of Estimate

sɛ = √(SSE / (n − 2))

sɛ = 1.55886

The smallest value that sɛ can assume is 0, which occurs when SSE = 0, that is, when all the points fall on the regression line. Thus, when sɛ is small, the fit is excellent, and the linear model is likely to be an effective analytical and forecasting tool. If sɛ is large, the model is a poor one, and the statistics practitioner should improve it or discard it.

We judge the value of sɛ by comparing it to the values of the dependent variable y, or more specifically to the sample mean ȳ. In this example, because sɛ = 1.55886 and ȳ = 3.4658, it does appear that the standard error of estimate is small. However, because there is no predefined upper limit on sɛ, it is often difficult to assess the model in this way. In general, the standard error of estimate cannot be used as an absolute measure of the model's validity.

Nonetheless, sɛ is useful in comparing models. If the statistics practitioner has several models from which to choose, the one with the smallest value of sɛ should generally be the one used. And sɛ is also an important statistic in other procedures associated with regression analysis.

Test to Determine whether There is Enough Evidence to Infer that There is a Linear Relationship between the Peak Hour Demand and the Total Monthly Energy for All Customers

Testing the Slope Coefficient, Using a 5% Significance Level

We test the hypotheses

H0: β1 = 0

H1: β1 ≠ 0

If the null hypothesis is true, no linear relationship exists. If the alternative hypothesis is true, some linear relationship exists.

t = (b1 − β1) / sb1

Degrees of freedom = 48

Critical t value = 2.0106

b1 = 0.00385; SSX = 20104980.32; sb1 = 0.000347

t = (0.00385 − 0) / 0.000347 = 11.095

The rejection region is

t < −2.0106 or t > 2.0106

The value of the test statistic is t = 11.095, with a p-value of approximately 0. There is overwhelming evidence to infer that a linear relationship exists. What this means is that the total monthly energy may affect the peak hour demand.

As was the case when we interpreted the y-intercept, the conclusion we draw is valid only over the range of the values of the independent variable. That is, we can infer that there is a relationship between total monthly energy and peak hour demand for customers whose total monthly energy lies between 292 and 3560 kWh (the minimum and maximum values of x in the sample). Because we have no observations outside this range, we do not know how, or even whether, the two variables are related outside the range.

Notice that the printout includes a test for β0. However, as we pointed out before, interpreting the value of the y-intercept can lead to erroneous, if not ridiculous, conclusions. Consequently, we generally ignore the test of β0.

We can also acquire information about the relationship by estimating the slope coefficient. In this case, the 95% confidence interval estimate (using the critical value tα/2 with ν = 48 degrees of freedom) is

b1 ± tα/2 · sb1 = 0.00385 ± 2.0106 × 0.000347

= 0.00385 ± 0.000697

We estimate that the slope coefficient lies between 0.003153 and 0.004547.

Measuring the strength of the linear relationship between total monthly energy and peak hour demand

Coefficient of determination

r² = 0.7187

We found that r² is equal to 0.7187. This statistic means that 71.87% of the variation in the peak hour demand is explained by the variation in the total monthly energy. The remaining 28.13% is unexplained. Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of r², the better the model fits the data. From the t test of β1, we already know that there is evidence of a linear relationship. The coefficient of determination merely supplies us with a measure of the strength of that relationship. Furthermore, when we improve the model, the value of r² increases.

Are the Peak Hour Demand and the Total Monthly Energy Linearly Related? Testing the Coefficient of Correlation

The hypotheses to be tested are

H0: ρ = 0

H1: ρ ≠ 0

Sample coefficient of correlation

r = 0.8477

The value of the test statistic is

t = 11.095

The t test of ρ and the t test of β1 produced identical results. Both tests (testing the slope coefficient and testing the coefficient of correlation) are conducted to determine whether there is evidence of a linear relationship. The decision about which test to use is based on the type of experiment and the information we seek from the statistical analysis. If we are interested in discovering the relationship between two variables, or if we have conducted an experiment where we controlled the values of the independent variable, the t test of β1 should be applied. If we are interested only in determining whether two random variables that are bivariate normally distributed are linearly related, the t test of ρ should be applied.
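The equivalence of the two tests can be checked numerically: the t statistic for the correlation coefficient, t = r·√((n − 2)/(1 − r²)), reproduces (up to the rounding of r) the t statistic found when testing the slope. A sketch using the values reported above:

```python
import math

# t test for the coefficient of correlation: t = r * sqrt((n - 2) / (1 - r^2))
r = 0.8477    # sample coefficient of correlation (from this report)
n = 50        # sample size

t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t_stat, 2))   # matches the slope test statistic up to rounding
```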

Predicting the peak hour demand and estimating the mean peak hour demand

To predict the mean peak hour demand when the total monthly energy is 1500 kWh, we need to calculate the confidence interval estimator of the expected value.

ŷ = 0.00385x − 0.8828 = 0.00385 × 1500 − 0.8828

ŷ = 4.8922

The point prediction does not provide any information about how closely the value will match the true peak hour demand. To discover that information, we use an interval. In fact, we can use one of two intervals: the prediction interval for a particular value of y, or the confidence interval estimate of the expected value of y.
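The interval limits in the output below can be reproduced from the summary statistics. A sketch using the values reported in this analysis (n = 50, x̄ = 1132.56, SSX = 20104980, sɛ = 1.558861, t critical value 2.010635):

```python
import math

# Confidence interval for the mean response and prediction interval for an
# individual response at x0 = 1500, from this report's summary statistics
n = 50
xbar = 1132.56
ssx = 20104980.0
s_e = 1.558861
t_crit = 2.010635
b0, b1 = -0.8828, 0.00385

x0 = 1500
yhat = b0 + b1 * x0
h = 1 / n + (x0 - xbar) ** 2 / ssx           # the "h statistic"

ci_half = t_crit * s_e * math.sqrt(h)        # half width for the mean of y
pi_half = t_crit * s_e * math.sqrt(1 + h)    # half width for an individual y

print(round(yhat, 4), round(h, 6), round(ci_half, 4), round(pi_half, 4))
```

The prediction interval uses √(1 + h) rather than √h, which is why it is wider than the confidence interval for the mean.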

Confidence Interval Estimate

Data
X Value                          1500
Confidence Level                 95%

Intermediate Calculations
Sample Size                      50
Degrees of Freedom               48
t Value                          2.010635
Sample Mean                      1132.56
Sum of Squared Differences       20104980
Standard Error of the Estimate   1.558861
h Statistic                      0.026715
Predicted Y (ŷ)                  4.892789

For Average Y
Interval Half Width              0.512296
Confidence Interval Lower Limit  4.380493
Confidence Interval Upper Limit  5.405085

For Individual Response Y
Interval Half Width              3.175891
Prediction Interval Lower Limit  1.716898
Prediction Interval Upper Limit  8.068680

The lower and upper limits of the confidence interval estimate of the expected value are 4.380493 and 5.405085. The lower and upper limits of the prediction interval are 1.716898 and 8.068680.

We predict that a single customer's peak hour demand will fall between 1.716898 and 8.068680 kW. The average peak hour demand of the population of similar customers is estimated to lie between 4.380493 and 5.405085 kW. Because predicting the peak hour demand of one customer is more difficult than estimating the mean peak hour demand of all similar customers, the prediction interval is wider than the interval estimate of the expected value.

CONCLUSION

  • In this case we showed that the total monthly energy is linearly related to the peak hour demand. Although it seems reasonable to conclude that using more energy in a month causes higher peak hour demand, the conclusion may not be entirely true. We can infer that there is a relationship between total monthly energy and peak hour demand for customers whose total monthly energy lies between 292 and 3560 kWh (the minimum and maximum values of x in the sample). Because we have no observations outside this range, we do not know how, or even whether, the two variables are related outside the range. It is also theoretically possible that both variables are driven by some other customer characteristic, so another analysis would be needed to establish a causal conclusion.
  • A scatter plot and the line of best fit show the relationship between the peak hour demand and the total monthly energy.
  • If we are interested in discovering the relationship between two variables, or if we have conducted an experiment where we controlled the values of the independent variable, the t test of β1 should be applied. If we are interested only in determining whether two random variables that are bivariate normally distributed are linearly related, the t test of ρ should be applied.
  • If the model fits satisfactorily, we can use it to forecast and estimate the values of the dependent variable.
  • The point prediction does not provide any information about how closely the value will match the true peak hour demand.
  • Observational data can be analyzed in another way. When the data are observational, both variables are random variables. We do not need to specify that one is independent and the other dependent; we can simply determine whether the two variables are related.
REFERENCES

1. Levine, Stephan, Krehbiel, and Berenson. 2008. Statistics for Managers, 5th edition. Prentice Hall.

2. Keller. 2005. Statistics for Management and Economics, 7th edition. Thomson.

 

Editors: Felix Yuwono, Natalyna Kosasih, and Inggria Lestari
