Sample regression coefficient of Y on X. The sample regression equation

Two random variables can be related by a functional dependence, by a statistical dependence, or not at all. A strictly functional dependence is rarely realized, since one or both quantities are also subject to the influence of random factors. Moreover, among these factors there may be some common to both quantities, i.e. affecting both random variables. In such cases a statistical dependence arises.

A dependence is called statistical if a change in one of the quantities entails a change in the distribution of the other. In particular, if a change in one of the quantities causes a change in the average value of the other, the statistical dependence is called a correlation. Examples are the relationship between the amount of fertilizer and the harvest, or between invested funds and profit.

The arithmetic mean of the observed values of the random variable Y corresponding to the value X = x is called the conditional mean ȳ_x and is a point estimate of the conditional mathematical expectation M(Y | x). The conditional mean x̄_y is defined similarly.

The conditional mathematical expectation M(Y | x) is a function of x; therefore its estimate, the conditional mean ȳ_x, is also a function of x:

ȳ_x = f*(x).

This equation is called the sample regression equation of Y on X. The function f*(x) is called the sample regression function, and its graph the sample regression line of Y on X. Similarly, in the equation

x̄_y = φ*(y),

the function φ*(y) and its graph are called the sample regression equation, sample regression function, and sample regression line of X on Y.

Finding the parameters of the functions f*(x) and φ*(y) when their form is known, and assessing the closeness of the relationship between the quantities X and Y, is the problem of correlation analysis. The task of regression analysis is to estimate the parameters of the regression function β_i and the residual variance σ²_resid.

The residual variance is the part of the dispersion of Y that cannot be explained by the action of X. σ²_resid can serve to assess the accuracy of the chosen regression function and the completeness of the set of features included in the analysis. The form of the dependence g(x) is chosen based on the appearance of the correlation field and the nature of the process under study.



The estimate of the linear regression coefficient β is the sample regression coefficient of Y on X, r_yx. The values of the parameter r_yx and of the parameter b of the straight-line regression equation

y = r_yx · x + b

are selected so that the points (x₁, y₁), (x₂, y₂), …, (x_n, y_n) constructed from the observational data lie on the xOy plane as close as possible to the straight regression line. This is equivalent to requiring that the sum of squared deviations of Y(x_i) from y_i be minimal, which is the essence of the least squares method (OLS).
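This least-squares criterion can be sketched in a few lines of Python; the data values are illustrative, not taken from the text:

```python
# Least-squares (OLS) fit of the line y = a*x + b: choose a and b so that
# the sum of squared deviations sum((y_i - (a*x_i + b))**2) is minimal.
def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx        # slope
    b = my - a * mx      # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = ols_fit(xs, ys)
```

The closed-form expressions used here (slope = centered cross-sum over centered square-sum, line passing through the point of means) are exactly what minimizing the sum of squares yields.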

The sample equation of the straight-line regression of Y on X can be written as follows:

ȳ_x − ȳ = r_в (s_y / s_x)(x − x̄),

where s_x and s_y are the sample standard deviations of X and Y, and

r_в = (Σ n_xy · x · y − n · x̄ · ȳ) / (n · s_x · s_y)

is the sample correlation coefficient calculated from the grouped data. Here n_xy is the frequency of the variant pair (x, y). The sample equation of the straight-line regression of X on Y is found similarly:

x̄_y − x̄ = r_в (s_x / s_y)(y − ȳ).
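The formulas above can be checked with a short Python sketch for ungrouped data (every pair occurring once, so n_xy = 1; the numbers are illustrative):

```python
import math

def sample_stats(xs, ys):
    """Sample means, standard deviations, and correlation coefficient r_в."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    r = (mean_xy - mx * my) / (sx * sy)
    return mx, my, sx, sy, r

xs = [10, 20, 30, 40, 50]
ys = [12, 15, 14, 19, 22]
mx, my, sx, sy, r = sample_stats(xs, ys)
slope_yx = r * sy / sx   # slope of the regression of Y on X
slope_xy = r * sx / sy   # slope of the regression of X on Y
```

Note that the product of the two slopes equals r², which is a quick consistency check on any hand computation.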

In order to establish whether the mathematical model of the relationship between Y and X found in the sample corresponds to statistical data, one should evaluate the significance of the regression coefficients and the significance of the regression equation.

Testing the significance of the regression coefficient means determining whether the magnitude of its estimate is sufficient to conclude that the coefficient differs from zero. The null hypothesis H₀: β = 0 is put forward and tested using a statistic distributed according to Student's law:

t = |b / s_b|,

where b is the estimate of the regression coefficient and s_b is the estimate of its standard deviation, in other words the standard error of the estimate. If |t| ≥ t_cr(α, k), the null hypothesis that the regression coefficient equals zero is rejected and the coefficient is considered significant. If |t| < t_cr, there are no grounds for rejecting the null hypothesis.
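A sketch of this significance check in Python (the data are illustrative; in practice t_cr(α, k) is taken from a Student table for k = n − 2 degrees of freedom):

```python
import math

def slope_t_statistic(xs, ys):
    """t = |b| / s_b for testing H0: the regression coefficient equals zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    # residual sum of squares, carrying n - 2 degrees of freedom
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_b = math.sqrt(ss_res / (n - 2) / sxx)   # standard error of b
    return abs(b) / s_b

t = slope_t_statistic([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```

For these data t is far above t_cr(0.05, 3) ≈ 3.18, so the slope would be judged significant.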

LABORATORY WORK No. 4

Calculation of the sample correlation coefficient and construction of the empirical and theoretical regression line

Goal of the work: familiarization with linear correlation; developing the ability to calculate the sample correlation coefficient and to compile the equations of the theoretical regression lines.

The content of the work: based on experimental data, calculate the sample correlation coefficient, construct a confidence interval for it with a given reliability, give a meaningful interpretation of the result obtained, and construct the empirical and theoretical regression lines according to the method described above.

Correlation method

The correlation method of mathematical statistics is used to determine the relationship between phenomena. The peculiarity of studying such a relationship is that the influence of extraneous factors cannot be isolated. The correlation method is therefore used to determine, under a complex interplay of extraneous factors, what the relationship between the characteristics would be if the extraneous factors did not change, i.e. if the conditions of the experiment were held constant.

Correlation theory considers two problems:

1) determining the form of the relationship between the characteristics under study;

2) determining the closeness of this relationship.

The nature of the relationship between the characteristics X and Y can be judged from the location of the points in the coordinate system (the correlation field). If these points lie near a straight line, it is assumed that there is a linear relationship between the conditional mean ȳ_x and x. The equation

ȳ_x = ρ_yx x + b

is called the regression line equation of Y on X. Similarly, the equation

x̄_y = ρ_xy y + c

is called the regression line equation of X on Y. If both regression lines are straight, the correlation is linear.

Regression Line Equations

The equations of the regression lines of Y on X and of X on Y are compiled on the basis of the sample data given in the correlation table:

ȳ_x − ȳ = ρ_yx (x − x̄), x̄_y − x̄ = ρ_xy (y − ȳ),

where x̄ and ȳ are the average values of the corresponding characteristics, and ρ_yx and ρ_xy are the regression coefficients of Y on X and of X on Y, calculated using the formulas

ρ_yx = μ_xy / s_x², ρ_xy = μ_xy / s_y²,

where μ_xy is the average value of the product of X and Y minus the product of their averages (the sample correlation moment), and s_x² and s_y² are the variances of the characteristics X and Y.

In straight-line correlation, the closeness of the relationship between the characteristics is characterized by the sample correlation coefficient r_в, which takes values ranging from −1 to +1.

A negative value of the correlation coefficient indicates an inverse linear relationship between the characteristics being studied; a positive value indicates a direct linear relationship. If the correlation coefficient is 0, there is no linear relationship between the characteristics.

The sample correlation coefficient is calculated using the formula:

r_в = ((1/n) Σ n_xy x y − x̄ ȳ) / (s_x s_y),   (1)

where (1/n) Σ n_xy x y is the average value of the products of X and Y; x̄ and ȳ are the average values of the corresponding characteristics; s_x and s_y are the standard deviations found for the characteristics X and Y.

PROCEDURE FOR PERFORMING THE WORK

Statistical data are given on the temperature Y of the lubricating oil of the rear axle of a car depending on the ambient temperature X.

1. CALCULATION OF SAMPLE CORRELATION COEFFICIENT

We summarize these data in a correlation table.

Table 1. Correlation table of the observed values, with marginal frequencies n_y (the frequency of characteristic y) and n_x (the frequency of characteristic x).

Let us find the numerical characteristics of the sample.

1.1. The average values of the characteristics X and Y:

x̄ = 35.8, ȳ = 13.9.

1.2. The sample variances:

D_x = (1/n) Σ n_x x² − x̄² = 1513 − 1281.64 = 231.36; the variance D_y ≈ 25 is found similarly.

1.3. The sample standard deviations:

s_x = √231.36 ≈ 15.2, s_y ≈ 5.0.

1.4. The sample correlation moment:

μ_xy = (1/n) Σ n_xy x y − x̄ ȳ = (1/50)(40 + 120 + 720 + 480 + 200 + 800 + 900 + 4200 + 1120 + 2160 + 4500 + 5280 + 4400 + 1320 + 1560) − 497.62 =

= (1/50)(27800) − 497.62 = 556 − 497.62 = 58.38.

1.5. The sample correlation coefficient:

r_в = μ_xy / (s_x s_y) = 58.38 / (15.2 · 5.0) ≈ 0.77.

2. Let us check the significance of the correlation coefficient; to do this, we compute the statistic

t = r_в √(n − 2) / √(1 − r_в²) = 0.77 · √48 / √(1 − 0.77²) ≈ 8.3.

From the Student's distribution table (Appendix), for the significance level most used in technology, α = 0.05, and the number of degrees of freedom k = n − 2 = 50 − 2 = 48, we find

t_cr ≈ 2.02.

Since t = 8.3 > 2.02, the found correlation coefficient differs significantly from zero. This means that the variables X and Y are related by a linear regression relationship.
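This check can be reproduced in Python with the values used above (r_в = 0.77, n = 50):

```python
import math

def corr_t(r, n):
    """Statistic t = r * sqrt(n - 2) / sqrt(1 - r**2) for testing r = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = corr_t(0.77, 50)
significant = t > 2.02   # t_cr(0.05, 48) from the Student table
```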

Thus, the correlation coefficient shows the close linear relationship that exists between the rear axle lubricating oil temperature and the ambient air temperature.

3. Compiling the empirical linear regression equations of Y on X and of X on Y.

3.1. The empirical linear regression equation of Y on X:

ȳ_x = 13.9 + 0.25(x − 35.8).

3.2. The empirical linear regression equation of X on Y:

x̄_y = 35.8 + 2.34(y − 13.9).

4. CONSTRUCTION OF THE EMPIRICAL REGRESSION LINE OF Y ON X.

To construct the empirical regression line, we draw up Table 2.

Table 2. Conditional averages ȳ_x (the conditional average of the values of characteristic Y given that X takes a particular value) for each observed value of X.

Taking the pairs of numbers (x, ȳ_x) as the coordinates of points, we plot them in a coordinate system and connect them with straight line segments. The resulting broken line is the empirical regression line.

The equation of the theoretical straight-line regression of Y on X is:

ȳ_x − ȳ = r_в (s_y / s_x)(x − x̄),

where x̄ is the sample mean of characteristic X and ȳ is the sample mean of characteristic Y. With the values found above, r_в s_y / s_x = 0.77 · 5.0 / 15.2 ≈ 0.25.

The straight-line regression equation of Y on X will therefore be written as follows:

ȳ_x = 13.9 + 0.25(x − 35.8), or finally ȳ_x = 0.25x + 4.95.

Let us plot both regression lines (Fig. 1).

Fig. 1. Empirical and theoretical regression lines.

5. A meaningful interpretation of the analysis results. There is a close direct linear correlation between the temperature of the lubricating oil of the rear axle of the vehicle and the ambient air temperature (r_в = 0.77). This can be stated with a probability of 0.95.

The equation ȳ_x = 0.25x + 4.95 characterizes how, on average, the temperature of the lubricating oil of the rear axle depends on the ambient temperature. The linear regression coefficient (ρ_yx = 0.25) shows that if the ambient temperature increases by an average of 1 degree, the temperature of the lubricating oil of the rear axle increases by an average of 0.25 degrees.

The equation x̄_y = 35.8 + 2.34(y − 13.9) characterizes how the ambient temperature is related to the oil temperature: for the temperature of the lubricating oil of the rear axle to increase by an average of 1 degree, the ambient air temperature must increase by an average of 2.34 degrees (ρ_xy = 2.34).

1. Distribution of X - cost of fixed production assets (million rubles) and Y - average monthly production per worker

2. The distribution of 200 cylindrical lamp posts by length X (in cm) and by weight Y (in kg) is given in the following table:

3. The distribution of 100 firms by means of production X (in monetary units) and by daily output Y (in tons) is given in the following table:

With a large number of trials, the same value of X can occur n_x times, the same value of Y can occur n_y times, and the same pair of numbers (x; y) can occur n_xy times,

and the sample size is usually large.

Therefore, the observational data are grouped, i.e. n_x, n_y, and n_xy are counted. All grouped data are recorded in a table called a correlation table.

If both regression lines of Y on X and X on Y are straight, then the correlation is linear.

The sample equation of the straight regression line of Y on X has the form:

ȳ_x = ρ_yx · x + b.

The parameters ρ_yx and b, determined by the least squares method, have the form:

ρ_yx = r_в σ_y / σ_x, b = ȳ_в − ρ_yx x̄_в,

where ȳ_x is the conditional average; x̄_в and ȳ_в are the sample averages of the characteristics X and Y; σ_x and σ_y are the sample standard deviations; r_в is the sample correlation coefficient.

The sample equation of the straight-line regression of X on Y has the form:

x̄_y = ρ_xy · y + c, with ρ_xy = r_в σ_x / σ_y and c = x̄_в − ρ_xy ȳ_в.

We assume that observational data on characteristics X and Y are given in the form of a correlation table with equally spaced options.

Then we pass to the conditional variants:

u = (x − C1) / h1, v = (y − C2) / h2,

where C1 is the variant of characteristic X with the highest frequency; C2 is the variant of characteristic Y with the highest frequency; h1 is the step (the difference between two adjacent variants of X); h2 is the step (the difference between two adjacent variants of Y).

Then the sample correlation coefficient is

r_в = (Σ n_uv u v − n ū v̄) / (n s_u s_v).

The quantities ū, v̄, s_u, s_v can be found by the product method or directly from the formulas

ū = (1/n) Σ n_u u, v̄ = (1/n) Σ n_v v, s_u² = (1/n) Σ n_u u² − ū², s_v² = (1/n) Σ n_v v² − v̄².

Knowing these quantities, we find the parameters entering the regression equations from the formulas

σ_x = h1 s_u, σ_y = h2 s_v, x̄_в = h1 ū + C1, ȳ_в = h2 v̄ + C2.
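A short Python illustration of this substitution; the variants and the values C1 = 20, h1 = 5 are assumed for the sketch:

```python
def to_conditional(values, c, h):
    """Conditional variants u_i = (x_i - C) / h for equally spaced variants."""
    return [(x - c) / h for x in values]

xs = [10, 15, 20, 25, 30]        # illustrative variants, step h1 = 5
u = to_conditional(xs, 20, 5)    # C1 = 20 taken as the modal variant
```

Because this is a linear substitution, the correlation coefficient computed from u and v coincides with the one computed from x and y, which is exactly why the passage to conditional variants is admissible.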

TYPICAL TEST WORK FOR SECTION 6

12.1. Random Events

12.1.1. The box contains 6 identical pairs of black gloves and 4 identical pairs of beige gloves. Find the probability that two gloves drawn at random form a pair.

Consider the event A — the two gloves drawn at random form a pair — and the hypotheses: B1 — a pair of black gloves was drawn, B2 — a pair of beige gloves was drawn, B3 — the drawn gloves do not form a single-color pair.

By the multiplication theorem, the probability of hypothesis B1 is the product of the probabilities that the first glove is black and that the second glove is black, i.e.

P(B1) = (12/20) · (11/19) = 33/95.

Similarly, the probability of hypothesis B2 is:

P(B2) = (8/20) · (7/19) = 14/95.

Since the hypotheses B1, B2 and B3 constitute a complete group of events, the probability of hypothesis B3 is:

P(B3) = 1 − 33/95 − 14/95 = 48/95.

According to the total probability formula, we have:

P(A) = P(B1) P_B1(A) + P(B2) P_B2(A) + P(B3) P_B3(A),

where P_B1(A) is the probability that a pair is formed by two black gloves, P_B1(A) = 1; P_B2(A) is the probability that two beige gloves form a pair, P_B2(A) = 1; and, finally, P_B3(A) is the probability that a pair is formed by gloves of different colors, P_B3(A) = 0.

Thus, the probability that two gloves drawn at random form a pair is

P(A) = 33/95 + 14/95 = 47/95 ≈ 0.49.
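The computation can be verified with exact fractions in Python:

```python
from fractions import Fraction as F

# 6 identical black pairs -> 12 black gloves; 4 beige pairs -> 8 beige gloves.
p_b1 = F(12, 20) * F(11, 19)    # B1: both drawn gloves are black
p_b2 = F(8, 20) * F(7, 19)      # B2: both drawn gloves are beige
p_b3 = 1 - p_b1 - p_b2          # B3: the gloves are of different colors
# Identical gloves of one color always form a pair; mixed colors never do:
p_a = p_b1 * 1 + p_b2 * 1 + p_b3 * 0
```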

12.1.2. The urn contains 3 white balls and 5 black balls. 3 balls are drawn at random, one at a time, and after each extraction they are returned to the urn. Find the probability that among the drawn balls there will be:

a) exactly two white balls, b) at least two white balls.

Solution. We have a scheme with replacement, i.e. the composition of the balls does not change from draw to draw: the probability of a white ball on any draw is p = 3/8, of a black ball q = 5/8.

a) Of the three balls drawn, two must be white and one black, and the black ball can be first, second, or third. Applying the theorems of addition and multiplication of probabilities together, we have:

P(A) = 3 · (3/8)² · (5/8) = 135/512 ≈ 0.26.

b) Drawing at least two white balls means that there must be either two or three white balls:

P(B) = 135/512 + (3/8)³ = 162/512 = 81/256 ≈ 0.32.
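Both answers follow from the binomial scheme with p = 3/8 and can be checked in Python:

```python
from fractions import Fraction as F
from math import comb

p = F(3, 8)          # white on any single draw (balls are returned each time)
q = 1 - p            # black

p_two_white = comb(3, 2) * p ** 2 * q     # a) exactly two white
p_at_least_two = p_two_white + p ** 3     # b) two or three white
```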

12.1.3. The urn contains 6 white and 5 black balls. Three balls are drawn at random in succession without returning them to the urn. Find the probability that the third ball in a row will be white.

Solution. If the third ball must be white, then the first two balls can be both white, white and black, black and white, or both black, i.e. there are four groups of incompatible events. Applying the probability multiplication theorem to them, we obtain:

P = (6/11)(5/10)(4/9) + (6/11)(5/10)(5/9) + (5/11)(6/10)(5/9) + (5/11)(4/10)(6/9) = (120 + 150 + 150 + 120)/990 = 540/990 = 6/11 ≈ 0.55.
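The same enumeration of the four cases, carried out exactly in Python (the function is a generic sketch, not part of the original solution):

```python
from fractions import Fraction as F
from itertools import product

def p_third_white(white=6, black=5):
    """Probability that the third ball drawn (without replacement) is white."""
    total = F(0)
    for draws in product((True, False), repeat=2):   # colors of draws 1 and 2
        w, b = white, black
        p = F(1)
        for is_white in draws:
            p *= F(w, w + b) if is_white else F(b, w + b)
            if is_white:
                w -= 1
            else:
                b -= 1
        total += p * F(w, w + b)    # the third draw is white
    return total
```

The result equals 6/11, the share of white balls in the urn: by symmetry, the probability of drawing white is the same on any draw without replacement.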

What is regression?

Consider two continuous variables x = (x₁, x₂, ..., x_n) and y = (y₁, y₂, ..., y_n).

Let us place the points on a two-dimensional scatter plot and say that we have a linear relation if the data are approximated by a straight line.

If we believe that y depends on x, and that changes in y are caused precisely by changes in x, we can determine the regression line (the regression of y on x), which best describes the linear relationship between these two variables.

The statistical use of the word regression comes from the phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of sons is shorter than that of their tall fathers. The average height of sons "regressed" and "moved backward" towards the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still quite tall) sons, and short fathers have taller (but still quite short) sons.

Regression line

A simple (pairwise) linear regression line is estimated by the mathematical equation

Y = a + bx,

where:

  • x is called the independent variable or predictor;
  • Y is the dependent (response) variable: the value we expect for y (on average) if we know the value of x, i.e. the "predicted value of y";
  • a is the free term (intercept) of the estimated line: the value of Y when x = 0 (Fig. 1);
  • b is the slope or gradient of the estimated line: the amount by which Y increases on average if we increase x by one unit;
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases as x increases by one unit).

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is assessed by examining the residuals (the vertical distance of each point from the line, i.e. residual = observed y − predicted y; Fig. 2).

The line of best fit is chosen so that the sum of squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dotted lines) for each point.

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions behind linear regression:

  • the relationship between x and y is linear;
  • the residuals are normally distributed with a mean of zero;
  • the residuals have constant variance.

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, using a logarithmic transformation).
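One of these checks is automatic: for a least-squares line fitted with an intercept, the residuals always sum to zero, so only their shape (normality, constant variance) needs inspection. A Python sketch with illustrative data:

```python
def ols_residuals(xs, ys):
    """Residuals (observed y - predicted y) for the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]

res = ols_residuals([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```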

Anomalous values (outliers) and influential points

An "influential" observation is one whose omission changes one or more of the model parameter estimates (i.e., the slope or intercept).

An outlier (an observation that is inconsistent with the majority of values ​​in a data set) can be an "influential" observation and can be easily detected visually by inspecting a bivariate scatterplot or residual plot.

Both for outliers and for "influential" observations (points), the model is fitted both with and without them, and attention is paid to the change in the estimates (regression coefficients).

When conducting an analysis, you should not automatically discard outliers or influential points, since simply ignoring them can affect the results obtained. Always study the reasons for these outliers and analyze them.

Linear regression hypothesis

When constructing linear regression, the null hypothesis is tested that the general slope of the regression line β is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic equal to the ratio b / SE(b), which follows Student's t distribution with n − 2 degrees of freedom, where the standard error of the coefficient b is

SE(b) = s_res / √(Σ (x_i − x̄)²),

and s²_res is the estimate of the variance of the residuals.

Typically, if the attained significance level is less than 0.05, the null hypothesis is rejected.

A 95% confidence interval for the slope is

b ± t* · SE(b),

where t* is the percentage point of the t distribution with n − 2 degrees of freedom that gives a two-sided probability of 0.05. This is the interval that contains the population slope with a probability of 95%.

For large samples, the t distribution approaches the normal, so t* can be approximated by 1.96 (that is, the test statistic will tend to be normally distributed).
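A sketch of this interval in Python; both the coefficient and its standard error here are illustrative values:

```python
def slope_confidence_interval(b, se_b, t_crit=1.96):
    """b +/- t* * SE(b); t* = 1.96 suits large samples (t ~ normal)."""
    return b - t_crit * se_b, b + t_crit * se_b

lo, hi = slope_confidence_interval(-0.404, 0.10)   # illustrative values
```

Here zero lies outside (lo, hi), so such a slope would be judged significantly different from zero.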

Assessing the quality of linear regression: coefficient of determination R 2

Because of the linear relationship between x and y, we expect y to change as x changes; we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation will be explained by regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression it equals r², the square of the correlation coefficient); it allows a subjective assessment of the quality of the regression equation.

The difference 100% − R² represents the percentage of variance that cannot be explained by the regression.

There is no formal test for evaluating R²; we must rely on subjective judgment to determine the goodness of fit of the regression line.
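R² can be computed directly as the explained share of the total variation; a Python sketch with illustrative data:

```python
def r_squared(xs, ys):
    """Share of the variation of y explained by the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

r2 = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
```

For these data the line explains over 99% of the variation; the remainder, 100% − R², is the unexplained share.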

Applying a Regression Line to Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of the y values for observations that have a particular value of x by substituting that value into the equation of the regression line.

We use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to construct confidence limits for the whole line: the band or region that contains the true line, for example at the 95% confidence level.

Simple regression plans

Simple regression designs contain one continuous predictor. If there are 3 observations with predictor values P of 7, 4, and 9, and the design includes a first-order effect of P, then the design matrix X will be

X = [[1, 7], [1, 4], [1, 9]]

and the regression equation using P for X1 is

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix will be raised to the second power:

X = [[1, 49], [1, 16], [1, 81]]

and the equation will take the form

Y = b0 + b1·P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs and other designs containing only continuous predictors (since there are simply no categorical predictors to recode). Regardless of the coding method chosen, the values of the continuous variables are used directly as the values of the X variables; no recoding is performed. In addition, when describing regression designs, one can omit the design matrix X and work only with the regression equation.
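The design matrices described above can be built directly; the predictor values 7, 4 and 9 are those from the example:

```python
P = [7, 4, 9]                            # predictor values of the 3 observations
X_linear = [[1, p] for p in P]           # intercept column plus first-order P
X_quadratic = [[1, p ** 2] for p in P]   # P squared for the quadratic effect
```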

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data are compiled from a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are given as observation names. Information on each variable is presented below:

Fig. 4. Table of variable specifications.

Research problem

For this example, we will analyze the correlates of poverty, that is, the variables that best predict the percentage of families below the poverty line. We will therefore treat variable 3 (Pt_Poor) as the dependent variable.

We can put forward a hypothesis: changes in population size and the percentage of families that are below the poverty line are related. It seems reasonable to expect that poverty leads to out-migration, so there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as a predictor variable.

View results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the unstandardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population, the poverty rate increases by 0.40374. The upper and lower (default) 95% confidence limits for this unstandardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note also the standardized coefficient, which for simple regression designs is the Pearson correlation coefficient: it equals -.65, meaning that for every standard-deviation decrease in population, the poverty rate increases by .65 of a standard deviation.

Variable distribution

Correlation coefficients can become significantly overestimated or underestimated if large outliers are present in the data. Let's study the distribution of the dependent variable Pt_Poor by district. To do this, let's build a histogram of the variable Pt_Poor.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-hand columns) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that outliers should be considered when an observation (or observations) does not fall within the interval (mean ± 3 standard deviations). In that case it is worth repeating the analysis with and without the outliers to make sure they do not seriously affect the correlation between the variables.

If one has an a priori hypothesis about the relationship between the given variables, it is useful to test it on the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., there is a 95% probability that the regression line lies between the two dotted curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Bottom line

This example showed how to analyze a simple regression design. Interpretations of the unstandardized and standardized regression coefficients were presented. The importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between a predictor and the dependent variable was demonstrated.


