Data Analysis Basics

Calculating Regression Equation Coefficients

The system of equations (7.8) cannot be solved unambiguously from the available experimental data (ED), since the number of unknowns is always greater than the number of equations. To overcome this problem, additional assumptions are needed. Common sense suggests choosing the coefficients of the polynomial so as to minimize the error of approximating the ED. Various measures can be used to evaluate approximation errors; the root mean square error is the most widely used. On its basis a special method for estimating the coefficients of regression equations has been developed: the least squares method (LSM). This method yields maximum likelihood estimates of the unknown coefficients when the errors are normally distributed, but it can also be applied under any other distribution of the factors.

The LSM is based on the following assumptions:

· the error values and the factor values are independent, and therefore uncorrelated, i.e. it is assumed that the mechanisms generating the noise are not related to the mechanism generating the factor values;

· the mathematical expectation of the error ε must be equal to zero (the constant component is included in the coefficient a₀); in other words, the error is a centered quantity;

· the sample estimate of the error variance must be minimal.

Let us consider the use of the LSM for linear regression of standardized values. For centered quantities u_j the coefficient a₀ equals zero, and the linear regression equation takes the form

ŷ = a₂u₂ + a₃u₃ + … + aₜuₜ. (7.9)

A special sign "^" is used here to denote the indicator values calculated from the regression equation, in contrast to the values obtained from the observations.

Using the least squares method, the coefficients of the regression equation are chosen so as to provide an unconditional minimum of the expression

w = Σᵢ (yᵢ − ŷᵢ)². (7.10)

The minimum is found by equating to zero all partial derivatives of expression (7.10) with respect to the unknown coefficients and solving the resulting system of equations

∂w/∂a_j = 0, j = 2, 3, …, t. (7.11)

Successively carrying out the transformations and using the previously introduced estimates of the correlation coefficients, we obtain

r_y,j = a₂·r_2,j + a₃·r_3,j + … + aₜ·r_t,j, j = 2, 3, …, t. (7.12)

Thus we obtain t − 1 linear equations whose solution uniquely determines the values a₂, a₃, …, aₜ.
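For illustration, system (7.12) can be solved numerically. The sketch below is not from the original text; it assumes Python with numpy, and the correlation matrix R and the vector of correlations with the indicator are made-up values.

```python
import numpy as np

# Hypothetical estimates of the pairwise correlations r_jk between the
# standardized parameters u2, u3 (matrix R) and of the correlations
# r_y,j between the indicator and each parameter (vector r_y).
R = np.array([[1.00, 0.35],
              [0.35, 1.00]])
r_y = np.array([0.80, 0.55])

# System (7.12) in matrix form: R @ a = r_y; its solution gives a2, a3.
a = np.linalg.solve(R, r_y)
print("a2, a3 =", a)
```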

If the linear model is inaccurate or the parameters are measured imprecisely, the least squares method still finds the coefficient values for which the linear model best describes the real object in the sense of the chosen root-mean-square criterion.

When there is only one parameter, the linear regression equation becomes

ŷ = a₂u₂.

The coefficient a₂ is found from the equation r_y,2 = a₂·r_2,2. Then, since r_2,2 = 1, the required coefficient is

a₂ = r_y,2. (7.13)

Relationship (7.13) confirms the earlier statement that the correlation coefficient is a measure of the linear relationship between two standardized parameters.

Substituting the found value of the coefficient a₂ into the expression for w and taking into account the properties of centered and normalized quantities, we obtain the minimum value of this function, equal to 1 − r²_y,2. The value 1 − r²_y,2 is called the residual variance of the random variable y relative to the random variable u₂. It characterizes the error incurred when the indicator is replaced by the function ŷ = a₂u₂ of the parameter. Only when |r_y,2| = 1 is the residual variance zero, so that the linear approximation of the indicator involves no error.

Passing from the centered and normalized values of the indicator and the parameter back to the original values, we obtain the equation

ŷ = ȳ + r_y,x·(s_y/s_x)·(x − x̄). (7.14)

This equation is also linear in the correlation coefficient. It is easy to see that, for linear regression, centering and normalization reduce the dimension of the system of equations by one, i.e. they simplify the determination of the coefficients and give the coefficients themselves a clear meaning.
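A minimal sketch of this passage from standardized to original variables (illustrative simulated data, assuming numpy): the slope is recovered as r·s_y/s_x and the free term restores the centering.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50.0, 6.0, 200)             # parameter (simulated data)
y = 0.8 * x + rng.normal(0.0, 2.0, 200)    # indicator

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)      # slope from r and the s.d.'s
a0 = y.mean() - b * x.mean()               # free term restores centering
print(f"y^ = {a0:.3f} + {b:.3f}*x")
```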

The use of least squares for nonlinear functions differs little from the scheme just considered (only the coefficient a₀ in the original equation is no longer zero).

For example, suppose it is necessary to determine the coefficients of the parabolic regression

ŷ = a₀ + a₂u + a₃u².

The sample error variance is

w = (1/n)·Σᵢ (yᵢ − a₀ − a₂uᵢ − a₃uᵢ²)².

Based on it, we can obtain the following system of equations:

∂w/∂a₀ = 0, ∂w/∂a₂ = 0, ∂w/∂a₃ = 0.

After transformations, the system of equations takes the form

n·a₀ + a₂·Σuᵢ + a₃·Σuᵢ² = Σyᵢ,
a₀·Σuᵢ + a₂·Σuᵢ² + a₃·Σuᵢ³ = Σuᵢyᵢ,
a₀·Σuᵢ² + a₂·Σuᵢ³ + a₃·Σuᵢ⁴ = Σuᵢ²yᵢ.

Taking into account the properties of the moments of standardized quantities (the first moment is zero, the second is unity), we write

a₀ = −a₃, a₂ + a₃·m₃ = r_y,u, a₂·m₃ + a₃·(m₄ − 1) = (1/n)·Σuᵢ²yᵢ,

where m₃ and m₄ are the third and fourth sample moments of u.

The determination of nonlinear regression coefficients thus reduces to solving a system of linear equations. For this one can use general-purpose numerical packages or specialized statistical software.
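As a sketch of such a solution (simulated data, not from the original text): the parabolic normal equations above are exactly the matrix system (XᵀX)a = Xᵀy, which any linear solver handles.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(300)
y = 1.2 * u + 0.5 * (u**2 - 1.0) + rng.normal(0.0, 0.3, 300)

# Columns 1, u, u^2 make (X'X)a = X'y coincide with the normal
# equations for the parabolic regression written above.
X = np.column_stack([np.ones_like(u), u, u**2])
a0, a2, a3 = np.linalg.solve(X.T @ X, X.T @ y)
print(f"a0 = {a0:.3f}, a2 = {a2:.3f}, a3 = {a3:.3f}")
```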

As the degree of the regression equation increases, so does the order of the distribution moments of the parameters used to determine the coefficients. Thus, determining the coefficients of a second-degree regression equation uses moments of the parameter distribution up to the fourth order inclusive. It is known that the accuracy and reliability of moment estimates from a limited sample of ED fall sharply as their order grows, so using polynomials of degree higher than two in regression equations is inadvisable.

The quality of the resulting regression equation is assessed by the closeness between the observed values of the indicator and the values predicted by the regression equation at given points of the parameter space. If they are close, the problem of regression analysis can be considered solved. Otherwise, the regression equation should be changed (a different degree of polynomial or an altogether different type of equation chosen) and the parameter-estimation calculations repeated.

If there are several indicators, the problem of regression analysis is solved independently for each of them.

Analyzing the essence of the regression equation, the following points should be noted. The considered approach does not provide separate (independent) estimation of the coefficients: a change in the value of one coefficient entails changes in the values of the others. The coefficients obtained should not be treated as the contributions of the corresponding parameters to the value of the indicator. The regression equation is merely a good analytical description of the available ED, not a law relating the parameters and the indicator. This equation is used to calculate the indicator within the given range of parameter variation. Its suitability for calculations outside this range is limited: it can be used for interpolation problems and, to a limited extent, for extrapolation.



The main source of forecast inaccuracy is not so much the uncertainty of extrapolating the regression line as the considerable variation of the indicator due to factors not accounted for in the model. The forecasting ability is limited by the condition that the parameters not included in the model, and the nature of the influence of the factors that are included, remain stable. If the external environment changes sharply, the compiled regression equation loses its meaning. One must not substitute into the regression equation factor values that differ substantially from those represented in the ED; it is recommended not to go beyond one third of the parameter's range of variation beyond both the maximum and the minimum values of the factor.

A forecast obtained by substituting the expected parameter value into the regression equation is a point forecast. The probability that it is realized exactly is negligible, so it is advisable to determine a confidence interval for the forecast. For individual values of the indicator, the interval must take into account both the error in the position of the regression line and the deviation of individual values from this line. The average error in predicting the indicator y for factor x is

s_ŷ = √( s_ȳ² + s² ),

where s_ȳ = s·√(1/n + (x_k − x̄)² / Σᵢ(xᵢ − x̄)²) is the average error of the position of the regression line in the population at x = x_k;

s² is the estimate of the variance of the deviations of the indicator from the regression line in the population;

x_k is the expected value of the factor.

The confidence limits of the forecast, for example for regression equation (7.14), are determined by the expression

ŷ(x_k) ± t_α·s_ŷ,

where t_α is Student's coefficient for the chosen confidence level.
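A sketch of this interval computation (made-up data; t_crit is taken from Student's table outside the code):

```python
import numpy as np

def forecast_interval(x, y, x_k, t_crit):
    """Point forecast and confidence limits for an individual y at x = x_k."""
    n = len(x)
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    s2 = np.sum((y - (a + b * x))**2) / (n - 2)   # residual variance estimate
    # error of the line position plus individual scatter, as in the text
    se = np.sqrt(s2 * (1.0 + 1.0/n
                       + (x_k - x.mean())**2 / np.sum((x - x.mean())**2)))
    y_hat = a + b * x_k
    return y_hat, y_hat - t_crit * se, y_hat + t_crit * se

x = np.array([32.1, 35.0, 38.7, 41.7, 44.0, 47.3, 49.8, 54.2])  # illustrative
y = np.array([17.0, 21.0, 23.5, 26.4, 28.8, 31.7, 33.5, 37.6])
print(forecast_interval(x, y, x_k=44.0, t_crit=2.45))
```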

A negative free term a₀ in the regression equation for the original variables means that the domain of existence of the indicator does not include zero parameter values. If a₀ > 0, the domain of existence of the indicator includes zero parameter values, and the coefficient itself characterizes the average value of the indicator in the absence of parameter influences.

Problem 7.2. Construct a regression equation for channel capacity based on the sample given in Table 7.1.

Solution. For this sample, the analytical dependence was largely established within the correlation analysis: the capacity depends only on the signal-to-noise ratio. It remains to substitute the previously calculated parameter values into expression (7.14). The equation for capacity takes the form

ŷ = 26.47 − 0.93×41.68×5.39/6.04 + 0.93×(5.39/6.04)×X = −8.121 + 0.830·X.
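The same arithmetic as a check (ȳ = 26.47, x̄ = 41.68, s_y = 5.39, s_x = 6.04, r = 0.93 are the sample estimates quoted above; small differences from −8.121 are rounding):

```python
y_mean, x_mean = 26.47, 41.68    # sample means from Table 7.1
s_y, s_x, r = 5.39, 6.04, 0.93   # standard deviations and correlation

b = r * s_y / s_x                # 0.830
a0 = y_mean - b * x_mean         # about -8.12
print(f"y^ = {a0:.3f} + {b:.3f}*X")
```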

The calculation results are presented in Table 7.5.

Table 7.5

No. | Channel capacity Y | Signal-to-noise ratio X | Function value ŷ | Error ε
1   | 26.37 | 41.98 | 26.72 | −0.35
2   | 28.00 | 43.83 | 28.25 | −0.25
3   | 27.83 | 42.83 | 27.42 | 0.41
4   | 31.67 | 47.28 | 31.12 | 0.55
5   | 23.50 | 38.75 | 24.04 | −0.54
6   | 21.04 | 35.12 | 21.03 | 0.01
7   | 16.94 | 32.07 | 18.49 | −1.55
8   | 37.56 | 54.25 | 36.90 | 0.66
9   | 18.84 | 32.70 | 19.02 | −0.18
10  | 25.77 | 40.51 | 25.50 | 0.27
11  | 33.52 | 49.78 | 33.19 | 0.33
12  | 28.21 | 43.84 | 28.26 | −0.05
13  | 28.76 | 44.03 |       |

The study of correlation dependencies examines relationships between variables in which the values of one variable, taken as dependent, change "on average" with the values taken by another variable regarded as a cause relative to the dependent one. The action of this cause occurs amid a complex interaction of various factors, so the manifestation of the pattern is obscured by the influence of chance. Calculating the average values of the resultant attribute for a given group of values of the factor attribute partly eliminates the influence of chance; calculating the parameters of a theoretical regression line eliminates it further and yields an unambiguous (in form) change of y with a change of the factor x.

To study stochastic relationships, the method of comparing two parallel series, the method of analytical groupings, correlation analysis, regression analysis, and some nonparametric methods are widely used. In general, the task of statistics in studying relationships is not only to quantify their presence, direction, and strength, but also to determine the form (analytical expression) of the influence of the factor attributes on the resultant one. Methods of correlation and regression analysis serve this purpose.

CHAPTER 1. REGRESSION EQUATION: THEORETICAL FOUNDATIONS

1.1. Regression equation: essence and types of functions

Regression (Latin regressio, "backward movement, a transition from more complex forms of development to simpler ones") is one of the basic concepts of probability theory and mathematical statistics; it expresses the dependence of the average value of a random variable on the values of another random variable or of several random variables. The concept was introduced by Francis Galton in 1886.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main tendency, of the relationship.

The theoretical regression line should reflect the change in the average values of the resultant attribute y as the values of the factor attribute x change, assuming that all other causes, random with respect to the factor x, cancel out completely. Consequently, this line must be drawn so that the sum of the deviations of the points of the correlation field from the corresponding points of the theoretical regression line is zero and the sum of the squares of these deviations is minimal.

The regression equation y = f(x) is a formula for the statistical relationship between the variables.

A straight line in the plane (in two-dimensional space) is given by the equation y = a + b·x. In more detail: the variable y can be expressed through a constant (a) and a slope (b) multiplied by the variable x. The constant is sometimes called the intercept, and the slope the regression coefficient or b-coefficient.

An important stage of regression analysis is determining the type of function used to characterize the dependence between the attributes. The main basis should be a substantive analysis of the nature of the dependence under study and of its mechanism. It is not always possible to justify the form of the relationship between each factor and the performance indicator theoretically, since the socio-economic phenomena studied are very complex, and the factors shaping their level are closely intertwined and interact with one another. Therefore theoretical analysis often yields only general conclusions about the direction of the relationship, its possible variation across the population under study, the legitimacy of a linear form, the possible presence of extreme values, and so on. A necessary complement to such assumptions is analysis of the actual data.

An approximate idea of the relationship line can be obtained from the empirical regression line. The empirical regression line is usually a broken line with more or less significant breaks, because the influence of other, unaccounted-for factors affecting the variation of the resultant attribute is incompletely extinguished in the averages when the number of observations is not large enough. Hence an empirical line can be used to select and justify the type of theoretical curve only when the number of observations is sufficiently great.

One element of specific studies is the comparison of different dependence equations, based on quality criteria for the approximation of the empirical data by competing model variants. The following types of functions are used most often to characterize relationships between economic indicators:

1. Linear: y = a + b·x;

2. Hyperbolic: y = a + b/x;

3. Exponential: y = a·bˣ;

4. Parabolic: y = a + b·x + c·x²;

5. Power: y = a·xᵇ;

6. Logarithmic: y = a + b·ln x;

7. Logistic: y = a / (1 + b·e^(−c·x)).

A model with one explanatory and one explained variable is a paired regression model. If two or more explanatory (factor) variables are used, one speaks of a multiple regression model; linear, power, hyperbolic, exponential, and other types of functions connecting these variables may be chosen.

To find the parameters a and b of the regression equation, the least squares method is used. In applying it to find the function that best fits the empirical data, it is required that the sum of squared deviations of the empirical points from the theoretical regression line be minimal.

The least squares criterion can be written as follows:

S = Σᵢ (yᵢ − a − b·xᵢ)² → min.

Consequently, using the least squares method to determine the parameters a and b of the line that best matches the empirical data reduces to an extremum problem.
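A sketch of this extremum problem solved in closed form (illustrative data): setting dS/da = 0 and dS/db = 0 yields the normal equations, whose solution is coded directly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Closed-form solution of the normal equations for S -> min.
b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
a = y.mean() - b * x.mean()
print(f"a = {a:.3f}, b = {b:.3f}")
```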

Regarding these estimates, the following conclusions can be drawn:

1. Least squares estimators are functions of the sample, which makes them easy to calculate.

2. Least squares estimates are point estimates of the theoretical regression coefficients.

3. The empirical regression line necessarily passes through the point (x̄, ȳ).

4. The empirical regression equation is constructed so that the sum of the deviations vanishes:

Σᵢ (yᵢ − ŷᵢ) = 0.
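Properties 3 and 4 are easy to verify numerically; a small check under the same assumptions as the previous sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

print(np.isclose(resid.sum(), 0.0))             # property 4: deviations sum to 0
print(np.isclose(y.mean(), a + b * x.mean()))   # property 3: line through (x-bar, y-bar)
```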

A graphical representation of the empirical and theoretical regression lines is presented in Figure 1.


The parameter b in the equation is the regression coefficient. With a direct relationship the regression coefficient is positive, with an inverse relationship it is negative. The regression coefficient shows by how much, on average, the value of the resultant attribute y changes when the factor attribute x changes by one. Geometrically, the regression coefficient is the slope of the straight line depicting the correlation equation relative to the x axis (for the equation y = a + b·x).

The branch of multivariate statistical analysis devoted to the recovery of dependencies is called regression analysis. The term "linear regression analysis" is used when the function under consideration depends linearly on the estimated parameters (the dependence on the independent variables may be arbitrary). The estimation theory for the unknown parameters is well developed precisely in the case of linear regression analysis; if there is no linearity and one cannot pass to a linear problem, then, as a rule, good properties cannot be expected of the estimates. Note that a dependence in the form of a polynomial is still linear in the parameters. Whereas the calculation of the correlation characterizes the strength of the relationship between two variables, regression analysis serves to determine the type of this relationship and makes it possible to predict the value of one (dependent) variable from the value of another (independent) variable. For linear regression analysis the dependent variable must have an interval (or at least ordinal) scale. Binary logistic regression reveals the dependence of a dichotomous variable on some other variable of any scale; the same application conditions hold for probit analysis. If the dependent variable is categorical with more than two categories, multinomial logistic regression is the suitable method. Nonlinear relationships between interval-scale variables are analyzed with the nonlinear regression method, designed for this purpose.

Regression coefficients show the intensity of the influence of the factors on the performance indicator. If the factor indicators are standardized beforehand, then b₀ equals the average value of the performance indicator in the population, and the coefficients b₁, b₂, …, bₙ show by how many units the performance indicator deviates from its average when the corresponding factor indicator deviates from its mean (zero) by one standard deviation. Thus regression coefficients characterize the significance of individual factors for raising the level of the performance indicator. Specific values of the regression coefficients are determined from the empirical data by the least squares method (by solving systems of normal equations).

A regression line is the line that most accurately reflects the distribution of the experimental points on the scatter diagram and whose slope characterizes the relationship between two interval variables.

The regression line is most often sought in the form of a linear function (linear regression) that best approximates the desired curve. This is done by least squares, minimizing the sum of squared deviations of the actually observed values yᵢ from their estimates ŷᵢ (estimates given by the straight line claiming to represent the desired regression relationship):

Σᵢ₌₁ᴹ (yᵢ − ŷᵢ)² → min

(M is the sample size). This approach is based on the well-known fact that the sum appearing in the expression above attains its minimum precisely when the estimates ŷᵢ are the least squares estimates.
57. Main tasks of correlation theory.

Correlation theory is an apparatus for assessing the closeness of relationships between phenomena that are not necessarily in cause-and-effect relationships. Correlation theory assesses stochastic, but not causal, relationships. The author, together with M. L. Lukatskaya, attempted to obtain estimates for causal relationships; however, the question of the cause-and-effect relationships of phenomena, of how to identify cause and effect, remains open, and it appears to be fundamentally unsolvable at the formal level.

The theory of correlation and its application to production analysis.

Correlation theory, one of the branches of mathematical statistics, allows one to make reasonable assumptions about the limits within which, with a certain degree of reliability, the parameter under study will lie if other statistically related parameters take certain values.

In correlation theory, it is customary to distinguish two main tasks.

The first task of correlation theory is to establish the form of the correlation, i.e. the type of regression function (linear, quadratic, etc.).

The second task of correlation theory is to assess the closeness (strength) of the correlation.

The closeness of the correlation dependence of Y on X is assessed by the dispersion of the Y values around the conditional mean: large dispersion indicates a weak dependence of Y on X, small dispersion a strong one.
58. Correlation table and its numerical characteristics.

In practice, as a result of independent observations of the quantities X and Y one deals, as a rule, not with the complete set of all possible pairs of values of these quantities but only with a limited sample from the general population, the sample size n being defined as the number of pairs in the sample.

Let the quantity X take in the sample the values x₁, x₂, …, x_m, where m is the number of distinct values, each of which may in general be repeated in the sample. Let the quantity Y take in the sample the values y₁, y₂, …, y_k, where k is the number of distinct values, each of which may likewise be repeated. The data are then entered into a table together with their frequencies of occurrence. Such a table of grouped data is called a correlation table.

The first stage of statistical processing of the results is the compilation of a correlation table.

Y\X  | x1   | x2   | ... | xm   | n_y
y1   | n11  | n21  | ... | nm1  | n_y1
y2   | n12  | n22  | ... | nm2  | n_y2
...  |      |      |     |      |
yk   | n1k  | n2k  | ... | nmk  | n_yk
n_x  | n_x1 | n_x2 | ... | n_xm | n

The first row of the main part of the table lists, in ascending order, all values of X found in the sample; the first column lists, in ascending order, all values of Y found in the sample. At the intersections of the corresponding rows and columns stand the frequencies n_ij (i = 1, 2, …, m; j = 1, 2, …, k), equal to the number of occurrences of the pair (x_i; y_j) in the sample. For example, the frequency n_12 is the number of occurrences of the pair (x₁; y₂).

Further, n_xi = Σⱼ n_ij (1 ≤ i ≤ m) is the sum of the frequencies in the i-th column, n_yj = Σᵢ n_ij (1 ≤ j ≤ k) is the sum of the frequencies in the j-th row, and Σᵢ n_xi = Σⱼ n_yj = n.

Analogues of the usual formulas, computed from the correlation-table data, have the form:

x̄ = (1/n)·Σᵢ n_xi·x_i,  ȳ = (1/n)·Σⱼ n_yj·y_j,
s_x² = (1/n)·Σᵢ n_xi·(x_i − x̄)²,  s_y² = (1/n)·Σⱼ n_yj·(y_j − ȳ)²,
r = ( (1/n)·Σᵢⱼ n_ij·x_i·y_j − x̄·ȳ ) / (s_x·s_y).
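A sketch of these grouped-data formulas (the frequency table is invented for illustration, assuming numpy):

```python
import numpy as np

x_vals = np.array([1.0, 2.0, 3.0])       # distinct X values
y_vals = np.array([10.0, 20.0])          # distinct Y values
freq = np.array([[3, 5, 2],              # n_ij for the row y1
                 [1, 4, 6]])             # n_ij for the row y2
n = freq.sum()

n_x = freq.sum(axis=0)                   # column sums n_xi
n_y = freq.sum(axis=1)                   # row sums n_yj

x_bar = (n_x * x_vals).sum() / n
y_bar = (n_y * y_vals).sum() / n
s_x = np.sqrt((n_x * (x_vals - x_bar)**2).sum() / n)
s_y = np.sqrt((n_y * (y_vals - y_bar)**2).sum() / n)
r = ((freq * np.outer(y_vals, x_vals)).sum() / n - x_bar * y_bar) / (s_x * s_y)
print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.3f}, r = {r:.3f}")
```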
59. Empirical and theoretical regression lines.

The theoretical regression line can be calculated in this case from the results of individual observations. To solve the system of normal equations we need the sums Σx, Σy, Σxy and Σx². We have data on the volume of cement production and the volume of fixed production assets in 1958. The task: to study the relationship between the volume of cement production (in physical terms) and the volume of fixed assets. [1]

The less the theoretical regression line (calculated from the equation) deviates from the actual (empirical) one, the smaller the average approximation error.

The process of finding a theoretical regression line involves fitting the empirical regression line using the least squares method.

The process of finding the theoretical regression line is called alignment of the empirical regression line; it consists in choosing and justifying the type of curve and calculating the parameters of its equation.

Empirical regression is built from analytical or combinational grouping data and represents the dependence of the group mean values of the resultant attribute on the group mean values of the factor attribute. Its graphical representation is a broken line made up of points whose abscissas are the group means of the factor attribute and whose ordinates are the group means of the resultant attribute. The number of points equals the number of groups in the grouping.

The empirical regression line reflects the main tendency of the relationship under consideration. If it approaches a straight line in appearance, a linear correlation between the attributes can be assumed; if the connection line approaches a curve, this may indicate a curvilinear correlation relationship.
60. Sample correlation and regression coefficients.

If the dependence between the characteristics on the graph indicates a linear correlation, the correlation coefficient r is calculated; it allows one to assess the closeness of the relationship between the variables and to find out what share of the changes in one attribute is due to the influence of the main attribute and what share to other factors. The coefficient varies from −1 to +1. If r = 0, there is no linear connection between the attributes: r = 0 indicates only the absence of a linear correlation dependence, not the absence of correlation altogether, much less of statistical dependence. If r = ±1, the relationship is complete (functional); in this case all observed values lie on the regression line, which is a straight line.
The practical significance of the correlation coefficient is determined by its squared value, called the coefficient of determination.
The regression is approximated (approximately described) by the linear function y = kX + b. For the regression of Y on X the regression equation is ŷ_x = r_yx·X + b (1). The slope r_yx of the line of regression of Y on X is called the regression coefficient of Y on X.

If equation (1) is found from sample data, it is called the sample regression equation; accordingly, r_yx is the sample regression coefficient of Y on X, and b the sample intercept of the equation. The regression coefficient measures the variation in Y per unit variation in X. The parameters of the regression equation (the coefficients r_yx and b) are found by the least squares method.
61. Assessing the significance of the correlation coefficient and the closeness of the correlation in the general population

The significance of correlation coefficients is checked using Student's test:

t = r / m_r,

where m_r = √((1 − r²)/(n − 2)) is the root mean square error of the correlation coefficient.

If the calculated value exceeds the tabular value, the value of the correlation coefficient can be considered significant. Tabular values of t are found from the table of Student's t-test, taking into account the number of degrees of freedom (v = n − 1) and the confidence level (in economic calculations usually 0.05 or 0.01). In our example the number of degrees of freedom is n − 1 = 40 − 1 = 39; at confidence level P = 0.05, t = 2.02. Since the actual value in all cases exceeds the tabular one, the relationship between the resultant and factor indicators is reliable and the correlation coefficients are significant.
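A sketch of this significance check with the numbers quoted above (r and n are illustrative; the tabular t is looked up outside the code):

```python
import math

r, n = 0.93, 40                          # sample correlation and sample size
m_r = math.sqrt((1 - r**2) / (n - 2))    # mean square error of r
t = r / m_r
t_table = 2.02                           # Student's t at P = 0.05 (from the table)
print(f"t = {t:.2f}, significant: {t > t_table}")
```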

An estimate of the correlation coefficient calculated from a limited sample practically always differs from zero. This does not mean, however, that the correlation coefficient in the general population also differs from zero. It is necessary to evaluate the significance of the sample value of the coefficient or, in the language of statistical hypothesis testing, to test the hypothesis that the correlation coefficient equals zero. If the hypothesis H₀ that the correlation coefficient is zero is rejected, the sample coefficient is significant and the corresponding quantities are connected by a linear relationship. If H₀ is accepted, the coefficient estimate is not significant and the quantities are not linearly related to each other (if, for physical reasons, the factors may nevertheless be related, it is better to say that the relationship has not been established from the available ED). Testing the hypothesis about the significance of the correlation-coefficient estimate requires knowledge of the distribution of this random variable. The distribution of the value ρ_ik has been studied only for the special case when the random variables U_j and U_k are distributed according to the normal law.

As the criterion for testing the null hypothesis H₀ one uses the random variable t = ρ_ik·√(n − 2) / √(1 − ρ_ik²). If the modulus of the correlation coefficient is relatively far from unity, then under the null hypothesis t is distributed according to Student's law with n − 2 degrees of freedom. The competing hypothesis H₁ corresponds to the statement that ρ_ik is nonzero (greater or less than zero), so the critical region is two-sided.
62. Calculation of the sample correlation coefficient and construction of the sample linear regression equation.

The sample correlation coefficient is found by the formula

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n·s_x·s_y),

where s_x and s_y are the sample standard deviations of the values of X and Y.

The sample correlation coefficient shows the closeness of the linear relationship between X and Y: the closer |r| is to unity, the stronger the linear relationship between X and Y.

Simple linear regression finds the linear relationship between one input variable and one output variable. To do this, a regression equation is determined: a model reflecting the dependence of the values of Y, the dependent variable, on the values of x, the independent variable, described by the equation

ŷ = A₀ + A₁·x,

where A₀ is the free term of the regression equation and A₁ is the regression coefficient.

Then the corresponding straight line, the regression line, is constructed. The coefficients A₀ and A₁, also called model parameters, are chosen so that the sum of the squared deviations of the points corresponding to the real observations from the regression line is minimal; they are selected by the least squares method. In other words, simple linear regression describes the linear model that best approximates the relationship between one input variable and one output variable.

The concept of regression. The dependence between the variables x and y can be described in different ways. In particular, any form of connection can be expressed by a general equation in which y is treated as the dependent variable, a function of the other, independent variable x, called the argument. The correspondence between argument and function can be specified by a table, a formula, a graph, and so on. A change of the function depending on a change of one or several arguments is called regression. All the means used to describe correlations constitute the content of regression analysis.

To express regression one uses correlation equations (regression equations), empirical and theoretically calculated regression series, their graphs (called regression lines), and linear and nonlinear regression coefficients.

Regression indicators express the correlation bilaterally: they take into account the change in the average values of the attribute Y as the values x_i of the attribute X change and, conversely, show the change in the average values of the attribute X over the changed values y_i of the attribute Y. The exception is time series, which show changes of attributes in time; the regression of such series is one-sided.

There are many different forms and types of correlations. The task reduces to identifying the form of the connection in each specific case and expressing it by the appropriate correlation equation, which makes it possible to anticipate possible changes of one attribute Y from known changes of another, X, related to the first correlationally.

12.1 Linear regression

The regression equation. The results of observations of some biological object with respect to the correlated attributes x and y can be represented by points in the plane in a system of rectangular coordinates. The result is a scatter diagram that allows one to judge the form and closeness of the relationship between the varying attributes. Quite often this relationship looks like a straight line or can be approximated by one.

The linear relationship between the variables x and y is described by an equation of general form in which a, b, c, d, … are parameters determining the relationships between the arguments x₁, x₂, x₃, …, x_m and the function. In practice not all possible arguments are taken into account but only some of them; in the simplest case only one:

ŷ = a + b·x. (1)

In the linear regression equation (1), a is the free term and the parameter b determines the slope of the regression line relative to the rectangular coordinate axes. In analytical geometry this parameter is called the slope, in biometrics the regression coefficient. A visual representation of this parameter and of the position of the regression lines of Y on X and of X on Y in the rectangular coordinate system is given in Fig. 1.

Fig. 1. Regression lines of Y on X and of X on Y in the system of rectangular coordinates

The regression lines, as shown in Fig. 1, intersect at the point O(x̄, ȳ), corresponding to the arithmetic mean values of the mutually correlated attributes Y and X. When plotting regression graphs, the values of the independent variable X are laid along the abscissa and the values of the dependent variable (function) Y along the ordinate. The line AB passing through O(x̄, ȳ) corresponds to a complete (functional) relationship between Y and X, when the correlation coefficient r = ±1. The stronger the connection between Y and X, the closer the regression lines are to AB; conversely, the weaker the connection, the farther the regression lines lie from AB. If there is no connection between the attributes, the regression lines are at right angles to each other and r = 0.

Since regression indicators express the correlation bilaterally, regression equation (1) should be written in two forms:

ŷ_x = a_yx + b_yx·x, x̂_y = a_xy + b_xy·y.

The first formula determines the average value of Y as the attribute X changes by one unit of its measure; the second determines the average value of X as the attribute Y changes by one unit of its measure.

The regression coefficient. The regression coefficient shows by how much, on average, the value of one attribute y changes when the measure of the other attribute X, correlated with Y, changes by one unit. This indicator is given by the formula

b_yx = r·(s_y / s_x).

Here the values of s are multiplied by the sizes of the class intervals λ if they were found from variation series or correlation tables.

The regression coefficient can be calculated without computing the standard deviations s_y and s_x, and also when the correlation coefficient is unknown, directly from the data:

b_yx = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)².

The relationship between the regression and correlation coefficients. Comparing formulas (11.1) (topic 11) and (12.5), we see that their numerators are the same, which indicates a connection between these indicators. This relationship is expressed by the equality

r_xy = ±√(b_yx · b_xy). (6)

Thus the correlation coefficient equals the geometric mean of the coefficients b_yx and b_xy. Formula (6) makes it possible, first, to determine the correlation coefficient r_xy from the known values of the regression coefficients b_yx and b_xy and, second, to check the correctness of the computation of this correlation indicator r_xy between the varying attributes X and Y.
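A quick numerical verification of equality (6) on simulated data (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 500)
y = 0.7 * x + rng.normal(0.0, 0.5, 500)

cov = np.cov(x, y, ddof=1)[0, 1]
b_yx = cov / np.var(x, ddof=1)           # regression of y on x
b_xy = cov / np.var(y, ddof=1)           # regression of x on y
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r, np.sign(b_yx) * np.sqrt(b_yx * b_xy)))   # True
```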

Like the correlation coefficient, the regression coefficient characterizes only a linear relationship and carries a plus sign for a positive relationship and a minus sign for a negative one.

Determination of linear regression parameters. It is known that the sum of squared deviations of the variants xᵢ from their mean is smaller than the sum of squared deviations from any other value; this theorem underlies the least squares method. For linear regression [see formula (1)] the requirement of the theorem is satisfied by a system of equations called normal:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx².

Solving these equations jointly for the parameters a and b gives

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), a = ȳ − b·x̄.

Considering the two-way nature of the relationship between the variables Y and X, the formula for determining the parameter a should be written in both directions:

a_yx = ȳ − b_yx·x̄ and a_xy = x̄ − b_xy·ȳ. (7)

The parameter b, the regression coefficient, is determined by the formulas already given above: b_yx = r·(s_y/s_x) or b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².

Construction of empirical regression series. When there is a large number of observations, regression analysis begins with the construction of empirical regression series. An empirical regression series is formed by calculating, for the values of one varying attribute X, the average values of another attribute Y correlated with X. In other words, constructing empirical regression series comes down to finding the group means of Y from the corresponding values of the attribute X.

An empirical regression series is a double series of numbers that can be represented by points in the plane; connecting these points by straight-line segments yields the empirical regression line. Empirical regression series, and especially their graphs, the regression lines, give a clear idea of the form and closeness of the correlation between the varying attributes.

Alignment of empirical regression series. The graphs of empirical regression series turn out, as a rule, to be broken rather than smooth lines. This is because, along with the main causes that determine the general pattern in the variability of the correlated attributes, their magnitude is affected by numerous secondary causes producing random fluctuations at the nodal points of the regression. To reveal the main tendency (trend) of the conjugate variation of the correlated attributes, the broken lines must be replaced by smooth, smoothly running regression lines. This process of replacing broken lines by smooth ones is called alignment of the empirical series and of the regression lines.

The graphical alignment method. This is the simplest method and requires no computation. Its essence is as follows: the empirical regression series is plotted in a rectangular coordinate system, the midpoints of the regression are outlined by eye, and a solid line is drawn through them with a ruler or French curve. The disadvantage of this method is obvious: it does not exclude the influence of the individual traits of the researcher on the result of aligning the empirical regression lines. Therefore, in cases where higher accuracy is needed when replacing broken regression lines by smooth ones, other methods of aligning empirical series are used.

The moving average method. The essence of this method is the sequential calculation of arithmetic means over two or three adjacent terms of the empirical series. It is especially convenient when the empirical series has many terms, so that the loss of two of them, the extreme ones, which is inevitable with this method of alignment, does not noticeably affect its structure.
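A sketch of the moving average alignment (illustrative series; the window of three adjacent terms matches the description above):

```python
import numpy as np

def moving_average(series, window=3):
    """Mean of `window` adjacent terms; the extreme terms are lost."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

series = np.array([5.0, 9.0, 6.0, 10.0, 8.0, 12.0, 11.0])
print(moving_average(series))   # two fewer terms than the input for window=3
```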

The least squares method. This method was proposed at the beginning of the 19th century by A. M. Legendre and, independently of him, by C. Gauss. It aligns empirical series most accurately. As shown above, it is based on the premise that the sum of squared deviations of the variants xᵢ from their mean is a minimum, whence the name of the method. It is objective and universal and is applied in the most varied cases when finding empirical equations of regression series and determining their parameters, being used not only in ecology but in technology as well.

The requirement of the least squares method is that the theoretical points of the regression line be obtained so that the sum of the squared deviations of the empirical observations yᵢ from these points is minimal, i.e.

Σᵢ (yᵢ − ŷᵢ)² = min.

By computing the minimum of this expression in accordance with the principles of mathematical analysis and transforming it appropriately, one obtains a system of so-called normal equations in which the unknowns are the required parameters of the regression equation and the known coefficients are determined by the empirical values of the attributes, usually the sums of their values and of their cross products.

Multiple linear regression. The relationship between several variables is usually expressed by a multiple regression equation, which can be linear or nonlinear. In its simplest form, multiple regression is an equation with two independent variables (x, z):

ŷ = a + b·x + c·z, (10)

where a is the free term of the equation and b and c are its parameters. To find the parameters of equation (10) by the least squares method, the following system of normal equations is used:

Σy = n·a + b·Σx + c·Σz,
Σxy = a·Σx + b·Σx² + c·Σxz,
Σzy = a·Σz + b·Σxz + c·Σz².
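A sketch solving this system for simulated data (the normal equations are again (XᵀX)p = Xᵀy):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 200)
z = rng.normal(0.0, 1.0, 200)
y = 1.0 + 2.0 * x - 0.5 * z + rng.normal(0.0, 0.2, 200)

# Columns 1, x, z turn (X'X)p = X'y into the normal equations above.
X = np.column_stack([np.ones_like(x), x, z])
a, b, c = np.linalg.solve(X.T @ X, X.T @ y)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```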

Dynamic series. Alignment of series. Changes of attributes in time form so-called time series, or dynamics series. Their characteristic feature is that the independent variable X is always the time factor and the dependent variable Y a changing attribute. Unlike ordinary regression series, the relationship between the variables X and Y is one-sided, since the time factor does not depend on the variability of the attributes. Despite these features, dynamics series can be likened to regression series and processed by the same methods.

Like regression series, empirical dynamics series bear the influence not only of the main factors but also of numerous secondary (random) ones that obscure the main tendency in the variability of the attributes, which in the language of statistics is called the trend.

Analysis of time series begins with revealing the shape of the trend. To do this, the time series is plotted as a line graph in a rectangular coordinate system, with time points (years, months, and other units of time) along the abscissa and the values of the dependent variable Y along the ordinate. If there is a linear relationship between X and Y (a linear trend), the most appropriate way to align the time series by least squares is the regression equation written in the form of deviations of the terms of the dependent-variable series from the arithmetic mean of the independent-variable series:

ŷ = ȳ + b·(x − x̄),

where b is the linear regression parameter.

Numerical characteristics of dynamics series. The main generalizing numerical characteristics of dynamics series are the geometric mean and the arithmetic mean close to it. They characterize the average rate at which the value of the dependent variable changes over certain periods of time.

The variability of the members of a dynamics series is assessed by the standard deviation. When choosing regression equations to describe time series, the shape of the trend is taken into account; it can be linear (or reduced to linear) or nonlinear. The correctness of the choice of regression equation is usually judged by the similarity of the empirically observed and calculated values of the dependent variable. A more accurate solution of this problem is the method of regression analysis of variance (topic 12, paragraph 4).

Correlation of time series. It is often necessary to compare the dynamics of parallel time series related to one another by certain general conditions, for example to find out the relationship between agricultural production and the growth of livestock numbers over a certain period. In such cases the characteristic of the relationship between the variables X and Y is the correlation coefficient r_xy (in the presence of a linear trend).

It is known that the trend of time series is, as a rule, obscured by fluctuations of the dependent-variable series. This gives rise to a twofold problem: measuring the dependence between the compared series without excluding the trend, and measuring the dependence between neighboring members of the same series with the trend excluded. In the first case the indicator of the closeness of connection between the compared time series is the correlation coefficient (if the relationship is linear), in the second the autocorrelation coefficient. These indicators have different meanings, although they are calculated by the same formulas (see topic 11).

It is easy to see that the value of the autocorrelation coefficient is affected by the variability of the members of the dependent-variable series: the less the members deviate from the trend, the higher the autocorrelation coefficient, and vice versa.

With a linear relationship between the two attributes under study, the regression coefficient is calculated in addition to the correlation.

In the case of a linear correlation, each change of one attribute corresponds to a definite change of the other. However, the correlation coefficient shows this relationship only in relative terms, in fractions of unity, whereas regression analysis gives this relationship in named units. The amount by which the first attribute changes on average when the second changes by one unit of measure is called the regression coefficient.

Unlike correlation analysis, regression analysis provides broader information, since calculating the two regression coefficients R_x/y and R_y/x makes it possible to determine both the dependence of the first attribute on the second and of the second on the first. Expressing the regression relationship by an equation allows one to determine, from a given value of one attribute, the value of the other.

The regression coefficient R is the product of the correlation coefficient and the ratio of the standard deviations calculated for each attribute. It is calculated by the formula

R_x/y = r·(S_x / S_y),

where R is the regression coefficient; S_x is the standard deviation of the first attribute, which changes in connection with a change in the second; S_y is the standard deviation of the second attribute, in connection with whose change the first attribute changes; r is the correlation coefficient between these attributes; x is the function and y the argument.

This formula determines the value of x when y changes by one unit of measure. If the reverse calculation is needed, the value of y when x changes by one unit is found by the formula

R_y/x = r·(S_y / S_x).

In this case the active role in the change of one attribute relative to the other is reversed; compared with the previous formula, the argument becomes the function and vice versa. The values S_x and S_y are taken in named units.

There is a clear relationship between the values of r and R, expressed by the fact that the product of the regression of x on y and the regression of y on x equals the square of the correlation coefficient, i.e.

R_x/y · R_y/x = r².

This means that the correlation coefficient is the geometric mean of the two regression coefficients of the given sample. The formula can be used to check the accuracy of the calculations.

When processing numerical material on calculating machines, detailed formulas for the regression coefficient can be used, for example

R_y/x = (Σxy − Σx·Σy/n) / (Σx² − (Σx)²/n).
For the regression coefficient its representativeness error can be calculated. The error of the regression coefficient equals the error of the correlation coefficient multiplied by the ratio of the standard deviations:

m_R = m_r·(S_x / S_y).

The reliability criterion of the regression coefficient is calculated by the usual formula

t_R = R / m_R;

as a result it equals the reliability criterion of the correlation coefficient:

t_R = t_r.

The reliability of the value t_R is established from Student's table at ν = n − 2, where n is the number of pairs of observations.

Curvilinear regression.

REGRESSION, CURVILINEAR. Any nonlinear regression in which the equation for changes of one variable (y) as a function of changes of another (x) is quadratic, cubic, or of higher order. Although it is always mathematically possible to obtain a regression equation that fits every "squiggle" of the curve, most of these perturbations arise from sampling or measurement errors, and such a "perfect" fit achieves nothing. It is not always easy to determine whether a curvilinear regression fits a data set, although statistical tests exist for determining whether each higher power of the equation significantly increases the degree of fit of that data set.

Curve fitting is performed by the same least squares principle as straight-line fitting: the regression line must satisfy the condition of a minimum sum of squared distances to each point of the correlation field. Here y in equation (1) is the calculated value of the function, determined from the equation of the chosen curvilinear relationship from the actual values of x_j. For example, if a second-order parabola is chosen to approximate the connection, then

ŷ = a + b·x + c·x², (14)

and the difference between a point lying on the curve and a given point of the correlation field with the corresponding argument can be written, by analogy with equation (3), as

ε_j = y_j − (a + b·x_j + c·x_j²). (15)

In this case the sum of squared distances from each point of the correlation field to the new regression line for a second-order parabola has the form

S² = Σⱼ ε_j² = Σⱼ (y_j − a − b·x_j − c·x_j²)². (16)

From the minimum condition of this sum, the partial derivatives of S² with respect to a, b, and c are set equal to zero. Performing the necessary transformations, we obtain a system of three equations with three unknowns for determining a, b, and c:

Σy = m·a + b·Σx + c·Σx²,
Σyx = a·Σx + b·Σx² + c·Σx³,
Σyx² = a·Σx² + b·Σx³ + c·Σx⁴. (17)

Solving this system for a, b, and c, we find the numerical values of the regression coefficients. The quantities Σy, Σx, Σx², Σyx, Σyx², Σx³, Σx⁴ are found directly from the production measurement data.

The measure of the closeness of connection for a curvilinear dependence is the theoretical correlation ratio η_xy, the square root of the ratio of two dispersions: the mean square σ_p² of the deviations of the calculated values y′_j of the function, found from the regression equation, from the arithmetic mean ȳ, to the mean square σ_y² of the deviations of the actual values y_j of the function from ȳ:

η_xy = (σ_p² / σ_y²)^(1/2) = [ Σⱼ (y′_j − ȳ)² / Σⱼ (y_j − ȳ)² ]^(1/2). (18)

The square of the correlation ratio, η_xy², shows the share of the total variability of the dependent variable y due to the variability of the argument x; it is called the coefficient of determination. Unlike the correlation coefficient, the correlation ratio can take only positive values from 0 to 1: in the complete absence of connection it is zero, in the presence of a functional connection it is one, and in the presence of a regression connection of varying closeness it takes values between zero and one.

The choice of the type of curve is of great importance in regression analysis, since the accuracy of the approximation and the statistical estimates of the closeness of connection depend on the chosen type of relationship. The simplest method of choosing the type of curve is to construct correlation fields and select the appropriate type of regression equation from the location of the points on these fields. Regression-analysis methods make it possible to find numerical values of regression coefficients for complex types of relationships between parameters, described, for example, by polynomials of high degree. Often the shape of the curve can be determined from the physical nature of the process or phenomenon under consideration. It makes sense to use polynomials of high degree to describe rapidly changing processes when the ranges of fluctuation of their parameters are significant.
For studies of a metallurgical process it suffices to use lower-order curves, such as a second-order parabola. This curve can have one extremum, which, as practice has shown, is quite sufficient for describing various characteristics of a metallurgical process. The results of calculating the parameters of a paired correlation would be reliable and of practical value if the information used were obtained under conditions of wide ranges of argument variation with all other process parameters constant. Consequently, methods of studying the pairwise correlation of parameters can be used to solve practical problems only when there is confidence that there are no other serious influences on the function besides the analyzed argument. Under production conditions the process cannot be run this way for long. However, if information is available on the main parameters of the process that influence its results, then their influence can be excluded mathematically and the relationship between the function and the argument of interest isolated in "pure form". Such a connection is called partial, or individual; the multiple regression method is used to determine it.
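A sketch putting equations (14), (17), and (18) together on simulated data (values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 4.0, 150)
y = 1.0 + 2.5 * x - 0.4 * x**2 + rng.normal(0.0, 0.3, 150)

# Second-order parabola (14) fitted via the normal equations (17).
X = np.column_stack([np.ones_like(x), x, x**2])
a, b, c = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = a + b * x + c * x**2

# Theoretical correlation ratio (18) and the coefficient of determination.
eta = np.sqrt(np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2))
print(f"eta = {eta:.3f}, eta^2 = {eta**2:.3f}")
```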

Correlation relationship.

The correlation ratio and the correlation index are numerical characteristics closely related to the concept of a random variable, or more precisely to a system of random variables. Therefore, to introduce them and define their meaning and role, it is necessary to explain the concept of a system of random variables and some properties inherent in such systems.

Two or more random variables that describe a certain phenomenon are called a system or complex of random variables.

A system of several random variables X, Y, Z, …, W is usually denoted by (X, Y, Z, …, W).

For example, a point in the plane is described not by one coordinate but by two, and in space even by three.

The properties of a system of several random variables are not limited to the properties of the individual random variables in the system: they also include the mutual connections (dependencies) between the random variables. Therefore, when studying a system of random variables, attention should be paid to the nature and degree of dependence. This dependence may be more or less pronounced, more or less close; in other cases the random variables turn out to be practically independent.

A random variable Y is said to be independent of a random variable X if the distribution law of the random variable Y does not depend on the value that X takes.

It should be noted that the dependence or independence of random variables is always mutual: if Y does not depend on X, then X does not depend on Y. With this in mind, we can give the following definition of the independence of random variables.

Random variables X and Y are called independent if the distribution law of each of them does not depend on what value the other takes. Otherwise, the quantities X and Y are called dependent.

The distribution law of a random variable is any relationship that establishes a connection between the possible values of a random variable and their corresponding probabilities.

The concept of "dependence" of random variables used in probability theory differs somewhat from the usual concept of "dependence" of variables used in mathematics. A mathematician means by "dependence" only one type: the complete, rigid, so-called functional dependence. Two quantities X and Y are functionally dependent if, knowing the value of one of them, one can determine the value of the other exactly.

In probability theory we encounter a somewhat different type of dependence: probabilistic dependence. If the value Y is related to the value X by a probabilistic dependence then, knowing the value of X, one cannot state the value of Y exactly, but one can state its distribution law, which depends on the value X has taken.

The probabilistic relationship may be more or less close; as its closeness increases, it approaches the functional one ever more nearly. Thus functional dependence can be considered the extreme, limiting case of the closest probabilistic dependence. The other extreme case is complete independence of the random variables. Between these two extremes lie all gradations of probabilistic dependence, from the strongest to the weakest.

Probabilistic dependence between random variables is often encountered in practice. If the random variables X and Y are in a probabilistic relationship, this does not mean that with a change in X the value of Y changes in a fully determinate way; it means only that as X changes, Y tends to change as well (to increase or decrease as X increases). This tendency is observed only in general outline, and in each individual case deviations from it are possible.


