Regression analysis theory. A) Graphical analysis of simple linear regression

A) Graphical analysis of simple linear regression.

Simple linear regression equation: y = a + bx. If there is a correlation between the random variables Y and X, then the value y = ý + ε,

where ý is the theoretical value of y obtained from the equation ý = f(x),

ε – the error (deviation) of the theoretical value ý from the actual (experimental) data.

The equation for the dependence of the average value ý on x, that is, ý = f(x), is called the regression equation. Regression analysis consists of four stages:

1) stating the problem and establishing the reasons for the relationship;

2) delimiting the object of study and collecting statistical information;

3) selecting the relationship equation based on an analysis of the nature of the collected data;

4) calculating the numerical values of the characteristics of the correlation relationship.

If two variables are related in such a way that a change in one corresponds to a systematic change in the other, then regression analysis is used to estimate and select the equation relating them, provided both variables are observed. Unlike regression analysis, correlation analysis is used to assess the closeness of the relationship between X and Y.

Let's consider finding a straight line in regression analysis:

Theoretical regression equation.

The term "simple regression" indicates that the value of one variable is estimated based on knowledge about another variable. Unlike simple multivariate regression, it is used to estimate a variable based on knowledge of two, three or more variables. Let's look at the graphical analysis of simple linear regression.

Suppose we have the results of pre-employment selection tests and the corresponding labor productivity:

x – selection test score (out of 100 points); y – labor productivity (out of 20 points).

By plotting the points on a graph, we obtain a scatter diagram (field). We use it to analyze the results of selection tests and labor productivity.

Using the scatterplot, let's analyze the regression line. In regression analysis, at least two variables are always specified, and a systematic change in one variable is associated with a change in the other. The main goal of regression analysis is to estimate the value of one variable when the value of the other is known. For our problem, what matters is the estimate of labor productivity.

The independent variable in regression analysis is the quantity used as the basis for analyzing the other variable. In this case, it is the selection test score (plotted along the X axis).

The dependent variable is the quantity being estimated (plotted along the Y axis). In regression analysis there can be only one dependent variable, but more than one independent variable.

For simple regression analysis, the dependence can be represented in a two-coordinate system (x and y), with the X axis carrying the independent variable and the Y axis the dependent variable. We plot a point for each pair of values. The resulting graph is called a scatterplot. Its construction is the second stage of regression analysis; the first is the choice of the values to be analyzed and the collection of sample data. Thus, regression analysis is used for statistical analysis. In our chart the relationship between the sample data is linear.

To estimate the value of the variable y from the variable x, we must determine the position of the line that best represents the relationship between x and y, based on the location of the points on the scatterplot; in our example, this is the analysis of productivity. The line drawn through the scatter of points is the regression line. One way to construct a regression line, based on visual judgment, is the freehand method. Our regression line can then be used to estimate labor productivity.

To find the equation of the regression line, the least squares criterion is most often used: the most suitable line is the one for which the sum of squared deviations is minimal.

The mathematical equation of a straight growth line expresses the law of growth in arithmetic progression:

y_t = a + bx.

The equation with a single parameter, Y = a, is the simplest form of relationship equation; it is acceptable for average values. To express the relationship between X and Y more accurately, an additional proportionality coefficient b is introduced, which indicates the slope of the regression line, giving Y = a + bX.

B) Construction of a theoretical regression line.

The process of finding it consists in choosing and justifying the type of curve and calculating the parameters a, b, c, etc. The construction process is called fitting, and the set of curves offered by mathematical analysis is varied. Most often in economic problems a family of curves is used whose equations are polynomials of positive integer powers.

1) ý = a + bx – equation of a straight line,

2) ý = a + b/x – equation of a hyperbola,

3) ý = a + bx + cx² – equation of a parabola,

where ý are the ordinates of the theoretical regression line.

Having chosen the type of equation, we need to find the parameters on which it depends. For example, suppose the arrangement of the points in the scatter field showed that the theoretical regression line is a straight line.

A scatterplot allows you to represent labor productivity using regression analysis. In economics, regression analysis is used to predict many characteristics that affect the final product (taking into account pricing).

C) The least squares criterion for finding the regression line.

One criterion we might apply for a suitable regression line in a scatterplot is based on choosing the line for which the sum of squared errors is minimal.

The proximity of the scatter points to the line is measured by the ordinates of the segments between them. The deviations of these points can be positive or negative, but the sum of the squares of the deviations of the theoretical line from the experimental data is always positive and should be minimal. The fact that the scatter points do not all lie on the regression line indicates a discrepancy between the experimental and theoretical data. Thus, no other regression line, apart from the one found, can give a smaller sum of squared deviations between the experimental and theoretical data. Therefore, having found the theoretical equation ý and the regression line, we satisfy the least squares requirement.

This is done using the relationship equation ý = a + bx and formulas for finding the parameters a and b. Taking the theoretical value ý = a + bx and denoting the sum of squared deviations by f, we obtain the function

f(a, b) = Σ(y − a − bx)²

of the unknown parameters a and b. The values of a and b that minimize the function f are found from the partial derivative equations ∂f/∂a = 0 and ∂f/∂b = 0. This is a necessary condition; for a positive quadratic function it is also a sufficient condition for finding a and b.

Deriving the parameter formulas from the partial derivative equations for a and b,

∂f/∂a = −2Σ(y − a − bx) = 0,

∂f/∂b = −2Σx(y − a − bx) = 0,

we obtain the system of equations:

Σy = na + bΣx,

Σxy = aΣx + bΣx²,

where n is the number of sample points and x̄, ȳ are the arithmetic means of x and y (dividing the first equation by n gives ȳ = a + bx̄).

Substituting the numerical values, we find the parameters a and b.
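As an illustration, here is a minimal numerical sketch of these formulas in Python; the data set, variable names and numbers are hypothetical, not taken from the worked example above.

```python
import numpy as np

# Hypothetical sample data (x = selection test score, y = productivity)
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([9.2, 14.1, 18.9, 23.8, 28.6])

n = len(x)
# Normal equations:
#   sum(y)   = n*a      + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x^2)
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)

print(f"a = {a:.4f}, b = {b:.4f}")   # parameters of the line y^ = a + b*x
print("fitted values:", a + b * x)
```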

There is also the quantity e = (1/n)·Σ|y − ý| / y · 100%, the approximation coefficient (mean approximation error).

If e < 33%, the model is acceptable for further analysis;

If e > 33%, we take a hyperbola, parabola, etc. This gives grounds for analysis in various situations.

Conclusion: according to the approximation-coefficient criterion, the most suitable line is the one for which Σ(y − ý)² is minimal, and no other regression line for our problem gives a smaller deviation.
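A short sketch of this approximation-coefficient check, again with hypothetical observed and fitted values; the 33% threshold is the one stated above.

```python
import numpy as np

def approximation_error(y, y_hat):
    """Mean relative approximation error, in percent."""
    return 100.0 * np.mean(np.abs(y - y_hat) / y)

y     = np.array([9.0, 14.5, 18.0, 24.2, 28.0])   # observed values
y_hat = np.array([9.2, 14.1, 18.9, 23.8, 28.6])   # fitted values from y^ = a + b*x

e = approximation_error(y, y_hat)
if e < 33.0:
    print(f"e = {e:.1f}% -> the linear model is acceptable")
else:
    print(f"e = {e:.1f}% -> try a hyperbola, parabola, etc.")
```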

D) Mean square error of the estimates and checking their typicality.

For a population in which the number of observations is small (n < 30), Student's t-test is used to check the typicality (significance) of the regression equation parameters. The actual values of the t-criterion are calculated as

t_a = a / m_a, t_b = b / m_b,

where m_a and m_b are the standard errors of the parameters, computed from the residual root-mean-square error. The obtained t_a and t_b are compared with the critical value t_k from Student's table for the accepted significance level (α = 0.01, i.e. 99%, or α = 0.05, i.e. 95%) and the degrees of freedom. Here k₁ = m is the number of parameters of the equation under study (for example, for y = a + bx, m = 2), and k₂ = n − (m + 1), where n is the number of studied observations.

t_a < t_k < t_b.
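A sketch of this typicality check, assuming the standard error formulas of a simple linear fit and a hypothetical data set; scipy's Student distribution supplies the critical value t_k.

```python
import numpy as np
from scipy import stats

x = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0])
y = np.array([9.2, 14.1, 18.9, 23.8, 28.6, 33.1, 38.2, 42.7])
n = len(x)

# Fit y^ = a + b*x by least squares
b = ((x * y).mean() - x.mean() * y.mean()) / ((x**2).mean() - x.mean()**2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

# Residual root-mean-square error and standard errors of a and b
s = np.sqrt((resid**2).sum() / (n - 2))
sxx = ((x - x.mean())**2).sum()
m_b = s / np.sqrt(sxx)
m_a = s * np.sqrt((x**2).mean() / sxx)

t_a, t_b = a / m_a, b / m_b
t_k = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # critical value for alpha = 0.05

print(f"t_a = {t_a:.2f}, t_b = {t_b:.2f}, t_k = {t_k:.2f}")
print("both parameters typical" if min(abs(t_a), abs(t_b)) > t_k else "not typical")
```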

Conclusion: using the regression equation parameters that have been tested for typicality, a mathematical model of the relationship ý = f(x) is built. In this case, the parameters of the mathematical function used in the analysis (linear, hyperbola, parabola) receive specific quantitative values. The meaning of the models obtained in this way is that they characterize the average value of the resulting characteristic ý as a function of the factor characteristic X.

E) Curvilinear regression.

Quite often a curvilinear relationship occurs, in which the relationship between the variables changes as they change: the intensity of the increase (or decrease) depends on the level of X. There are different types of curvilinear dependence. For example, consider the relationship between crop yield and precipitation. With an increase in precipitation, under otherwise equal natural conditions, there is an intensive increase in yield, but only up to a certain limit. Beyond the critical point, precipitation becomes excessive and yields drop sharply. The example shows that the relationship was first positive and then negative. The critical point is the optimal level of attribute X at which attribute Y reaches its maximum or minimum value.

In economics, such a relationship is observed between price and consumption, productivity and experience.

Parabolic dependence.

If the data show that an increase in the factor characteristic is accompanied by an increase in the resulting characteristic, a second-order equation (a parabola) can be taken as the regression equation:

ý = a + bx + cx².

The coefficients a, b, c are found from the partial derivative equations ∂f/∂a = 0, ∂f/∂b = 0, ∂f/∂c = 0. We obtain the system of equations:

Σy = na + bΣx + cΣx²,

Σxy = aΣx + bΣx² + cΣx³,

Σx²y = aΣx² + bΣx³ + cΣx⁴.
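A sketch of solving this system numerically for hypothetical data; numpy.polyfit gives the same parabola coefficients directly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 5.2, 9.8, 17.1, 26.2, 37.5])   # hypothetical, roughly quadratic

# Normal equations for y^ = a + b*x + c*x^2
S = np.array([
    [len(x),       x.sum(),      (x**2).sum()],
    [x.sum(),      (x**2).sum(), (x**3).sum()],
    [(x**2).sum(), (x**3).sum(), (x**4).sum()],
])
rhs = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])
a, b, c = np.linalg.solve(S, rhs)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")

# Equivalent shortcut: np.polyfit returns [c, b, a] (highest power first)
print(np.polyfit(x, y, 2))
```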

Types of curvilinear equations:

ý = a + bx + cx² (parabola),

ý = a + b/x (hyperbola).

We have grounds to assume a curvilinear relationship between labor productivity and selection test scores. This means that as the scores increase, at some level productivity will begin to decrease, so the straight-line model may turn out to be curvilinear.

The third model will be a hyperbola, and in all its equations the variable x is replaced by the expression 1/x.
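A minimal sketch of fitting the hyperbola by the substitution z = 1/x, with illustrative data.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0, 10.0])
y = np.array([10.2, 6.1, 4.0, 3.6, 3.0, 2.8])    # hypothetical, decaying values

z = 1.0 / x                     # replace x with 1/x
b, a = np.polyfit(z, y, 1)      # fit y^ = a + b*z, i.e. y^ = a + b/x
print(f"y^ = {a:.3f} + {b:.3f}/x")
```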

SUMMARY OUTPUT

Table 8.3a. Regression statistics

Regression statistics
Multiple R           0.998364
R-square             0.99673
Adjusted R-square    0.996321
Standard error       0.42405
Observations         10

First, let's look at the top part of the calculations, presented in table 8.3a - regression statistics.

The R-square value, also called the measure of certainty (coefficient of determination), characterizes the quality of the resulting regression line, that is, the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within the interval [0, 1].

In most cases the R-square value falls between these extreme values, i.e. between zero and one.

If the R-squared value is close to one, this means that the constructed model explains almost all the variability in the relevant variables. Conversely, an R-squared value close to zero means the quality of the constructed model is poor.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination and takes values in the range from zero to one.

In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, the multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients*

               Coefficients    Standard error   t-statistic
Y-intercept    2.694545455     0.33176878       8.121757129
Variable X 1   2.305454545     0.04668634       49.38177965

* A truncated version of the calculations is given

Now consider the middle part of the calculations, presented in table 8.3b. Here the regression coefficient b (2.305454545) and the intercept on the ordinate axis, i.e. the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545*x + 2.694545455

The direction of the relationship between the variables is determined by the sign (negative or positive) of the regression coefficient b.

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable is positive. In our case the sign of the regression coefficient is positive, so the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the residual output. For these results to appear in the report, the "Residuals" checkbox must be activated when running the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation   Predicted Y    Residuals      Standard residuals
1             9.610909091    -0.610909091   -1.528044662
2             7.305454545    -0.305454545   -0.764022331
3             11.91636364     0.083636364    0.209196591
4             14.22181818     0.778181818    1.946437843
5             16.52727273     0.472727273    1.182415512
6             18.83272727     0.167272727    0.418393181
7             21.13818182    -0.138181818   -0.34562915
8             23.44363636    -0.043636364   -0.109146047
9             25.74909091    -0.149090909   -0.372915662
10            28.05454545    -0.254545455   -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The residual with the largest absolute value belongs to observation 4 (0.778181818).
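For reference, the same kinds of quantities as in tables 8.3a-8.3c (coefficients, R-square, standard error, predicted values, residuals and spreadsheet-style standard residuals) can be computed directly; the data below are hypothetical, not the data behind the tables.

```python
import numpy as np

# Hypothetical (x, y) sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([4.8, 7.1, 9.0, 11.4, 13.2, 15.9, 17.8, 20.1, 22.3, 24.2])
n = len(x)

b, a = np.polyfit(x, y, 1)                  # slope and intercept
y_hat = a + b * x
resid = y - y_hat

sse = (resid**2).sum()
sst = ((y - y.mean())**2).sum()
r2 = 1 - sse / sst                          # "R-square"
std_err = np.sqrt(sse / (n - 2))            # "Standard error" of the regression
std_resid = resid / np.sqrt(sse / (n - 1))  # one common spreadsheet convention

print(f"a = {a:.6f}, b = {b:.6f}, R^2 = {r2:.6f}, s = {std_err:.5f}")
for i, (p, r, sr) in enumerate(zip(y_hat, resid, std_resid), start=1):
    print(f"{i:2d}  predicted={p:8.4f}  residual={r:8.4f}  standard={sr:8.4f}")
```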

In statistical modeling, regression analysis is a study used to evaluate the relationship between variables. This mathematical method includes many other methods for modeling and analyzing multiple variables where the focus is on the relationship between a dependent variable and one or more independent ones. More specifically, regression analysis helps us understand how the typical value of a dependent variable changes if one of the independent variables changes while the other independent variables remain fixed.

In all cases, the target estimate is a function of the independent variables and is called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression Analysis Problems

This statistical research method is widely used for forecasting, where it offers a significant advantage, but it can sometimes lead to illusory or false relationships, so it should be used with care; for example, correlation does not imply causation.

A large number of methods have been developed for regression analysis, such as linear and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows its function to lie within a specific set of functions, which can be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is usually unknown, regression analysis of the data often depends to some extent on assumptions about that process. These assumptions are sometimes testable if enough data is available. Regression models are often useful even when the assumptions are moderately violated, although they may not then perform at peak efficiency.

In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The continuous output variable case is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known least squares method. It was published by Legendre in 1805 and Gauss in 1809. Legendre and Gauss applied the method to the problem of determining from astronomical observations the orbits of bodies around the Sun (mostly comets, but later also newly discovered minor planets). Gauss published a further development of least squares theory in 1821, including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The idea was that the height of descendants from that of their ancestors tends to regress downwards towards the normal mean. For Galton, regression had only this biological meaning, but later his work was continued by Udney Yoley and Karl Pearson and brought into a more general statistical context. In the work of Yule and Pearson, the joint distribution of response and explanatory variables is assumed to be Gaussian. This assumption was rejected by Fischer in papers of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this regard, Fischer's proposal is closer to Gauss's formulation of 1821. Before 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

Regression analysis methods continue to be an area of ​​active research. In recent decades, new methods have been developed for robust regression; regressions involving correlated responses; regression methods that accommodate different types of missing data; nonparametric regression; Bayesian regression methods; regressions in which predictor variables are measured with error; regression with more predictors than observations, and cause-and-effect inference with regression.

Regression models

Regression analysis models include the following variables:

  • Unknown parameters, designated beta, which can be a scalar or a vector.
  • Independent Variables, X.
  • Dependent Variables, Y.

Different fields of science where regression analysis is used use different terms in place of dependent and independent variables, but in all cases the regression model relates Y to a function of X and β.

The approximation is usually written as E(Y | X) = f(X, β). To carry out regression analysis, the form of the function f must be specified. Less commonly, it is based on knowledge about the relationship between Y and X that does not rely on the data. If such knowledge is not available, a flexible or convenient form of f is chosen.

Dependent variable Y

Let us now assume that the vector of unknown parameters β has length k. To perform regression analysis, the user must provide information about the dependent variable Y:

  • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out, since the system of equations defining the regression model is underdetermined: there is not enough data to recover β.
  • If exactly N = k points are observed and the function f is linear, then the equation Y = f(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the X are linearly independent. If f is nonlinear, there may be no solution, or many solutions may exist.
  • The most common situation is that N > k data points are observed. In this case there is enough information in the data to estimate a unique value of β that best fits the data, and the regression model applied to the data can be viewed as an overdetermined system in β (a sketch of these cases is given after the following list).

In the latter case, regression analysis provides tools for:

  • Finding a solution for the unknown parameters β, which will, for example, minimize the distance between the measured and predicted value of Y.
  • Under certain statistical assumptions, regression analysis uses excess information to provide statistical information about the unknown parameters β and the predicted values ​​of the dependent variable Y.
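A brief sketch of the N = k and N > k cases mentioned above; the design matrix, true parameter vector and noise level are all illustrative.

```python
import numpy as np

np.random.seed(0)
beta_true = np.array([1.0, 2.0, -0.5])          # k = 3 unknown parameters

def make_data(n):
    """Random design matrix with an intercept column plus noisy responses."""
    X = np.column_stack([np.ones(n), np.random.rand(n, 2)])
    return X, X @ beta_true + 0.05 * np.random.randn(n)

# N = k: a square system, solvable exactly
X, y = make_data(3)
print("exact solution:", np.linalg.solve(X, y))

# N > k: overdetermined system, least squares estimate of beta
X, y = make_data(30)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("least squares estimate:", beta_hat)

# N < k: underdetermined -- infinitely many parameter vectors fit the data
```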

Required number of independent measurements

Consider a regression model with three unknown parameters: β0, β1 and β2. Suppose the experimenter makes 10 measurements at one and the same value of the independent variable vector X. In this case, regression analysis does not produce a unique set of parameter values. The best one can do is to estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring at two different values of X, one can obtain enough data for a regression with two unknowns, but not with three or more.

If the experimenter's measurements were made at three different values ​​of the independent variable vector X, then the regression analysis will provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX be invertible.
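A small sketch of this requirement: with measurements at three distinct X values, the matrix XᵀX for a three-parameter model has full rank, while measurements at a single X value make it singular (the values used are illustrative).

```python
import numpy as np

x_values = np.array([2.0, 5.0, 9.0])                 # three distinct X values
x = np.repeat(x_values, 10)                          # 10 measurements at each
X = np.column_stack([np.ones(x.size), x, x**2])      # three parameters: b0, b1, b2

XtX = X.T @ X
print("rank(X^T X) =", np.linalg.matrix_rank(XtX))   # 3 -> invertible, unique estimates

# With measurements at only one X value the matrix becomes singular:
X1 = np.column_stack([np.ones(10), np.full(10, 5.0), np.full(10, 25.0)])
print("rank =", np.linalg.matrix_rank(X1.T @ X1))    # 1 -> no unique solution
```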

Statistical Assumptions

When the number of measurements N is greater than the number of unknown parameters k and the measurements contain errors ε_i, then, as a rule, the excess information contained in the measurements is used for statistical inferences about the unknown parameters. This excess of information is called the regression degrees of freedom.

Fundamental Assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which the inference or prediction is made.
  • The error term is a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are measured without errors.
  • The independent variables (predictors) are linearly independent, that is, no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, that is, the error covariance matrix is diagonal and each non-zero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, then weighted least squares or other methods can be used.

These are sufficient conditions for the least squares estimates to have the required properties; in particular, these assumptions mean that the parameter estimates will be unbiased, consistent and efficient within the class of linear estimators. It is important to note that real data rarely satisfy all these conditions; that is, the method is used even when the assumptions are not exactly correct. Variation from the assumptions can sometimes be used as a measure of how far the model is from being useful. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests on the sample data and an assessment of the usefulness of the model.

Additionally, variables sometimes refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one method for dealing with such data.

A feature of linear regression is that the dependent variable Yi is a linear combination of the parameters. For example, simple linear regression uses one independent variable, xi, and two parameters, β0 and β1, to model n points.

In multiple linear regression, there are multiple independent variables or functions of them.

When a random sample is taken from a population, its parameters allow one to obtain a sample linear regression model.

In this respect, the least squares method is the most popular. It provides parameter estimates that minimize the sum of squared residuals. This kind of minimization (which is characteristic of linear regression) leads to a set of normal equations, a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

Under the further assumption that the population error is normally distributed, a researcher can use these standard error estimates to create confidence intervals and conduct hypothesis tests about the parameters.

Nonlinear regression analysis

When the function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which define the differences between the linear and nonlinear least squares methods. Consequently, the results of regression analysis with a nonlinear method are sometimes unpredictable.

Calculation of power and sample size

There are no generally agreed methods relating the number of observations to the number of independent variables in the model. One rule of thumb, proposed by Good and Hardin, is N = m^n, where N is the sample size, n is the number of independent variables, and m is the number of observations needed to reach the desired accuracy if the model had only one independent variable. For example, a researcher builds a linear regression model using a data set containing 1000 patients (N). If the researcher decides that five observations are needed to accurately define a straight line (m), then the maximum number of independent variables the model can support is 4, since log(1000) / log(5) ≈ 4.29.
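The arithmetic of this rule of thumb, assuming the stated N = m^n relation holds:

```python
import math

N = 1000   # patients in the data set
m = 5      # observations judged necessary per independent variable

# N = m**n  =>  n = log(N) / log(m); round down to a whole number of predictors
max_predictors = math.floor(math.log(N) / math.log(m))
print(max_predictors)   # 4
```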

Other methods

Although regression model parameters are typically estimated using the least squares method, there are other methods that are used much less frequently. For example, these are the following methods:

  • Bayesian methods (for example, Bayesian linear regression).
  • Percentage regression, used for situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Nonparametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, where a meaningful distance metric is learned in a given input space.

Software

All major statistical software packages perform least squares regression analysis. Simple linear regression and multiple regression analysis can be carried out in some spreadsheet applications and on some calculators. Although many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in fields such as survey analysis and neuroimaging.

The concepts of correlation and regression are directly related. There are many common computational techniques in correlation and regression analysis. They are used to identify cause-and-effect relationships between phenomena and processes. However, while correlation analysis allows us to estimate the strength and direction of a stochastic relationship, regression analysis also allows us to determine the form of that dependence.

Regression can be:

a) depending on the number of phenomena (variables):

Simple (regression between two variables);

Multiple (regression between the dependent variable (y) and several explanatory variables (x1, x2, ..., xn));

b) depending on the form:

Linear (displayed by a linear function, and there are linear relationships between the variables being studied);

Nonlinear (displayed by a nonlinear function; the relationship between the variables being studied is nonlinear);

c) by the nature of the relationship between the variables included in the consideration:

Positive (an increase in the value of the explanatory variable leads to an increase in the value of the dependent variable and vice versa);

Negative (as the value of the explanatory variable increases, the value of the explained variable decreases);

d) by type:

Direct (in this case, the cause has a direct impact on the effect, i.e. the dependent and explanatory variables are directly related to each other);

Indirect (the explanatory variable has an indirect effect through a third or a number of other variables on the dependent variable);

False (nonsense regression) - can arise with a superficial and formal approach to the processes and phenomena being studied. An example of a nonsensical one is a regression establishing a connection between a decrease in the amount of alcohol consumed in our country and a decrease in the sale of washing powder.

When conducting regression analysis, the following main tasks are solved:

1. Determination of the form of dependence.

2. Definition of the regression function. To do this, a mathematical equation of one type or another is used, which allows, firstly, to establish the general trend of change in the dependent variable, and, secondly, to calculate the influence of the explanatory variable (or several variables) on the dependent variable.

3. Estimation of unknown values ​​of the dependent variable. The resulting mathematical relationship (regression equation) allows you to determine the value of the dependent variable both within the interval of specified values ​​of the explanatory variables and beyond it. In the latter case, regression analysis acts as a useful tool in predicting changes in socio-economic processes and phenomena (provided that existing trends and relationships are maintained). Typically, the length of the time period for which forecasting is carried out is selected to be no more than half the time interval over which observations of the initial indicators were carried out. It is possible to carry out both a passive forecast, solving the extrapolation problem, and an active one, reasoning according to the well-known “if..., then” scheme and substituting various values ​​into one or more explanatory regression variables.



To construct a regression, a special method is used: the least squares method. This method has advantages over other smoothing methods: relatively simple mathematical determination of the required parameters and a good theoretical justification from a probabilistic point of view.

When choosing a regression model, one of the essential requirements is maximum possible simplicity that still allows a solution of sufficient accuracy. Therefore, to establish statistical relationships, we first, as a rule, consider a model from the class of linear functions (the simplest of all possible classes of functions):

y_i = a + b1*x_i1 + b2*x_i2 + ... + bn*x_in + e_i,

where b1, b2, ..., bn are coefficients that determine the influence of the independent variables x_ij on the value y_i; a is the free term (intercept); e_i is a random deviation reflecting the influence of unaccounted factors on the dependent variable; n is the number of independent variables; N is the number of observations, and the condition N > n + 1 must be met.
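A minimal sketch of fitting such a linear model by least squares; the data are synthetic and the coefficient values illustrative.

```python
import numpy as np

np.random.seed(2)
N, n_vars = 50, 3                                   # N observations, n independent variables
X = np.random.rand(N, n_vars)
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(N)

# Design matrix with a column of ones for the free term a
A = np.column_stack([np.ones(N), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coeffs[0], coeffs[1:]
print("a =", round(a, 3), "b =", np.round(b, 3))
```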

A linear model can describe a very wide class of problems. However, in practice, particularly in socio-economic systems, linear models are sometimes difficult to use because of large approximation errors. Therefore, nonlinear multiple regression functions that can be linearized are often used. These include, for example, the production function (the Cobb-Douglas power function), which has found application in a variety of socio-economic studies. It has the form:

y_i = b0 * x_i1^b1 * x_i2^b2 * ... * x_in^bn * e_i,

where b0 is a normalization factor, b1, ..., bn are unknown coefficients, and e_i is a random deviation.

Taking natural logarithms, this equation can be transformed into linear form:

ln y_i = ln b0 + b1*ln x_i1 + ... + bn*ln x_in + ln e_i.

The resulting model allows the standard linear regression procedures described above to be used. By constructing models of both types (additive and multiplicative), we can select the better one and conduct further research with smaller approximation errors.
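A sketch of this linearization: generate data from a hypothetical power (Cobb-Douglas type) function and recover its coefficients by ordinary linear regression on the logarithms.

```python
import numpy as np

np.random.seed(3)
x1 = np.random.uniform(1.0, 10.0, 200)              # e.g. capital
x2 = np.random.uniform(1.0, 10.0, 200)              # e.g. labour
y = 2.0 * x1**0.4 * x2**0.6 * np.exp(0.05 * np.random.randn(200))

# ln y = ln b0 + b1*ln x1 + b2*ln x2 + ln e  -> an ordinary linear regression
A = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
(ln_b0, b1, b2), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(f"b0 = {np.exp(ln_b0):.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```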

There is a well-developed system for selecting approximating functions: the group method of data handling (GMDH, also known by its Russian abbreviation MGUA).

The correctness of the selected model can be judged from the results of a study of the residuals, the differences between the observed values y_i and the values ý_i predicted by the regression equation. To check the adequacy of the model, the average approximation error is calculated:

e = (1/N) * Σ |(y_i − ý_i) / y_i| * 100%.

The model is considered adequate if e does not exceed 15%.

We especially emphasize that in relation to socio-economic systems, the basic conditions for the adequacy of the classical regression model are not always met.

Without dwelling on all the reasons for inadequacy, we will name only multicollinearity, the most difficult problem for the effective application of regression analysis procedures in the study of statistical dependencies. Multicollinearity means that there is a linear relationship between the explanatory variables.

This phenomenon:

a) distorts the meaning of the regression coefficients when they are interpreted substantively;

b) reduces the accuracy of estimation (the variance of the estimates increases);

c) increases the sensitivity of the coefficient estimates to the sample data (increasing the sample size can strongly affect the estimates).

There are various techniques for reducing multicollinearity. The most accessible is to eliminate one of two variables if the correlation coefficient between them exceeds 0.8 in absolute value. Which variable to keep is decided on substantive grounds. The regression coefficients are then recalculated.
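A small sketch of this screening step: compute the pairwise correlation matrix of the predictors and flag pairs whose correlation exceeds 0.8 in absolute value (the data are synthetic, with x2 deliberately made almost collinear with x1).

```python
import numpy as np

np.random.seed(4)
x1 = np.random.randn(100)
x2 = 0.95 * x1 + 0.1 * np.random.randn(100)         # nearly collinear with x1
x3 = np.random.randn(100)
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)                    # pairwise correlations of the columns
threshold = 0.8
for i in range(R.shape[0]):
    for j in range(i + 1, R.shape[1]):
        if abs(R[i, j]) > threshold:
            print(f"x{i+1} and x{j+1}: r = {R[i, j]:.2f} -> drop one of them")
```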

Using a stepwise regression algorithm allows you to sequentially include one independent variable into the model and analyze the significance of regression coefficients and multicollinearity of variables. Finally, only those variables remain in the relationship under study that provide the necessary significance of the regression coefficients and minimal influence of multicollinearity.

Regression analysis is a method of modeling measured data and studying their properties. The data consists of pairs of values ​​of the dependent variable (response variable) and the independent variable (explanatory variable). A regression model is a function of the independent variable and parameters with an added random variable.

Correlation analysis and regression analysis are related sections of mathematical statistics, and are intended to study the statistical dependence of a number of quantities using sample data; some of which are random. With statistical dependence, the quantities are not functionally related, but are defined as random variables by a joint probability distribution.

The study of the dependence of random variables leads to regression models and regression analysis based on sample data. Probability theory and mathematical statistics represent only a tool for studying statistical dependence, but do not aim to establish a causal relationship. Ideas and hypotheses about a causal relationship must be brought from some other theory that allows a meaningful explanation of the phenomenon being studied.

Numerical data usually has explicit (known) or implicit (hidden) relationships with each other.

The indicators that are obtained by direct calculation methods, i.e., calculated using previously known formulas, are clearly related. For example, percentages of plan completion, levels, specific weights, deviations in the amount, deviations in percentages, growth rates, growth rates, indices, etc.

Connections of the second type (implicit) are unknown in advance. However, it is necessary to be able to explain and predict (forecast) complex phenomena in order to manage them. Therefore, specialists, with the help of observations, strive to identify hidden dependencies and express them in the form of formulas, that is, to mathematically model phenomena or processes. One such opportunity is provided by correlation-regression analysis.

Mathematical models are built and used for three general purposes:

  • for explanation;
  • for prediction;
  • for management.

Using the methods of correlation and regression analysis, analysts measure the closeness of connections between indicators using the correlation coefficient. In this case, connections are discovered that are different in strength (strong, weak, moderate, etc.) and different in direction (direct, reverse). If the connections turn out to be significant, then it would be advisable to find their mathematical expression in the form of a regression model and evaluate the statistical significance of the model.

Regression analysis is called the main method of modern mathematical statistics for identifying implicit and veiled connections between observational data.

The problem statement of regression analysis is formulated as follows.

There is a set of observational results. In this set, one column corresponds to the indicator for which it is necessary to establish a functional relationship with the parameters of the object and environment represented by the remaining columns. Required: to establish a quantitative relationship between the indicator and the factors. In this case, the problem of regression analysis is understood as the task of identifying a functional dependence y = f(x2, x3, ..., xm) that best describes the available experimental data.

Assumptions:

the number of observations is sufficient to demonstrate statistical patterns regarding factors and their relationships;

the processed data contains some errors (noise) due to measurement errors and the influence of unaccounted random factors;

the matrix of observation results is the only information about the object being studied that is available before the start of the study.

The function f(x2, x3, ..., xm), which describes the dependence of the indicator on the parameters, is called the regression equation (function). The term "regression" (from the Latin regressio, retreat, return to something) is connected with the specifics of one of the particular problems solved at the stage when the method was being developed.

It is advisable to split the solution to the problem of regression analysis into several stages:

data pre-processing;

choosing the type of regression equations;

calculation of regression equation coefficients;

checking the adequacy of the constructed function to the observation results.

Pre-processing includes standardizing the data matrix, calculating correlation coefficients, checking their significance and excluding insignificant parameters from consideration.

Choosing the type of regression equation. The task of determining the functional relationship that best describes the data involves overcoming a number of fundamental difficulties. In the general case, for standardized data, the functional dependence of the indicator on the parameters can be represented as

y = f (x1, x2, …, xm) + e

where f is a previously unknown function to be determined;

e - data approximation error.

This equation is usually called the sample regression equation. This equation characterizes the relationship between the variation of the indicator and the variations of the factors. And the correlation measure measures the proportion of variation in an indicator that is associated with variation in factors. In other words, the correlation between an indicator and factors cannot be interpreted as a connection between their levels, and regression analysis does not explain the role of factors in creating an indicator.

Another feature concerns the assessment of the degree of influence of each factor on the indicator. The regression equation does not provide an assessment of the separate influence of each factor on the indicator; such an assessment is possible only in the case when all other factors are not related to the one being studied. If the factor being studied is related to others that influence the indicator, then a mixed characteristic of the factor’s influence will be obtained. This characteristic contains both the direct influence of the factor and the indirect influence exerted through the connection with other factors and their influence on the indicator.

It is not recommended to include factors that are weakly related to the indicator, but are closely related to other factors, in the regression equation. Factors that are functionally related to each other are not included in the equation (for them the correlation coefficient is 1). The inclusion of such factors leads to degeneration of the system of equations for estimating regression coefficients and to the uncertainty of the solution.

The function f must be selected so that the error e is in some sense minimal. In order to select a functional connection, a hypothesis is put forward in advance about which class the function f may belong to, and then the “best” function in this class is selected. The selected class of functions must have some “smoothness”, i.e. "small" changes in argument values ​​should cause "small" changes in function values.

A special case widely used in practice is a first-degree polynomial, the linear regression equation:

y = a0 + a1*x1 + a2*x2 + ... + am*xm + e.

To select the type of functional dependence, the following approach can be recommended:

points with indicator values ​​are graphically displayed in the parameter space. With a large number of parameters, it is possible to construct points for each of them, obtaining two-dimensional distributions of values;

based on the location of the points and based on an analysis of the essence of the relationship between the indicator and the parameters of the object, a conclusion is made about the approximate type of regression or its possible options;

After calculating the parameters, the quality of the approximation is assessed, i.e. evaluate the degree of similarity between calculated and actual values;

if the calculated and actual values ​​are close throughout the entire task area, then the problem of regression analysis can be considered solved. Otherwise, you can try to choose a different type of polynomial or another analytical function, such as a periodic one.

Calculating Regression Equation Coefficients

It is impossible to solve such a system of equations unambiguously from the available data, since the number of unknowns is always greater than the number of equations. To overcome this problem, additional assumptions are needed. Common sense suggests choosing the polynomial coefficients so as to minimize the error of the data approximation. Various measures can be used to evaluate the approximation error; the root-mean-square error is widely used. On its basis, a special method for estimating the coefficients of regression equations has been developed: the least squares method (LSM). This method gives maximum likelihood estimates of the unknown coefficients of the regression equation under the assumption of normally distributed errors, but it can also be used for any other distribution of the factors.

The least squares method is based on the following assumptions:

the values ​​of the errors and factors are independent, and therefore uncorrelated, i.e. it is assumed that the mechanisms for generating interference are not related to the mechanism for generating factor values;

the mathematical expectation of the error e must be equal to zero (the constant component is included in the coefficient a0), in other words, the error is a centered quantity;

the sample estimate of error variance should be minimal.

If the linear model is inaccurate or the parameters are measured inaccurately, then in this case the least squares method allows us to find such values ​​of the coefficients at which the linear model best describes the real object in the sense of the selected standard deviation criterion.

The quality of the resulting regression equation is assessed by the degree of closeness between the results of observations of the indicator and the values ​​​​predicted by the regression equation at given points in the parameter space. If the results are close, then the problem of regression analysis can be considered solved. Otherwise, you should change the regression equation and repeat the calculations to estimate the parameters.

If there are several indicators, the problem of regression analysis is solved independently for each of them.

Analyzing the essence of the regression equation, the following points should be noted. The considered approach does not provide separate (independent) assessment of coefficients - a change in the value of one coefficient entails a change in the values ​​of others. The obtained coefficients should not be considered as the contribution of the corresponding parameter to the value of the indicator. The regression equation is just a good analytical description of the available data, and not a law describing the relationship between the parameters and the indicator. This equation is used to calculate the values ​​of the indicator in a given range of parameter changes. It is of limited suitability for calculations outside this range, i.e. it can be used to solve interpolation problems and, to a limited extent, extrapolation.

The main reason for the inaccuracy of the forecast is not so much the uncertainty of extrapolation of the regression line, but rather the significant variation of the indicator due to factors not taken into account in the model. The limitation of the forecasting ability is the condition of stability of parameters not taken into account in the model and the nature of the influence of the model factors taken into account. If the external environment changes sharply, then the compiled regression equation will lose its meaning.

The forecast obtained by substituting the expected value of the parameter into the regression equation is a point one. The likelihood of such a forecast being realized is negligible. It is advisable to determine the confidence interval of the forecast. For individual values ​​of the indicator, the interval should take into account errors in the position of the regression line and deviations of individual values ​​from this line.


