Constructing a correlation field from the data in the table. Correlation and regression data analysis

A visual representation of the correlation table is the correlation field: a graph in which the values of X are plotted on the abscissa axis, the values of Y on the ordinate axis, and the combinations of X and Y are shown as points. From the location of the points one can judge the presence of a connection.

Using the graphical method.

This method is used to depict visually the form of the connection between the economic indicators under study. To do this, a graph is drawn in a rectangular coordinate system: the individual values of the resultant characteristic Y are plotted along the ordinate axis, and the individual values of the factor characteristic X along the abscissa axis.

The set of points of the resultant and factor characteristics is called the correlation field.

Based on the correlation field, one can hypothesize (for the population) that the relationship between all possible values of X and Y is linear.

The linear regression equation is y = bx + a + ε

Here ε is a random error (deviation, disturbance).
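As a hedged illustration of how a and b can be estimated from data, here is a minimal Python sketch using ordinary least squares via NumPy; the data arrays are invented purely to show the mechanics.

```python
# Minimal sketch: estimating a and b in y = b*x + a + eps by least squares.
# The data below are hypothetical, purely to show the mechanics.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

b, a = np.polyfit(x, y, deg=1)   # slope b, intercept a
residuals = y - (b * x + a)      # sample estimates of the random error eps

print(f"y = {b:.3f}x + {a:.3f}")
print("residuals:", np.round(residuals, 3))
```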

Reasons for the existence of a random error:

1. Failure to include significant explanatory variables in the regression model.

2. Aggregation of variables. For example, the total consumption function is an attempt to express in general form the aggregate of individual spending decisions; it is only an approximation of individual relationships that have different parameters.

3. Incorrect description of the model structure.

4. Incorrect functional specification.

21. Correlation and regression analysis.

Correlation-regression analysis as a general concept includes measuring the closeness and direction of the connection (correlation analysis) and establishing an analytical expression (form) of the connection (regression analysis).

The purpose of regression analysis is to assess the functional dependence of the conditional average value of the resultant characteristic (Y) on the factor characteristics (x1, x2, …, xn).

The regression equation, or statistical model of the relationship between socio-economic phenomena, is expressed by the function:

Yx = f(x1, x2, …, xn),

where “n” is the number of factors included in the model;

Xi – the factors influencing the result Y.

Stages of correlation and regression analysis:

1. Preliminary (a priori) analysis. It gives good results when carried out by a sufficiently qualified researcher.

2. Collection of information and its primary processing.

3. Building the model (the regression equation). As a rule, this procedure is performed on a PC using standard programs.

4. Assessing the closeness of the relationship between the features, evaluating the regression equation and analyzing the model.

5. Forecasting the development of the analyzed system using the regression equation.

At the first stage, the research problem is formulated, the methodology for measuring indicators or collecting information is determined, the number of factors is determined, and duplicate factors or those linked into a rigidly determined system are eliminated.

At the second stage, the volume of the data is analyzed: the population must be sufficiently large in the number of units and observations (N >> 50), and the number of factors "n" must correspond to the number of observations "N". The data must be quantitatively and qualitatively homogeneous.

At the third stage, the form of the connection and the type of analytical function (parabola, hyperbola, straight line) are determined and its parameters are found.

At the fourth stage, the reliability of all characteristics of the correlation relationship and of the regression equation is assessed using the Fisher (F) or Student (t) criterion, and an economic and technological analysis of the parameters is performed.

At the fifth stage, possible values of the result are predicted from the best values of the factor characteristics included in the model; here the best and worst values of the factors and of the result are identified.
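To make the stages concrete, here is a hedged sketch of stages 3 to 5 for the simple one-factor case, using scipy.stats.linregress; the data are invented, and the SciPy calls are a standard choice rather than anything prescribed by the text.

```python
# Sketch of stages 3-5: fit the regression, assess its reliability, forecast.
import numpy as np
from scipy import stats

x = np.array([10, 12, 15, 17, 20, 22, 25, 28, 30, 33], dtype=float)
y = np.array([25, 29, 34, 38, 46, 50, 55, 63, 66, 72], dtype=float)

res = stats.linregress(x, y)          # stage 3: build y = a + b*x
print(f"y = {res.intercept:.2f} + {res.slope:.2f}x")

# stage 4: closeness of the relationship and reliability (Student's criterion
# underlies the reported p-value)
print(f"r = {res.rvalue:.3f}, p = {res.pvalue:.4f}")

# stage 5: forecast the result for a chosen factor value
x_new = 35.0
print("forecast:", round(res.intercept + res.slope * x_new, 2))
```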

22. Types of regression equations.

To describe relationships between economic variables quantitatively, statistics uses the methods of regression and correlation.

Regression is a quantity that expresses the dependence of the average value of a random variable y on the values ​​of a random variable x.

The regression equation expresses the average value of one characteristic as a function of another.

The regression function is a model of the form y = f(x), where y is the dependent variable (resultant characteristic) and x is the independent, or explanatory, variable (factor characteristic).

Regression line - graph of the function y = f (x).

There are two types of relationships between x and y:

1) it may be unknown which of the two variables is independent and which is dependent; the variables are then on an equal footing, and the relationship is of the correlation type;

2) if x and y are not on an equal footing and one of them is considered the explanatory (independent) variable while the other is the dependent variable, then the relationship is of the regression type.

Types of regressions (a short fitting sketch follows the list):

1) hyperbolic: regression in the form of an equilateral hyperbola, y = a + b/x + ε;

2) linear: the regression most used in statistics because of the clear economic interpretation of its parameters, y = a + b·x + ε;

3) logarithmically linear: regression of the form ln y = ln a + b·ln x + ln ε;

4) multiple: regression between the variable y and x1, x2, …, xm, i.e. a model of the form y = f(x1, x2, …, xm) + ε, where y is the dependent variable (resultant characteristic), x1, x2, …, xm are the independent, explanatory variables (factor characteristics), and ε is the disturbance, a stochastic variable that includes the influence of factors unaccounted for in the model;

5) nonlinear: regression that is nonlinear with respect to the explanatory variables included in the analysis but linear with respect to the estimated parameters, or regression that is nonlinear in the estimated parameters;

6) inverse: regression reduced to linear form, implemented in standard application packages, of the form y = 1/(a + b·x + ε);

7) paired: regression between two variables y and x, i.e. a model of the form y = f(x) + ε, where y is the dependent variable (resultant characteristic), x is the independent, explanatory variable (factor characteristic), and ε is the disturbance, a stochastic variable that includes the influence of factors unaccounted for in the model.
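As promised above, here is a hedged sketch showing how two of the listed forms can be estimated by linearizing transformations; the data are hypothetical, chosen only to make the fits run.

```python
# Hyperbolic model y = a + b/x: regress y on 1/x.
# Log-linear model ln y = ln a + b*ln x: regress ln y on ln x.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0, 10.0])
y = np.array([9.8, 6.1, 4.0, 3.6, 3.1, 2.9])

b_h, a_h = np.polyfit(1.0 / x, y, deg=1)
print(f"hyperbolic: y = {a_h:.2f} + {b_h:.2f}/x")

b_l, ln_a = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"log-linear: y = {np.exp(ln_a):.2f} * x^{b_l:.2f}")
```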

23. Dynamic series and their types

A time series always consists of two elements: 1) the point in time or the time period to which the statistical data refer; 2) a statistical indicator, called the level of the time series.

Depending on the content of the time indicator, dynamics series can be moment or interval series.

Depending on the type of statistical indicator, time series are divided into series of absolute, relative and average values.

Absolute series show the actual values of the indicator.

Relative series show changes in the share of the indicator in the total population.

Series of average values reflect the change over time of an indicator that represents the average level of the phenomenon.

24. Indicators of a series of dynamics. The average level of the dynamics series.

Indicators: 1) the average level of the dynamic series; 2) absolute growth, chain and basic, and the average absolute growth; 3) growth rates and increment rates, chain and basic, and the average rates of growth and increment; 4) the absolute value of a 1% increase.

Average indicators of dynamics

These are generalized characteristics of a series of dynamics; with their help the intensity of development of a phenomenon is compared across different objects, for example by country, industry or enterprise.

The average level of the series is calculated from its levels u_i. The method of calculation depends on the type of series (moment or interval) and on whether the intervals are equal or unequal. For an interval series of absolute or average values with equal time intervals, the simple arithmetic mean is used. If the time intervals are unequal, the average level is found as the weighted arithmetic mean:

u_avg = Σ(u_i · t_i) / Σ t_i,

where t_i is the length of the i-th interval.
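A small sketch of both cases, with invented levels and interval lengths:

```python
# Average level of an interval series: simple mean for equal intervals,
# weighted mean u_avg = sum(u_i * t_i) / sum(t_i) for unequal ones.
import numpy as np

u = np.array([120.0, 132.0, 141.0, 150.0])   # series levels (hypothetical)
t = np.array([1, 1, 2, 3])                   # interval lengths (hypothetical)

print("equal intervals:  ", u.mean())
print("unequal intervals:", np.sum(u * t) / np.sum(t))
```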

25. Absolute increase (Δu) is the difference between two levels of a dynamic series; it shows by how much a given level of the series exceeds the level taken as the base of comparison. With a constant base: Δu = u_i – u_0; with a variable (chain) base: Δu = u_i – u_(i-1).

Absolute acceleration is the difference between the absolute growth for a given period and the absolute growth for the previous period of the same duration: Δ' = Δu_i – Δu_(i-1). Absolute acceleration shows by how much the speed of change of the indicator has increased (or decreased). The acceleration indicator is applied to chain absolute increments; a negative value indicates a slowdown of growth or an acceleration of the decline in the series levels.
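A short sketch of these indicators on an invented series:

```python
# Base and chain absolute growth, and absolute acceleration of chain growth.
import numpy as np

u = np.array([100.0, 108.0, 114.0, 126.0, 131.0])

base_growth = u[1:] - u[0]              # Δu_i = u_i - u_0
chain_growth = np.diff(u)               # Δu_i = u_i - u_(i-1)
acceleration = np.diff(chain_growth)    # Δ'_i = Δu_i - Δu_(i-1)

print("base: ", base_growth)
print("chain:", chain_growth)
print("accel:", acceleration)
```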

26. Indicators of relative changes in the levels of a series of dynamics.

Growth rate (growth coefficient) is the ratio of two compared levels; it shows how many times the given level exceeds the level of the base period. It reflects the intensity of change in the levels of the dynamics series: how many times the level has increased compared with the base level or, in the case of a decrease, what fraction of the base level the compared level constitutes.

Formulas for the growth coefficient: with a constant base, K_i = y_i / y_0; with a variable base, K_i = y_i / y_(i-1).

The growth rate is the growth coefficient expressed as a percentage:

T_r = K · 100%.

Growth rates for any time series are interval indicators, i.e. they characterize a particular period (interval) of time.

The rate of increase is the relative magnitude of the increase, i.e. the ratio of the absolute growth to the previous or base level. It characterizes by what percentage the level of a given period is greater (or less) than the base level.

It is calculated as the ratio of the absolute growth to the level taken as the base of comparison:

T_pr = (u_i – u_0) / u_0 · 100%,

or, equivalently, as the difference between the growth rate (in percent) and 100: T_pr = T_r – 100.
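The same invented series illustrates the relative indicators:

```python
# Growth coefficients (fixed and variable base), growth rate and rate of
# increase in percent, following the formulas above.
import numpy as np

y = np.array([100.0, 108.0, 114.0, 126.0, 131.0])

K_base = y[1:] / y[0]       # K_i = y_i / y_0
K_chain = y[1:] / y[:-1]    # K_i = y_i / y_(i-1)
T_r = K_chain * 100         # growth rate, %
T_pr = T_r - 100            # rate of increase, %

print("K (base): ", np.round(K_base, 3))
print("K (chain):", np.round(K_chain, 3))
print("T_r, %:   ", np.round(T_r, 1))
print("T_pr, %:  ", np.round(T_pr, 1))
```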

You will need

  • a distribution series of the dependent and the independent variable;
  • paper and pencil;
  • a computer with a spreadsheet program.

Instructions

Choose two indicators that you believe are related; usually these are quantities that change over time. Note that one of the variables must be independent: it will act as the cause. The second should change along with it, decreasing, increasing, or varying randomly.

Measure the value of the dependent variable for each value of the independent variable. Record the results in a table with two rows or two columns. To detect the presence of a connection, at least 30 readings are needed; for a more accurate result, make sure there are at least 100 points.

Construct a coordinate plane, plotting the values of the dependent variable on the ordinate axis and those of the independent variable on the abscissa axis. Label the axes and indicate the units of measurement for each indicator.

Mark the points of the correlation field on the graph. On the x-axis, find the first value of the independent variable, and on the y-axis the corresponding value of the dependent variable. Draw perpendiculars from these projections; their intersection gives the first point. Mark it, circling it with a soft pencil or pen. Construct all the other points in the same way.

The resulting set of points is called the correlation field. Analyze the resulting graph and draw conclusions about the presence of a strong or weak cause-and-effect relationship, or its absence.

Pay attention to occasional deviations on the graph. If a linear or other relationship can generally be traced but the overall picture is spoiled by one or two points lying apart from the general population, these may be caused by random errors and can be left out when interpreting the graph.

If you need to build and analyze a correlation field for large amounts of data, use spreadsheet programs such as Excel, or obtain specialized software.
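For example, the manual procedure above can be reproduced programmatically; this is a hedged sketch with synthetic data, using matplotlib rather than any particular spreadsheet:

```python
# Build a correlation field (scatter plot) for synthetic, noisy linear data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                   # independent variable
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 100)   # dependent variable with noise

plt.scatter(x, y, s=12)
plt.xlabel("x, independent variable (units)")
plt.ylabel("y, dependent variable (units)")
plt.title("Correlation field")
plt.show()
```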

The relationship between several quantities, in which a change in one leads to changes in the others, is called correlation. It can be simple, multiple or partial. This concept is used not only in mathematics but also in biology.

The word “correlation” comes from the Latin correlatio, interrelation. All phenomena, events and objects, as well as the quantities that characterize them, are interconnected. Correlation dependence differs from functional dependence in that it can be measured only on average, approximately: a dependent variable responds to changes in an independent one only with a certain degree of probability. The degree of dependence is measured by the correlation coefficient. In biology, correlation denotes the relationship between the structure and functions of individual parts of the body.

The concept of correlation is used quite often by statisticians. In statistics it is the relationship between statistical quantities, series and groups, and a special method is used to establish whether a correlation exists. The correlation method serves to detect direct or inverse parallel changes in the series being compared and, when such changes are found, to measure the degree of that parallelism. Internal cause-and-effect factors are not revealed in this way; detecting such causal dependencies for other sciences is one of the main tasks of statistics.

In form, a correlation relationship can be linear or nonlinear, positive or negative. If, as one variable increases or decreases, the other also increases or decreases, the relationship is linear. If, when one quantity changes, the character of the change in the other is nonlinear, the correlation is nonlinear. A correlation is considered positive when an increase in the level of one value is accompanied by an increase in the level of the other, for example when an increase in the loudness of a sound is accompanied by a sensation of an increase in its pitch. A correlation in which an increase in the level of one variable is accompanied by a decrease in the level of the other is called negative: in animal communities, an increased level of anxiety in an individual lowers the probability that this individual will occupy a dominant position among its fellows. When there is no connection between the variables, the correlation is called zero.


Correlation is the mutual dependence of two random variables (usually two groups of values), in which a change in one of them leads to a change in the other. The correlation coefficient shows how likely the second value is to change when the values of the first change, i.e. the degree of its dependence. The easiest way to calculate it is to use the corresponding function built into the Microsoft Office Excel spreadsheet editor.

You will need

  • Microsoft Office Excel spreadsheet editor.

Instructions

Launch Excel and open a document containing the groups of data whose correlation coefficient you want to calculate. If such a document does not yet exist, enter the data into a new table; the spreadsheet editor creates one automatically when you start the program. Enter each group of values whose correlation interests you in a separate column. These do not have to be adjacent columns; you are free to design the table in the most convenient way, adding extra columns with explanations of the data, column headings, summary cells with totals or averages, and so on. You can even arrange the data horizontally (in rows) rather than vertically (in columns). The only requirement is that the cells with the data of each group be located sequentially one after another, so that a continuous array is formed.

Go to the cell that should contain the correlation value of the two arrays, and click on the “Formulas” tab in the Excel menu. In the “Function Library” command group, click on the last icon, “More Functions”. A drop-down list will open; go to the “Statistical” section and select the CORREL function. The Function Wizard window with a form to fill in will open. The same window can also be called up without the “Formulas” tab, by simply clicking the insert-function icon to the left of the formula bar.

Specify the first group of correlating data in the “Array1” field of the Function Wizard. To enter a range of cells manually, type the addresses of the first and last cells, separating them with a colon (no spaces). Alternatively, simply select the desired range with the mouse, and Excel will place the required entry in the form field on its own. Do the same for the second group of data in the “Array2” field.

Click OK. The spreadsheet editor will calculate and display the correlation value in the cell with the formula. If necessary, you can save this document for future use (keyboard shortcut Ctrl + S).
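Outside Excel, the same number can be obtained with NumPy; the arrays below are placeholders standing in for your two data ranges:

```python
# numpy.corrcoef returns the Pearson correlation matrix; element [0, 1]
# matches what Excel's CORREL would return for the two ranges.
import numpy as np

array1 = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.0])
array2 = np.array([2.0, 2.9, 3.8, 4.2, 5.1, 5.8])

r = np.corrcoef(array1, array2)[0, 1]
print(f"correlation coefficient: {r:.4f}")
```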

We construct a correlation field for the main and associated components. On the abscissa axis we plot the content of the main component, in this case Hg, and on the ordinate axis we plot the content of the associated component, i.e. Sn.

To make a preliminary assessment of the strength of the connection in the correlation field, draw lines corresponding to the median values of the main and associated components, dividing the field into four quadrants.

A quantitative measure of the strength of the connection is the correlation coefficient. An approximate estimate can be obtained from the quadrant counts; the standard quadrant formula is

r ≈ sin( (π/2) · (n1 – n2) / (n1 + n2) ),

where n1 is the total number of points in quadrants I and III and n2 is the total number of points in quadrants II and IV.

In our case the counts are: I = 4, II = 8, III = 7, IV = 5.
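A sketch of this quadrant estimate, assuming the sine formula given above (the sign of the result depends on how the quadrants are numbered relative to the medians):

```python
# Quadrant (median) estimate of r from the counts quoted in the text.
import math

n_I, n_II, n_III, n_IV = 4, 8, 7, 5
n1 = n_I + n_III        # points in quadrants I and III
n2 = n_II + n_IV        # points in quadrants II and IV

r_est = math.sin(math.pi / 2 * (n1 - n2) / (n1 + n2))
print(f"quadrant estimate of r: {r_est:.2f}")
```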

Next, using the computer-calculated initial data (Xavg, Yavg, the variances Dx and Dy, and the covariance cov(x, y)), we calculate the correlation coefficient r and the parameters of the linear regression equations of the associated component on the principal one and of the principal component on the associated one.

We calculate using the formulas shown below.

Initial data: cov(x, y) = 163.86; Dx = 157.27; Dy = 645.61; Xavg = 36.75; Yavg = 153.13.

r = cov(x, y) / √(Dx · Dy) = 163.86 / √(157.27 · 645.61) = 0.51

b = cov(x, y)/Dx = 163.86/157.27= 1.04

a = Yavg – b * Xavg = 153.13 – 1.04 * 36.75 = 114.91

d = cov(x, y)/ Dy = 163.86/645.61= 0.25

c = Xavg – d * Yavg = 36.75 – 0.25 * 153.13 = -1.5

y = 114.91 + 1.04x    x = -1.5 + 0.25y

We build regression lines on the correlation field.
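The coefficients can be rechecked from the quoted summary statistics; note that using the unrounded d shifts c slightly (the text rounds d to 0.25 first):

```python
# Recompute r and the regression coefficients from the summary statistics.
import math

cov_xy = 163.86
Dx, Dy = 157.27, 645.61
x_mean, y_mean = 36.75, 153.13

r = cov_xy / math.sqrt(Dx * Dy)   # ≈ 0.51
b = cov_xy / Dx                   # ≈ 1.04
a = y_mean - b * x_mean           # ≈ 114.8 (114.91 with b rounded to 1.04)
d = cov_xy / Dy                   # ≈ 0.254 (0.25 rounded, as in the text)
c = x_mean - d * y_mean           # ≈ -2.1 unrounded; -1.5 with d = 0.25

print(f"r = {r:.2f}")
print(f"y = {a:.2f} + {b:.2f}x")
print(f"x = {c:.2f} + {d:.2f}y")
```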

Stage 7. Testing the hypothesis about the presence of a correlation relationship

Testing the hypothesis of a correlation relationship relies on the fact that, for a two-dimensional normally distributed random vector (X, Y), the correlation coefficient equals 0 when X and Y are uncorrelated. To test the hypothesis of no correlation, the value of the criterion is calculated:

t = r · √(N – 2) / √(1 – r²) = 0.51 · √(24 – 2) / √(1 – 0.51²) ≈ 2.78

For our values, t ≈ 2.78.

Table value ttab = 2.02

Since the calculated t value exceeds the table value, the hypothesis about the absence of a correlation is rejected. There is a connection.
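The criterion can be reproduced in a few lines, with the critical value taken from SciPy instead of a printed table (for df = 22 and a two-sided 5% level SciPy gives about 2.07):

```python
# Significance test for r: t = r * sqrt(N-2) / sqrt(1 - r^2).
import math
from scipy import stats

r, N = 0.51, 24
t = r * math.sqrt(N - 2) / math.sqrt(1 - r**2)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=N - 2)   # two-sided, alpha = 0.05

print(f"t = {t:.2f}, t_crit = {t_crit:.2f}")
print("reject H0: no correlation" if abs(t) > t_crit else "fail to reject H0")
```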

Stage 8. Construction of empirical regression lines. Calculating the correlation ratio

The selected data are grouped into classes according to the content of the main component, in this case Hg. To do this, the entire range of values, from the minimum to the maximum content of the main useful component, is divided into 6 intervals. For each interval:

    The number of values falling into the interval, n(i), is determined;

    The values of the associated component corresponding to the main-component values in the interval are summed, and the sum is divided by n(i) to obtain the group mean y(i, av).

Table 3. Interval boundaries, counts n(i) and group means y(i, av) (values not reproduced).

We build an empirical regression line on the correlation field.

d_total = √Dy = 25.4

d_cond = √( Σ n(i) · (y(i, av) – Yavg)² / N ) = √66.14 ≈ 8.13

The correlation ratio of the associated component to the main one is calculated by the formula:

η = d_cond / d_total = 8.13 / 25.4 ≈ 0.32
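A hedged sketch of the correlation-ratio calculation, assuming the standard grouped-data definition; the interval counts and group means below are placeholders, since Table 3 is not reproduced:

```python
# Empirical correlation ratio eta = sqrt(D_between / D_total) for grouped data.
import numpy as np

n_i = np.array([3, 5, 6, 4, 4, 2])                       # counts (hypothetical)
ybar_i = np.array([130., 142., 150., 158., 166., 175.])  # group means (hypothetical)
Dy = 645.61                                              # total variance from the text

N = n_i.sum()
ybar = np.sum(n_i * ybar_i) / N
D_between = np.sum(n_i * (ybar_i - ybar) ** 2) / N
eta = np.sqrt(D_between / Dy)
print(f"eta = {eta:.2f}")
```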

For an experimental study of the dependence between random variables x and y, a series of independent experiments is carried out. The i-th experiment yields a pair of values (x_i, y_i), i = 1, 2, …, n.

Quantities characterizing various properties of objects can be independent or interrelated. The forms of manifestation of relationships are very diverse. The two most common types are functional (complete) and correlation (incomplete) connections.

With a functional dependence, a value of one quantity x necessarily corresponds to one or several precisely defined values of the other quantity y. Functional connections appear quite often in physics and chemistry. In real situations there is an infinitely large number of properties of the object itself and of the external environment that influence one another, so connections of this kind do not exist in a pure form; in other words, functional connections are mathematical abstractions.

The influence of common factors and the presence of objective patterns in the behavior of objects lead only to the manifestation of statistical dependence. Statistical dependence is a dependence in which a change in one of the quantities entails a change in the distribution of the others (another), and these other quantities take certain values with certain probabilities. Functional dependence should then be considered a special case of statistical dependence: a value of one factor corresponds to values of the other factors with a probability equal to one. An even more important special case of statistical dependence is correlation dependence, which characterizes the relationship between the values of some random variables and the average value of others, although in each individual case any of the interrelated quantities can take different values.

A correlation relationship (also called incomplete, or statistical) appears on average, over mass observations, when given values of the independent variable correspond to a set of probable values of the dependent variable. The explanation lies in the complexity of the relationships between the analyzed factors, whose interaction is affected by unaccounted random variables. Hence the connection between the characteristics appears only on average, in the mass of cases. In a correlation relationship, each value of the argument corresponds to values of the function randomly distributed within a certain interval.

The term “correlation” was first used by the French paleontologist Georges Cuvier, who derived the “law of correlation of the parts and organs of animals” (this law makes it possible to reconstruct the appearance of an entire animal from found body parts). The term was introduced into statistics by the English biologist and statistician Francis Galton (not simply a relation, but “as if a connection”: co-relation).

Correlation dependencies are found everywhere. For example, in agriculture, this could be the relationship between yield and the amount of fertilizer applied. Obviously, the latter are involved in the formation of the crop. But for each specific field or plot, the same amount of applied fertilizer will cause a different increase in yield, since a number of other factors interact (weather, soil condition, etc.), which form the final result. However, on average, such a relationship is observed - an increase in the mass of applied fertilizers leads to an increase in yield.

The simplest method for identifying connections between the characteristics being studied is to construct a correlation table; its visual representation is the correlation field. This is a graph where the values of x are plotted on the abscissa axis and the corresponding values of y on the ordinate axis. By the location of the points and their concentration in a particular direction, one can qualitatively judge the presence of a connection.

Fig. 7.3.

A positive correlation between random variables, close to a parabolic functional dependence, is shown in Fig. 6.1a. Fig. 6.1b shows an example of weak negative correlation, and Fig. 6.1c an example of practically uncorrelated random variables. The correlation is high if the dependence can be represented on the graph by a straight line (with a positive or negative slope).

There are two types of dependencies between economic phenomena: functional and statistical. A relationship between two quantities X and Y, reflecting two phenomena respectively, is called functional if each value of the quantity X corresponds to a single value of the quantity Y, and vice versa. An example of a functional connection in economics is the dependence of labor productivity on the volume of output produced and the expenditure of working time. It should be noted that if X is a deterministic, non-random quantity, then a quantity Y functionally dependent on it is also deterministic; if X is a random variable, then Y is also a random variable.

However, much more often economics exhibits not functional but statistical dependence, when each fixed value of the independent variable X corresponds not to one but to many values of the dependent variable Y, and it is impossible to say in advance which value Y will take. This is because, besides the variable X, numerous uncontrolled random factors also influence Y. In this situation Y is a random variable, while the variable X can be either deterministic or random.

A special case of statistical dependence is correlation dependence, in which a functional relationship connects the factor X and the average value (mathematical expectation) of the resultant indicator Y. Statistical dependence can be revealed only from the results of a sufficiently large number of observations. Graphically, the statistical dependence of two characteristics can be represented by a correlation field: when it is constructed, the value of the factor characteristic X is plotted on the abscissa axis and the resultant Y on the ordinate axis.

Correlation is a special case of statistical relationship, in which different values of one variable correspond to different average values of another variable. Correlation assumes that the variables being studied have a quantitative expression.

If the relationship between two characteristics is studied, the correlation is pairwise; if the relationship among many characteristics is studied, it is multiple.

As an example, Fig. 1 presents data illustrating a direct relationship between x and y (Fig. 1a) and an inverse relationship (Fig. 1b). Case (a) is a direct relationship, for example between average per capita income (x) and family savings (y). Case (b) is an inverse relationship, as in our example of labor productivity (x) and unit cost (y). In Fig. 1, each point characterizes an observed object with its own values of x and y.

Fig. 1. Correlation field

Fig. 1 also shows straight lines, the graphs of linear regression equations of the form ŷ = a + bx, characterizing the functional relationship between the independent variable x and the average value of the resultant indicator y. Thus, according to the regression equation, knowing x one can recover only the average value of y.


