How to calculate standard deviation

Values obtained from experiment inevitably contain errors arising from a wide variety of causes. Among them, one should distinguish systematic and random errors. Systematic errors are caused by factors that act in a definite way and can always be eliminated or accounted for quite accurately. Random errors are caused by a very large number of individual causes that cannot be accounted for exactly and that act differently in each individual measurement. These errors cannot be completely excluded; they can only be accounted for on average, for which it is necessary to know the laws that govern random errors.

We will denote the measured quantity by A, and the random error in the measurement by x. Since the error x can take on any value, it is a continuous random variable, which is fully characterized by its distribution law.

The simplest law, and the one that most accurately reflects reality in the vast majority of cases, is the so-called normal law of error distribution:

f(x) = (1 / (σ√(2π))) · e^(-x² / (2σ²))

This distribution law can be derived from various theoretical premises, in particular from the requirement that the most probable value of an unknown quantity, for which a series of equally accurate values is obtained by direct measurement, be the arithmetic mean of these values. The quantity σ² is called the dispersion of this normal law.

Arithmetic mean

Determination of dispersion from experimental data. If for a quantity A, n values aᵢ are obtained by direct measurements of equal accuracy, and if the errors of A obey the normal distribution law, then the most probable value of A is the arithmetic mean:

a = (a₁ + a₂ + … + aₙ) / n = (1/n) Σ aᵢ

where

a - arithmetic mean,

aᵢ - measured value at the i-th step.

The deviation of each observed value aᵢ of the quantity A from the arithmetic mean is: aᵢ - a.

To determine the dispersion of the normal error distribution law in this case, the following formula is used:

σ² = Σ (aᵢ - a)² / (n - 1)

where

σ² - dispersion,
a - arithmetic mean,
n - number of parameter measurements,
aᵢ - measured value at the i-th step.

Standard deviation

The standard deviation shows the absolute deviation of the measured values from the arithmetic mean:

σ = √( Σ (aᵢ - a)² / (n - 1) )

In accordance with the formula for the measure of accuracy of a linear combination, the mean square error of the arithmetic mean itself is determined by the formula:

σₐ = σ / √n

where

a - arithmetic mean,
n - number of parameter measurements,
aᵢ - measured value at the i-th step.
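These estimates can be sketched in a few lines of Python (assuming, as above, the n - 1 convention for σ and σ/√n for the error of the mean; the measurement data are made up for illustration):

```python
import math

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    # unbiased estimate: divide the sum of squared deviations by n - 1
    a = mean(values)
    n = len(values)
    return math.sqrt(sum((x - a) ** 2 for x in values) / (n - 1))

def std_error_of_mean(values):
    # mean square error of the arithmetic mean: sigma / sqrt(n)
    return std_dev(values) / math.sqrt(len(values))

measurements = [12, 9, 11, 10, 8]  # hypothetical repeated measurements
print(mean(measurements))                         # 10.0
print(round(std_dev(measurements), 4))            # 1.5811
print(round(std_error_of_mean(measurements), 4))  # 0.7071
```

Note that the error of the mean shrinks as √n grows, which is why repeating a measurement improves accuracy.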

Coefficient of variation

The coefficient of variation characterizes the relative measure of deviation of the measured values from the arithmetic mean:

V = (σ / a) · 100%

where

V - coefficient of variation,
σ - standard deviation,
a - arithmetic mean.

The higher the coefficient of variation, the relatively greater the scatter and the lower the uniformity of the studied values. If the coefficient of variation is less than 10%, the variability of the variation series is considered insignificant; from 10% to 20%, average; more than 20% but not more than 33%, significant; and if the coefficient of variation exceeds 33%, this indicates heterogeneity of the data and the need to exclude the largest and smallest values.
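A minimal sketch of this classification in Python (the handling of values exactly at the 10/20/33% boundaries is my own choice, since the text leaves them open):

```python
def variability_class(cv_percent):
    """Classify a variation series by its coefficient of variation,
    using the thresholds given in the text (10%, 20%, 33%)."""
    if cv_percent < 10:
        return "insignificant"
    elif cv_percent < 20:
        return "average"
    elif cv_percent <= 33:
        return "significant"
    else:
        return "heterogeneous (consider excluding extreme values)"

print(variability_class(7))   # insignificant
print(variability_class(15))  # average
print(variability_class(25))  # significant
print(variability_class(40))  # heterogeneous (consider excluding extreme values)
```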

Average linear deviation

One of the indicators of the scope and intensity of variation is the average linear deviation (the average of the absolute deviations) from the arithmetic mean. The average linear deviation is calculated by the formula:

d̄ = (1/n) Σ |aᵢ - a|

where

d̄ - average linear deviation,
a - arithmetic mean,
n - number of parameter measurements,
aᵢ - measured value at the i-th step.
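A short Python sketch of this formula (the data are illustrative, not from the article):

```python
def mean_linear_deviation(values):
    # average of the absolute deviations from the arithmetic mean
    a = sum(values) / len(values)
    return sum(abs(x - a) for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, mean = 5
print(mean_linear_deviation(data))  # 1.5
```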

To check whether the studied values conform to the normal distribution law, the ratio of the asymmetry (skewness) indicator to its error and the ratio of the kurtosis indicator to its error are used.

Asymmetry indicator

The asymmetry indicator (A) and its error (mₐ) are calculated using the following formulas:

A = Σ (aᵢ - a)³ / (n · σ³),  mₐ = √(6 / n)

where

A - asymmetry indicator,
σ - standard deviation,
a - arithmetic mean,
n - number of parameter measurements,
aᵢ - measured value at the i-th step.

Kurtosis indicator

The kurtosis indicator (E) and its error (mₑ) are calculated using the following formulas:

E = Σ (Xᵢ - X̄)⁴ / (n · σ⁴) - 3,  mₑ = √(24 / n)

where

Xᵢ - random (current) variables;

the average value of the random variables for the sample is calculated using the formula:

X̄ = Σ Xᵢ / n
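Both indicators and their errors can be sketched in Python. I am assuming the common textbook formulas named in the comments; E here is the excess kurtosis, so it is 0 for a normal distribution:

```python
import math

def skewness_kurtosis(values):
    """Skewness A and excess kurtosis E with their errors m_a, m_e.
    Assumed formulas: A = sum((x-m)^3) / (n*s^3),
                      E = sum((x-m)^4) / (n*s^4) - 3,
                      m_a = sqrt(6/n), m_e = sqrt(24/n)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((x - m) ** 2 for x in values) / n)  # population sigma
    A = sum((x - m) ** 3 for x in values) / (n * s ** 3)
    E = sum((x - m) ** 4 for x in values) / (n * s ** 4) - 3
    return A, E, math.sqrt(6 / n), math.sqrt(24 / n)

A, E, m_a, m_e = skewness_kurtosis([1, 2, 3, 4, 5])
print(A)  # 0.0: a symmetric set has zero skewness
```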

So, variance is the average square of deviations. That is, the average value is calculated first; then the difference between each original value and the average is taken, squared, summed, and the result is divided by the number of values in the given population.

The difference between an individual value and the average reflects the measure of deviation. It is squared so that all deviations become exclusively positive numbers and to avoid the mutual cancellation of positive and negative deviations when summing them. Then, given the squared deviations, we simply calculate their arithmetic mean.

The answer to the magic word “dispersion” lies in just these three words: average - square - deviations.

Standard deviation (RMS)

Taking the square root of the variance, we obtain the so-called standard deviation. It is also called the "root mean square deviation" or "sigma" (after the Greek letter σ). The formula for the standard deviation is:

σ = √( Σ (xᵢ - x̄)² / n )

So, dispersion is sigma squared, i.e., the standard deviation squared.

The standard deviation obviously also characterizes the measure of data dispersion, but now (unlike the variance) it can be compared with the original data, since they have the same units of measurement (this is clear from the calculation formula). The standard deviation, as a measure of uncertainty, is also involved in many statistical calculations. With its help, the degree of accuracy of various estimates and forecasts is determined. If the variation is very large, the standard deviation will also be large, and therefore the forecast will be inaccurate, which will be expressed, for example, in very wide confidence intervals.

Therefore, in methods of statistical data processing in real estate assessments, depending on the required accuracy of the task, the rule of two or three sigma is used.

To compare the two-sigma rule and the three-sigma rule, we use Laplace's formula:

P(α < X < β) = Φ((β - a)/s) - Φ((α - a)/s)

where

Φ(x) - the Laplace function;
α - minimum value;
β - maximum value;
s - sigma value (standard deviation);
a - average.

In this case, a particular form of Laplace's formula is used, in which the boundaries α and β of the values of the random variable X are equally spaced from the center of the distribution a = M(X) by a certain amount d: α = a - d, β = a + d. Then

P(|X - a| < d) = 2Φ(d/s). (1)

Formula (1) gives the probability of a given deviation d of a random variable X with a normal distribution law from its mathematical expectation M(X) = a. Taking d = 2s and d = 3s in formula (1) successively, we obtain:

P(|X - a| < 2s) = 2Φ(2) ≈ 0.954, (2)

P(|X - a| < 3s) = 2Φ(3) ≈ 0.9973. (3)
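Formula (1) can be checked numerically with Python's math.erf, taking the Laplace function in the convention Φ(x) = P(0 < Z < x) = 0.5·erf(x/√2):

```python
import math

def laplace_phi(x):
    # Laplace function: probability that a standard normal Z falls in (0, x)
    return 0.5 * math.erf(x / math.sqrt(2))

def prob_within(d_over_sigma):
    # formula (1): P(|X - a| < d) = 2 * Phi(d / sigma)
    return 2 * laplace_phi(d_over_sigma)

print(round(prob_within(2), 4))  # 0.9545  (two-sigma rule)
print(round(prob_within(3), 4))  # 0.9973  (three-sigma rule)
```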

Two sigma rule

It can be asserted almost reliably (with a confidence probability of 0.954) that all values of a random variable X with a normal distribution law deviate from its mathematical expectation M(X) = a by no more than 2s (two standard deviations). Confidence probability (Pd) is the probability of events that are conventionally accepted as reliable (their probability is close to 1).

Let's illustrate the two-sigma rule geometrically. In Fig. Figure 6 shows a Gaussian curve with the distribution center a. The area bounded by the entire curve and the Ox axis is equal to 1 (100%), and the area of ​​the curvilinear trapezoid between the abscissas a–2s and a+2s, according to the two-sigma rule, is equal to 0.954 (95.4% of the total area). The area of ​​the shaded areas is 1-0.954 = 0.046 (»5% of the total area). These areas are called the critical region of the random variable. Values ​​of a random variable falling into the critical region are unlikely and in practice are conventionally accepted as impossible.

The probability of conditionally impossible values is called the significance level of the random variable. The significance level is related to the confidence probability by the formula:

q = (1 - Pd) · 100%

where q is the significance level expressed as a percentage.

Three sigma rule

When solving problems that require greater reliability, when the confidence probability (Pd) is taken equal to 0.997 (more precisely, 0.9973), the three-sigma rule is used instead of the two-sigma rule, in accordance with formula (3).



According to the three-sigma rule, with a confidence probability of 0.9973, the critical region will be the region of attribute values outside the interval (a - 3s, a + 3s). The significance level is 0.27%.

In other words, the probability that the absolute value of the deviation will exceed three times the standard deviation is very small, namely 0.0027 = 1 - 0.9973. This will happen in only 0.27% of cases. Such events, based on the principle of the impossibility of unlikely events, can be considered practically impossible. That is, the sample is highly accurate.

This is the essence of the three sigma rule:

If a random variable is distributed normally, then the absolute value of its deviation from the mathematical expectation does not exceed three times the standard deviation (MSD).

In practice, the three-sigma rule is applied as follows: if the distribution of the random variable being studied is unknown, but the condition specified in the above rule is met, then there is reason to assume that the variable being studied is normally distributed; otherwise it is not normally distributed.

The level of significance is taken depending on the permitted degree of risk and the task at hand. For real estate valuation, a less precise sample is usually adopted, following the two-sigma rule.

In this article I will talk about how to find the standard deviation. This material is extremely important for a full understanding of mathematics, so a math tutor should devote a separate lesson, or even several, to studying it.

The standard deviation makes it possible to evaluate the spread of values obtained as a result of measuring a certain parameter. It is denoted by the symbol σ (the Greek letter "sigma").

The formula for calculation is quite simple. To find the standard deviation, you need to take the square root of the variance. So now you have to ask, “What is variance?”

What is variance

The definition of variance goes like this. Dispersion is the arithmetic mean of the squared deviations of values ​​from the mean.

To find the variance, perform the following calculations sequentially:

  • Determine the average (the simple arithmetic mean of the series of values).
  • Subtract the average from each value and square the resulting difference (the squared difference).
  • Calculate the arithmetic mean of the resulting squared differences (you can find out below why the squares, exactly).
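The three steps above translate directly into Python:

```python
def variance(values):
    # Step 1: the mean
    mean = sum(values) / len(values)
    # Step 2: squared differences from the mean
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Step 3: the mean of the squared differences
    return sum(squared_diffs) / len(squared_diffs)

print(variance([1, 2, 3, 4, 5]))  # 2.0
```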

Let's look at an example. Let's say you and your friends decide to measure the height of your dogs (in millimeters). As a result of the measurements, you received the following height measurements (at the withers): 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.

Let's calculate the mean, variance and standard deviation.

First let's find the average value. As you already know, to do this you need to add up all the measured values ​​and divide by the number of measurements. Calculation progress:

Average mm.

So, the average (arithmetic mean) is 394 mm.

Now we need to determine the deviation of each dog's height from the average:

600 - 394 = 206
470 - 394 = 76
170 - 394 = -224
430 - 394 = 36
300 - 394 = -94

Finally, to calculate the variance, we square each of the resulting differences and then find the arithmetic mean of the results:

Dispersion = (206² + 76² + (-224)² + 36² + (-94)²) / 5 = 108520 / 5 = 21704 mm².

Thus, the dispersion is 21704 mm².

How to find standard deviation

So how can we now calculate the standard deviation, knowing the variance? As we remember, we take the square root of it. That is, the standard deviation is equal to:

σ = √21704 ≈ 147 mm (rounded to the nearest whole number of mm).
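The whole dog calculation can be reproduced in Python:

```python
heights = [600, 470, 170, 430, 300]  # dog heights at the withers, mm

mean = sum(heights) / len(heights)
deviations = [h - mean for h in heights]
variance = sum(d ** 2 for d in deviations) / len(heights)
std_dev = variance ** 0.5

print(mean)            # 394.0
print(variance)        # 21704.0
print(round(std_dev))  # 147
```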

Using this method, we found that some dogs (for example, Rottweilers) are very large dogs. But there are also very small dogs (for example, dachshunds, but you shouldn’t tell them that).

The most interesting thing is that the standard deviation carries useful information. Now we can show which of the obtained height measurement results are within the interval that we get if we plot the standard deviation from the average (to both sides of it).

That is, using the standard deviation, we obtain a "standard" method that allows us to find out which of the values is normal (statistically average), and which is extraordinarily large or, conversely, small.

What is standard deviation

But... everything will be a little different if we analyze sample data. In our example we considered the general population. That is, our 5 dogs were the only dogs in the world that interested us.

But if the data is a sample (values ​​selected from a large population), then the calculations need to be done differently.

If there are n values, then the sample variance is calculated as:

Sample variance = Σ (aᵢ - a)² / (n - 1)

All other calculations are carried out similarly, including the determination of the average.

For example, if our five dogs are just a sample of the population of dogs (all dogs on the planet), we must divide by 4 instead of 5, namely:

Sample variance = 108520 / 4 = 27130 mm².

In this case, the standard deviation for the sample is equal to √27130 ≈ 165 mm (rounded to the nearest whole number).

We can say that we have made some “correction” in the case where our values ​​are just a small sample.
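The sample correction in Python, with a cross-check against the standard library (statistics.variance divides by n - 1, while statistics.pvariance divides by n):

```python
import statistics

heights = [600, 470, 170, 430, 300]  # the same five dogs, now viewed as a sample

mean = sum(heights) / len(heights)
sum_sq = sum((h - mean) ** 2 for h in heights)

sample_variance = sum_sq / (len(heights) - 1)  # Bessel's correction: divide by n - 1
sample_std = sample_variance ** 0.5

print(sample_variance)    # 27130.0
print(round(sample_std))  # 165

# the population versions from the earlier calculation for comparison
print(statistics.pvariance(heights))
print(statistics.variance(heights))
```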

Note. Why exactly squared differences?

But why do we take exactly the squared differences when calculating the variance? Let's say that when measuring some parameter you received the following set of values: 4; 4; -4; -4. If we simply add the deviations from the average (the differences) together, the negative values cancel out the positive ones:

4 + 4 + (-4) + (-4) = 0.

It turns out that this option is useless. Then maybe it's worth trying the absolute values of the deviations (that is, the modules of these values)?

At first glance this works out well (the resulting value, by the way, is called the mean absolute deviation), but not in all cases. Let's try another example. Suppose the measurements produce the following set of values: 7; 1; -6; -2. Then the mean absolute deviation is:

(|7| + |1| + |-6| + |-2|) / 4 = 16 / 4 = 4.

Wow! Again we got a result of 4, although the differences have a much larger spread.

Now let's see what happens if we square the differences (and then take the square root of their sum).

For the first example it will be:

.

For the second example it will be:

Now it’s a completely different matter! The greater the spread of the differences, the greater the standard deviation... which is what we were aiming for.
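The two examples can be compared in Python, computing both the mean absolute deviation and the root-mean-square deviation:

```python
import math

def mean_abs_deviation(values):
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def rms_deviation(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

set1 = [4, 4, -4, -4]  # mean 0
set2 = [7, 1, -6, -2]  # mean 0, but a wider spread

print(mean_abs_deviation(set1), mean_abs_deviation(set2))   # 4.0 4.0
print(rms_deviation(set1), round(rms_deviation(set2), 2))   # 4.0 4.74
```

The absolute deviations cannot tell the two sets apart, while the root-mean-square deviation grows with the spread.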

In fact, this method uses the same idea as when calculating the distance between points, only applied in a different way.

And from a mathematical point of view, using squares and square roots provides more benefits than we could get from absolute deviation values, making standard deviation applicable to other mathematical problems.

Sergey Valerievich told you how to find the standard deviation.

An approximate method for assessing the variability of a variation series is to determine its limits and amplitude, but the values of the variants within the series are not taken into account. The main generally accepted measure of the variability of a quantitative characteristic within a variation series is the standard deviation (σ, sigma). The larger the standard deviation, the higher the degree of fluctuation of the series.

The method for calculating the standard deviation includes the following steps:

1. Find the arithmetic mean (M).

2. Determine the deviations of individual options from the arithmetic mean (d=V-M). In medical statistics, deviations from the average are designated as d (deviate). The sum of all deviations is zero.

3. Square each deviation: d².

4. Multiply the squares of the deviations by the corresponding frequencies: d² · p.

5. Find the sum of the products: Σ(d² · p).

6. Calculate the standard deviation using the formula:

σ = √( Σ(d² · p) / n ) when n is greater than 30, or σ = √( Σ(d² · p) / (n - 1) ) when n is less than or equal to 30, where n is the number of all variants.
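A sketch of these six steps in Python, with made-up grouped data (the variants and frequencies are hypothetical):

```python
import math

def weighted_sigma(variants, freqs):
    """Standard deviation of a variation series given variants V and
    frequencies p, following the steps in the text: divide by n when
    n > 30 and by n - 1 otherwise."""
    n = sum(freqs)
    M = sum(v * p for v, p in zip(variants, freqs)) / n             # step 1: mean
    ssd = sum(((v - M) ** 2) * p for v, p in zip(variants, freqs))  # steps 2-5
    return math.sqrt(ssd / (n if n > 30 else n - 1))                # step 6

V = [10, 12, 14, 16]  # hypothetical variant values
p = [2, 5, 8, 5]      # their frequencies, n = 20
print(round(weighted_sigma(V, p), 3))  # 1.903
```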

Standard deviation value:

1. The standard deviation characterizes the spread of the variant relative to the average value (i.e., the variability of the variation series). The greater the sigma, the higher the degree of diversity of this series.

2. The standard deviation is used for a comparative assessment of the degree of correspondence of the arithmetic mean to the variation series for which it was calculated.

Variations of mass phenomena obey the law of normal distribution. The curve representing this distribution looks like a smooth, bell-shaped, symmetrical curve (Gaussian curve). According to probability theory, in phenomena that obey the law of normal distribution, there is a strict mathematical relationship between the value of the arithmetic mean and the standard deviation. The theoretical distribution of variants in a homogeneous variation series obeys the three-sigma rule.

If, in a system of rectangular coordinates, the values of the quantitative characteristic (variants) are plotted on the abscissa axis and the frequency of occurrence of each variant in the variation series on the ordinate axis, then variants with larger and smaller values are located evenly on either side of the arithmetic mean.



It has been established that with a normal distribution of the trait:

68.3% of the variant values are within M ± 1s

95.5% of the variant values are within M ± 2s

99.7% of the variant values are within M ± 3s
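These three shares can be verified with the standard normal distribution from Python's statistics module (the exact two-sigma share is 95.45%, which the text rounds to 95.5%):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def share_within(k):
    # P(M - k*s < X < M + k*s) for a normally distributed characteristic
    return Z.cdf(k) - Z.cdf(-k)

print(round(share_within(1) * 100, 1))  # 68.3
print(round(share_within(2) * 100, 1))  # 95.4
print(round(share_within(3) * 100, 1))  # 99.7
```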

3. The standard deviation allows normal values for clinical and biological parameters to be established. In medicine, the interval M ± 1s is usually taken as the normal range for the phenomenon being studied. A deviation of the estimated value from the arithmetic mean by more than 1s indicates a deviation of the studied parameter from the norm.

4. In medicine, the three-sigma rule is used in pediatrics for the individual assessment of children's physical development (the sigma deviation method) and for developing standards for children's clothing.

5. The standard deviation is necessary to characterize the degree of diversity of the characteristic being studied and to calculate the error of the arithmetic mean.

The value of the standard deviation is usually used to compare the variability of series of the same type. If two series with different characteristics are compared (height and weight, average duration of hospital treatment and hospital mortality, etc.), then a direct comparison of sigma values is impossible, because the standard deviation is a named value expressed in absolute numbers. In these cases, the coefficient of variation (Cv) is used, which is a relative value: the percentage ratio of the standard deviation to the arithmetic mean.

The coefficient of variation is calculated using the formula:

Cv = (σ / M) · 100%

The higher the coefficient of variation, the greater the variability of the series. It is believed that a coefficient of variation of more than 30% indicates the qualitative heterogeneity of the population.
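A small Python illustration of why the coefficient of variation, unlike sigma, lets us compare series with different units (the height and weight data are made up):

```python
def coefficient_of_variation(values):
    # Cv = sigma / mean * 100%, using the population sigma
    m = sum(values) / len(values)
    sigma = (sum((x - m) ** 2 for x in values) / len(values)) ** 0.5
    return sigma / m * 100

heights_cm = [170, 175, 180, 165, 172]  # hypothetical heights, cm
weights_kg = [60, 75, 80, 55, 70]       # hypothetical weights, kg

# sigmas are in cm and kg and cannot be compared directly,
# but Cv is a dimensionless percentage
print(round(coefficient_of_variation(heights_cm), 1))  # 2.9
print(round(coefficient_of_variation(weights_kg), 1))  # 13.6
```

Here weight varies far more than height relative to its own mean, which a raw comparison of sigmas in different units could not show.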

It is worth noting that this calculation of variance has a drawback: it is biased, i.e. its mathematical expectation is not equal to the true value of the variance. At the same time, not everything is so bad: as the sample size increases, it approaches its theoretical analogue, i.e. it is asymptotically unbiased. Therefore, when working with large sample sizes, the formula above can be used.

It is useful to translate the language of signs into the language of words. It turns out that the variance is the average square of the deviations. That is, the average value is first calculated, then the difference between each original and average value is taken, squared, added, and then divided by the number of values ​​in the population. The difference between an individual value and the average reflects the measure of deviation. It is squared so that all deviations become exclusively positive numbers and to avoid mutual destruction of positive and negative deviations when summing them up. Then, given the squared deviations, we simply calculate the arithmetic mean. Average - square - deviations. The deviations are squared and the average is calculated. The solution lies in just three words.

However, in its pure form, like the arithmetic mean or an index, dispersion is not used. It is rather an auxiliary and intermediate indicator needed for other types of statistical analysis. It does not even have a normal unit of measurement: judging by the formula, it is the square of the unit of measurement of the original data, which is hard to make sense of on its own.


In order to return the variance to reality, that is, to use it for more mundane purposes, the square root is extracted from it. The result is the so-called standard deviation (RMS), also called the "root mean square deviation" or "sigma" (after the Greek letter σ). The standard deviation formula is:

σ = √( Σ (xᵢ - x̄)² / n )

To obtain this indicator for a sample, use the formula:

s = √( Σ (xᵢ - x̄)² / (n - 1) )

As with variance, there is a slightly different calculation option. But as the sample grows, the difference disappears.

The standard deviation, obviously, also characterizes the measure of data dispersion, but now (unlike the variance) it can be compared with the original data, since they have the same units of measurement (this is clear from the calculation formula). But even this indicator in its pure form is not very informative, since it involves too many intermediate calculations (deviation, square, sum, average, root). However, it is already possible to work directly with the standard deviation, because the properties of this indicator are well studied and known. For example, there is the three-sigma rule, which states that normally distributed data has 997 values out of 1000 within ±3 sigma of the arithmetic mean. The standard deviation, as a measure of uncertainty, is also involved in many statistical calculations. With its help, the degree of accuracy of various estimates and forecasts is determined. If the variation is very large, the standard deviation will also be large, and the forecast will therefore be inaccurate, which will be expressed, for example, in very wide confidence intervals.

Coefficient of variation

The standard deviation gives an absolute estimate of the measure of dispersion. Therefore, to understand how large the spread is relative to the values themselves (i.e., regardless of their scale), a relative indicator is required. This indicator is called the coefficient of variation and is calculated using the following formula:

V = (σ / x̄) · 100%

The coefficient of variation is measured as a percentage (if multiplied by 100%). Using this indicator, you can compare a variety of phenomena, regardless of their scale and units of measurement. This fact is what makes the coefficient of variation so popular.

In statistics, it is accepted that if the value of the coefficient of variation is less than 33%, then the population is considered homogeneous; if it is more than 33%, then it is heterogeneous. It's difficult for me to comment on anything here. I don’t know who defined this and why, but it is considered an axiom.

I feel that I am getting carried away with dry theory and need to bring in something visual and figurative. On the other hand, all variation indicators describe approximately the same thing, only they are calculated differently, so it is difficult to show off a variety of examples: only the values of the indicators can differ, not their essence. So let's compare how the values of different variation indicators differ for the same set of data, taking the example of calculating the average linear deviation. Here are the source data:

And a chart as a reminder.

Using these data, we calculate various indicators of variation.

The average value is the usual arithmetic mean:

x̄ = Σ xᵢ / n

The range of variation is the difference between the maximum and the minimum:

R = x_max - x_min

The average linear deviation is calculated using the formula:

d̄ = Σ |xᵢ - x̄| / n

The standard deviation:

σ = √( Σ (xᵢ - x̄)² / n )

Let's summarize the calculation in a table.

As can be seen, the average linear deviation and the standard deviation give similar values for the degree of data variation. The variance is sigma squared, so it will always be a comparatively large number which, by itself, does not mean much. The range of variation is the difference between the extreme values and reflects only those extremes.

Let's summarize some results.

Variation of an indicator reflects the variability of a process or phenomenon. Its degree can be measured using several indicators.

1. Range of variation - the difference between the maximum and minimum. Reflects the range of possible values.
2. Average linear deviation – reflects the average of the absolute (modulo) deviations of all values of the analyzed population from their average value.
3. Dispersion - the average square of deviations.
4. Standard deviation is the root of the dispersion (the mean square of deviations).
5. The coefficient of variation is the most universal indicator, reflecting the degree of scattering of values, regardless of their scale and units of measurement. The coefficient of variation is measured as a percentage and can be used to compare the variation of different processes and phenomena.
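All five indicators from this summary can be computed in one small Python function (the data are illustrative):

```python
import math

def variation_indicators(values):
    """The five variation indicators from the summary above
    (population formulas, dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    std_dev = math.sqrt(variance)
    return {
        "range": max(values) - min(values),
        "mean_linear_deviation": sum(abs(x - mean) for x in values) / n,
        "variance": variance,
        "std_dev": std_dev,
        "coefficient_of_variation_pct": std_dev / mean * 100,
    }

ind = variation_indicators([12, 15, 11, 18, 14])  # illustrative data, mean = 14
for name, value in ind.items():
    print(name, round(value, 2))
```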

Thus, in statistical analysis there is a system of indicators that reflect the homogeneity of phenomena and the stability of processes. Often the variation indicators have no independent meaning and are used for further data analysis (for example, the calculation of confidence intervals).


