How to build a variation series. Statistical summary and grouping

Laboratory work No. 1. Primary processing of statistical data

Construction of distribution series

A distribution series is an ordered arrangement of the units of a population into groups according to a single characteristic. The characteristic can be quantitative, in which case the series is called a variation series, or qualitative, in which case it is called an attribute series. For example, the population of a city can be distributed by age group in a variation series, or by occupation in an attribute series. (Many other qualitative and quantitative characteristics could be used to construct distribution series; the choice of characteristic is determined by the task of the statistical study.)

Any distribution series is characterized by two elements:

- variant ($x_i$): an individual value of the characteristic among the units of the sample population. In a variation series the variants are numerical; in an attribute series they are qualitative (for example, x = “civil servant”);

- frequency ($n_i$): a number showing how many times a particular value of the characteristic occurs. If the frequency is expressed as a proportion of the total size of the population (the share of units having the given variant), it is called the relative frequency.

The variation series can be:

- discrete, when the characteristic being studied takes isolated values (usually integers);

- interval, when the boundaries “from” and “to” are defined for a continuously varying characteristic. An interval series is also constructed if the set of values of a discretely varied characteristic is large.

An interval series can be constructed with intervals of equal length (an equal-interval series) or with unequal intervals, if the conditions of the statistical study require it. For example, a series of income distribution might use the following intervals: less than 5 thousand rubles, 5-10 thousand rubles, 10-20 thousand rubles, 20-50 thousand rubles, and so on. If the purpose of the study does not determine how the interval series should be constructed, an equal-interval series is built, with the number of intervals given by Sturges' formula:

$k = 1 + 3.322 \lg n$

where k is the number of intervals and n is the sample size. (The formula usually gives a fractional number, and the nearest integer is taken as the number of intervals.) The length of the interval is then determined by the formula

$d = \dfrac{x_{\max} - x_{\min}}{k}$

Graphically, a variation series can be presented as a histogram (above each interval of an interval series, a column is drawn whose height corresponds to the frequency in that interval), a distribution polygon (a broken line connecting the points ($x_i$; $n_i$)), or a cumulate (built on accumulated frequencies: for each value of the characteristic, the number of units in the population whose value is less than the given one).
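As a rough illustration of the relative and accumulated frequencies mentioned above, the following Python sketch uses a small made-up discrete series. The variants and frequencies are assumptions for illustration, not data from this work, and the accumulated frequency is taken here as the count of values up to and including each variant, which is one common convention.

```python
from itertools import accumulate

# Hypothetical discrete series: variants x_i and their frequencies n_i.
variants = [1, 2, 3, 4]
frequencies = [2, 5, 2, 1]

n = sum(frequencies)                          # sample size
relative = [f / n for f in frequencies]       # relative frequencies, summing to 1
cumulative = list(accumulate(frequencies))    # accumulated frequencies for the cumulate

for x, f, w, c in zip(variants, frequencies, relative, cumulative):
    print(x, f, w, c)
```

The last accumulated frequency always equals the sample size, which is a convenient check.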

When working in Excel, the following functions can be used to construct variation series:

COUNT(data array) – determines the sample size. The argument is the range of cells containing the sample data.

COUNTIF(range; criterion) – can be used to construct an attribute or variation series. The arguments are the range holding the sample values of the characteristic and the criterion: a numeric or text value of the characteristic, or a reference to the cell containing it. The result is the frequency of that value in the sample.

FREQUENCY(data array; array of intervals) – for constructing a variation series. The arguments are the range of sample data and the column of intervals. For a discrete series, the variant values are listed there; for an interval series, the upper boundaries of the intervals (also called “pockets”). Since the result is a column of frequencies, the function must be entered as an array formula by pressing CTRL+SHIFT+ENTER. Note that the last value in the array of intervals can be omitted: all values that did not fall into the preceding “pockets” are placed in one additional final “pocket”. This can help avoid the error in which the largest sample value does not land in the last pocket.

In addition, complex groupings (based on several characteristics) can be built with the PivotTable tool. Pivot tables can also be used to construct attribute and variation series, but this complicates the task unnecessarily. To build a variation series and a histogram, there is also the “Histogram” procedure in the “Analysis ToolPak” add-in (add-ins must be enabled in Excel before use; they are not loaded by default).
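The counting rule behind FREQUENCY can be sketched in Python. This is only an approximation of the behaviour described above, not Excel's implementation; the function name `frequency` and the sample values are made up for illustration.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per pocket, following the rule described for FREQUENCY.

    bins holds ascending upper boundaries; a value v goes to the first
    pocket whose upper boundary is >= v. One extra pocket at the end
    collects everything above the last boundary.
    """
    counts = [0] * (len(bins) + 1)
    for v in data:
        counts[bisect_left(bins, v)] += 1
    return counts

sample = [1, 2, 2, 3, 6, 7]   # made-up sample
pockets = [2, 4, 6]           # made-up upper boundaries
print(frequency(sample, pockets))
```

The result has one more element than the list of boundaries, which mirrors the extra pocket for values above the last boundary.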

Let us illustrate the process of primary data processing with the following examples.

Example 1.1. There is data on the quantitative composition of 60 families.

Construct a variation series and distribution polygon

Solution.

Let's open Excel and enter the data array into the range A1:L5. If you are reading this document in electronic form (in Word format, for example), simply select the table with the data and copy it to the clipboard, then select cell A1 and paste the data; it will automatically occupy the appropriate range. Let's calculate the sample size n, the number of sample values; to do this, enter the formula =COUNT(A1:L5) in cell B7. Note that to enter the desired range into a formula, it is not necessary to type it from the keyboard; it is enough to select it with the mouse. Let's determine the minimum and maximum values in the sample by entering the formula =MIN(A1:L5) in cell B8 and =MAX(A1:L5) in cell B9.

Fig.1.1 Example 1. Primary processing of statistical data in Excel tables

Next, we prepare a table for the variation series by entering headings for the column of intervals (variant values) and the column of frequencies. In the interval column, enter the values of the characteristic from the minimum (1) to the maximum (6), occupying the range B12:B17. Select the frequency column, enter the formula =FREQUENCY(A1:L5,B12:B17) and press the key combination CTRL+SHIFT+ENTER.

Fig. 1.2 Example 1. Construction of a variation series

To check the result, let's calculate the sum of the frequencies using the SUM function (the Σ icon in the “Editing” group on the “Home” tab); the sum should match the sample size calculated earlier in cell B7.

Now let's build the polygon: with the resulting frequency range selected, choose the “Line” chart command on the “Insert” tab. By default, the values on the horizontal axis are ordinal numbers, in our case from 1 to 6, which coincide with the variant values (the family sizes).

The chart series name “Series 1” can either be changed using the “Select Data” option on the “Design” tab, or simply deleted.

Fig.1.3. Example 1. Construction of a frequency polygon

Example 1.2. There are data on emissions of pollutants from 50 sources:

10,4	18,6	10,3	26,0	45,0	18,2	17,3	19,2	25,8	18,7
28,2	25,2	18,4	17,5	41,8	14,6	10,0	37,8	10,5	16,0
18,1	16,8	38,5	37,7	17,9	29,0	10,1	28,0	12,0	14,0
14,2	20,8	13,5	42,4	15,5	17,9	19,	10,8	12,1	12,4
12,9	12,6	16,8	19,7	18,3	36,8	15,0	37,0	13,0	19,5

Compose an equal-interval series, build a histogram

Solution

Let's enter the data array into an Excel sheet; it will occupy the range A1:J5. As in the previous example, we determine the sample size n and the minimum and maximum values in the sample. Since we now need an interval series rather than a discrete one, and the number of intervals is not specified in the problem, we calculate the number of intervals k using Sturges' formula. To do this, enter the formula =1+3.322*LOG10(B7) in cell B10.

Fig.1.4. Example 2. Construction of an equal-interval series

The resulting value is not an integer, it is approximately 6.64. Since with k=7 the length of the intervals will be expressed as an integer (unlike the case of k=6), we choose k=7 by entering this value in cell C10. We calculate the length of the interval d in cell B11 by entering the formula =(B9-B8)/C10.

Let's define the array of intervals by specifying the upper boundary of each of the 7 intervals. In cell E8 we calculate the upper boundary of the first interval by entering the formula =B8+B11; in cell E9, the upper boundary of the second interval by entering the formula =E8+B11. To calculate the remaining upper boundaries, we fix the reference to cell B11 in the formula using the $ sign, so that the formula in cell E9 becomes =E8+B$11, and copy the contents of cell E9 to cells E10-E14. The last value obtained equals the maximum value in the sample calculated earlier in cell B9.
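The same calculation can be sketched outside Excel. The Python fragment below is an illustration only: the function name `sturges` is made up, and the sample size, minimum and maximum are the values read from the data of Example 1.2.

```python
import math

def sturges(n):
    """Number of intervals by Sturges' formula: k = 1 + 3.322 * lg(n)."""
    return 1 + 3.322 * math.log10(n)

n, x_min, x_max = 50, 10.0, 45.0   # sample size, minimum and maximum in Example 1.2
k_raw = sturges(n)                 # fractional value, about 6.64
k = 7                              # chosen in the text so that the interval length is an integer
d = (x_max - x_min) / k            # interval length
upper_bounds = [x_min + d * (i + 1) for i in range(k)]

print(round(k_raw, 2), d, upper_bounds)
```

As in the worksheet, the last upper boundary coincides with the sample maximum.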

Fig.1.5. Example 2. Construction of an equal-interval series

Now let’s fill the array of “pockets” using the FREQUENCY function, as was done in example 1.

Fig.1.6. Example 2. Construction of an equal-interval series

Using the resulting variation series, we construct a histogram: select the frequency column and choose “Histogram” on the “Insert” tab. Once the histogram appears, change its horizontal axis labels to the values from the interval range: choose the “Select Data” option on the “Design” tab, then in the window that appears choose “Edit” in the “Horizontal Axis Labels” section and specify the range of variant values by selecting it with the mouse.

Fig.1.7. Example 2. Constructing a histogram

Fig.1.8. Example 2. Constructing a histogram

A discrete variation series is constructed for discrete characteristics.

To construct a discrete variation series, perform the following steps:

1) arrange the units of observation in ascending order of the studied value of the characteristic;

2) determine all possible values of the characteristic $x_i$ and arrange them in ascending order;

3) count how many times each value occurs, that is, find the frequency $f_i$ of each value of the characteristic;

4) write the resulting data into a table of two rows (columns), $x_i$ and $f_i$.

The sum of all frequencies of a series is equal to the number of elements in the population being studied.

Example 1 .

List of grades received by students in exams: 3; 4; 3; 5; 4; 2; 2; 4; 4; 3; 5; 2; 4; 5; 4; 3; 4; 3; 3; 4; 4; 2; 2; 5; 5; 4; 5; 2; 3; 4; 4; 3; 4; 5; 2; 5; 5; 4; 3; 3; 4; 2; 4; 4; 5; 4; 3; 5; 3; 5; 4; 4; 5; 4; 4; 5; 4; 5; 5; 5.

Here the grade X is a discrete random variable, and the resulting list of grades is the statistical (observed) data.

1) arrange the observation units in ascending order of the studied characteristic value:

2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5.

2) determine all possible values of the characteristic $x_i$ and arrange them in ascending order:

In this example, all grades fall into four groups with the following values: 2; 3; 4; 5.

The value of the random variable corresponding to a particular group of observed data is called a value of the characteristic, or variant, and is denoted $x_i$.

3) count how many times each value occurs:

The number that shows how many times a given value of the characteristic occurs in a series of observations is called the frequency of that value and is denoted $f_i$.

In our example:

grade 2 occurs 8 times,

grade 3 occurs 12 times,

grade 4 occurs 23 times,

grade 5 occurs 17 times.

There are 60 grades in total.

4) write the resulting data into a table of two rows (columns), $x_i$ and $f_i$.

Based on these data, a discrete variation series can be constructed.

A discrete variation series is a table that lists the occurring values of the characteristic being studied as individual values in ascending order, together with their frequencies.
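The steps of this example can be retraced with a few lines of Python. This is a minimal sketch using the list of grades given above; the variable names are illustrative.

```python
from collections import Counter

# Grades from Example 1, in the order they were recorded.
grades = [3, 4, 3, 5, 4, 2, 2, 4, 4, 3, 5, 2, 4, 5, 4, 3, 4, 3, 3, 4,
          4, 2, 2, 5, 5, 4, 5, 2, 3, 4, 4, 3, 4, 5, 2, 5, 5, 4, 3, 3,
          4, 2, 4, 4, 5, 4, 3, 5, 3, 5, 4, 4, 5, 4, 4, 5, 4, 5, 5, 5]

counts = Counter(grades)          # frequency f_i of each value x_i
series = sorted(counts.items())   # [(x_i, f_i), ...] in ascending order of x_i

for x, f in series:
    print(x, f)
print("total:", sum(counts.values()))
```

Sorting the pairs by value gives the two columns of the discrete variation series, and the total serves as a check against the number of observations.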

Construction of an interval variation series

In addition to the discrete variation series, data are often grouped into an interval variation series.

An interval series is constructed if:

the characteristic varies continuously;

there are many discrete values (more than 10);

the frequencies of the discrete values are very small (no more than 1-3 with a relatively large number of observation units);

many discrete values of the characteristic have the same frequency.

An interval variation series is a way of grouping data in the form of a table that has two columns (the values of the characteristic in the form of an interval of values and the frequency of each interval).

Unlike a discrete series, the values of the characteristic of an interval series are represented not by individual values, but by an interval of values (“from - to”).

The number that shows how many observation units fall into each interval is called the frequency of the interval and is denoted $f_i$. The sum of all frequencies of a series is equal to the number of elements (units of observation) in the population being studied.

If a unit has a characteristic value equal to the upper limit of the interval, then it should be assigned to the next interval.

For example, a child with a height of 100 cm will fall into the 2nd interval, and not into the first; and a child with a height of 130 cm will fall into the last interval, and not into the third.
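The boundary rule can be expressed compactly in code. In the sketch below the boundaries 100 and 130 come from the example in the text, while 110 is an assumed intermediate boundary introduced only for illustration; the function name is likewise hypothetical.

```python
from bisect import bisect_right

# Interior boundaries between intervals, in cm. The values 100 and 130 come
# from the example in the text; 110 is an assumed boundary for illustration.
boundaries = [100, 110, 130]

def interval_number(value, bounds):
    """1-based interval number; a value equal to a boundary goes to the next interval."""
    return bisect_right(bounds, value) + 1

for height in (99, 100, 129, 130):
    print(height, interval_number(height, boundaries))
```

Using `bisect_right` rather than `bisect_left` is what sends a value lying exactly on a boundary into the following interval.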

Based on these data, an interval variation series can be constructed.

Each interval has a lower boundary ($x_{\text{lower}}$), an upper boundary ($x_{\text{upper}}$) and an interval width (i).

The interval boundary is the value of the attribute that lies on the border of two intervals.

Table: children's height (cm) and number of children; the last interval is “more than 130”.

If an interval has both an upper and a lower boundary, it is called a closed interval. If it has only a lower or only an upper boundary, it is an open interval. Only the very first or the very last interval can be open. In the example above, the last interval is open.

Interval width (i) is the difference between the upper and lower boundaries:

$i = x_{\text{upper}} - x_{\text{lower}}$

The width of an open interval is assumed to be the same as the width of the adjacent closed interval.

children's height (cm)	number of children	Interval width (i)
more than 130 (for calculations, 130+20=150)		20 (because the width of the adjacent closed interval is 20)

All interval series are divided into series with equal intervals and series with unequal intervals. In an interval series with equal intervals, all intervals have the same width. In an interval series with unequal intervals, the widths differ.

The example under consideration is an interval series with unequal intervals.

Condition:

There is data on the age composition of workers (years): 18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27, 24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29.

1. Construct an interval distribution series.
2. Construct a graphical representation of the series.
3. Graphically determine the mode and median.

Solution:

1) According to Sturges' formula, the population should be divided into 1 + 3.322 lg 30 ≈ 6 groups.

Maximum age: 38; minimum: 18.

With 6 groups the interval width would be (38 - 18) / 6 ≈ 3.3. Since the ends of the intervals must be integers, we divide the population into 5 groups. Interval width: (38 - 18) / 5 = 4.

To make calculations easier, we arrange the data in ascending order: 18, 22, 22, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 31, 32, 32, 33, 34, 35, 38, 38.

Age distribution of workers
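The grouping can be recomputed from the listed ages with a short Python sketch. It assumes the convention stated earlier (a value equal to an upper boundary goes to the next interval) and, as a further assumption, keeps the maximum value in the last interval.

```python
# Ages of the 30 workers from the problem statement.
ages = [18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27,
        24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29]

k, width = 5, 4          # number of groups and interval width chosen in the solution
low = min(ages)
counts = [0] * k
for a in ages:
    # A value on a boundary goes to the next interval; the maximum stays in the last one.
    i = min((a - low) // width, k - 1)
    counts[i] += 1

for i, f in enumerate(counts):
    print(f"{low + i * width}-{low + (i + 1) * width}: {f}")
```

A different boundary convention would shift a few observations between neighbouring intervals, so the frequencies depend on the rule chosen.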

Graphically, a series can be depicted as a histogram or a polygon. A histogram is a bar chart: the base of each column is the width of the interval, and the height of the column equals the frequency.

A polygon (or distribution polygon) is a frequency graph. To build it from a histogram, we connect the midpoints of the upper sides of the rectangles. We close the polygon on the Ox axis at distances equal to half an interval from the extreme x values.
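The construction of the polygon vertices can be sketched as follows. The function name and the sample series are hypothetical, and the sketch assumes intervals of equal width.

```python
def polygon_points(lower, width, freqs):
    """Vertices of the distribution polygon for an equal-interval series.

    Interval midpoints carry the frequencies; two extra points with zero
    frequency, half an interval beyond the extreme boundaries, close the
    polygon on the x-axis.
    """
    upper = lower + width * len(freqs)
    mids = [lower + width * (i + 0.5) for i in range(len(freqs))]
    return [(lower - width / 2, 0)] + list(zip(mids, freqs)) + [(upper + width / 2, 0)]

# Hypothetical equal-interval series: lower boundary 0, width 10, three intervals.
print(polygon_points(0, 10, [3, 5, 2]))
```

The two closing points are the midpoints of the empty intervals that would adjoin the series on either side.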

The mode (Mo) is the value of the characteristic being studied that occurs most frequently in a given population.

To determine the mode from a histogram, select the highest rectangle, draw a line from its upper right vertex to the upper right vertex of the previous rectangle, and a line from its upper left vertex to the upper left vertex of the next rectangle. From the intersection of these lines, drop a perpendicular to the x-axis. Its abscissa is the mode: Mo ≈ 27.5. This means that the most common age in this population is 27-28 years.
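For intervals of equal width, the intersection point of the two lines can also be computed directly. The sketch below derives the abscissa from the geometry of the construction; the function name and the frequencies are hypothetical.

```python
def grouped_mode(lower, width, f_prev, f_mode, f_next):
    """Abscissa where the two diagonals of the graphical construction cross.

    lower is the lower boundary of the modal interval; f_prev, f_mode and
    f_next are the frequencies of the previous, modal and next intervals
    (all intervals are assumed to have the same width).
    """
    d_prev = f_mode - f_prev
    d_next = f_mode - f_next
    return lower + width * d_prev / (d_prev + d_next)

# Hypothetical frequencies 4, 10, 6 with the modal interval 20-30.
print(grouped_mode(20, 10, 4, 10, 6))
```

The mode is pulled toward whichever neighbouring interval has the higher frequency.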

The median (Me) is the value of the characteristic that lies in the middle of the ordered variation series.

We find the median using the cumulate. The cumulate is a graph of accumulated frequencies: the abscissas are the variants of the series, and the ordinates are the accumulated frequencies.

To determine the median from the cumulate, we find the point on the ordinate axis corresponding to 50% of the accumulated frequencies (in our case, 15), draw a straight line through it parallel to the Ox axis, and from the point where it intersects the cumulate drop a perpendicular to the x-axis. The abscissa of that point is the median. Me ≈ 28, since the 15th and 16th values of the ordered series are 28 and 29. This means that half of the workers in this population are under about 28 years of age.
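Reading the median off the cumulate amounts to linear interpolation inside the interval where the accumulated frequency reaches half of the total. A minimal sketch, with a hypothetical function name and made-up data:

```python
from itertools import accumulate

def grouped_median(bounds, freqs):
    """Abscissa at which the cumulate reaches half of the total frequency.

    bounds has len(freqs) + 1 ascending interval boundaries; the cumulate is
    treated as a broken line through (upper boundary, accumulated frequency).
    """
    half = sum(freqs) / 2
    cum = [0] + list(accumulate(freqs))
    for i, f in enumerate(freqs):
        if f > 0 and cum[i + 1] >= half:
            # Linear interpolation inside the interval that contains the median.
            return bounds[i] + (bounds[i + 1] - bounds[i]) * (half - cum[i]) / f
    raise ValueError("empty series")

# Hypothetical series: intervals 0-10, 10-20, 20-30 with frequencies 2, 6, 2.
print(grouped_median([0, 10, 20, 30], [2, 6, 2]))
```

The result is an estimate from grouped data and may differ slightly from the median of the raw observations.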

2. The concept of distribution series. Discrete and interval distribution series

Distribution series are groupings of a special type in which, for each characteristic, group of characteristics or class of characteristics, the number of units in the group or the share of this number in the total is known. That is, a distribution series is an ordered set of values of a characteristic, arranged in ascending or descending order, with their corresponding weights. Distribution series can be constructed on either quantitative or attribute characteristics.

Distribution series constructed on a quantitative characteristic are called variation series. They can be discrete or interval. A distribution series can be constructed on a continuously varying characteristic (one that can take any value within some interval) or on a discretely varying characteristic (one that takes strictly defined integer values).

A discrete variation series is a ranked set of variants with their corresponding frequencies or relative frequencies. The variants of a discrete series are values of a characteristic that changes discontinuously, usually the result of a count.

Discrete variation series are usually constructed if the values of the characteristic being studied can differ from each other by no less than a certain finite amount. In discrete series, point values of the characteristic are specified. Example: distribution of men's suits sold by stores per month, by size.

An interval variation series is an ordered set of intervals of variation of a random variable, with the corresponding frequencies or relative frequencies of values falling into each of them. Interval series are intended for analyzing the distribution of a continuously changing characteristic, whose value is most often recorded by measurement or weighing. The variants of such a series are groups of values.

Example: distribution of purchases in a grocery store by amount.

Whereas in a discrete variation series the frequency refers directly to a variant of the series, in an interval series it refers to a group of variants.

It is convenient to analyze distribution series using their graphical representation, which makes it possible to judge the shape and regularities of the distribution. A discrete series is depicted on a graph as a broken line, the distribution polygon. To construct it, the ranked (ordered) values of the varying characteristic are plotted on the abscissa axis of a rectangular coordinate system at a uniform scale, and the frequencies are plotted on the ordinate axis.

Interval series are depicted as distribution histograms (that is, bar charts).

When constructing a histogram, the intervals are plotted on the abscissa axis, and the frequencies are depicted by rectangles built on the corresponding intervals. With equal intervals, the heights of the columns should be proportional to the frequencies.

Any histogram can be converted into a distribution polygon by connecting the midpoints of the upper sides of its rectangles with straight segments.

2. Index method for analyzing the influence of average output and average headcount on changes in production volume

The index method is used to analyze dynamics and to compare aggregate indicators, as well as the factors influencing changes in the levels of these indicators. Using indices, it is possible to identify the influence of average output and average headcount on changes in production volume. This problem is solved by constructing a system of analytical indices.

The production volume index is related to the average headcount index and the average output index in the same way as production volume (Q) is related to output (w) and headcount (r).

Production volume equals the product of average output and average headcount:

Q = w r, where Q is the volume of production,

w is the average output,

r is the average number of employees.

As you can see, this is a relationship between phenomena in statics: the product of two factors gives the total volume of the resulting phenomenon. This relationship is clearly functional, so its dynamics are studied using indices. For the example given, this is the following system:

$J_w \times J_r = J_{wr}$.

For example, the production volume index $J_{wr}$, as the index of a resulting phenomenon, can be decomposed into two factor indices: the average output index ($J_w$) and the average headcount index ($J_r$):

Index of production volume = Index of average output × Index of average headcount

where $J_w$ is the labor productivity index, calculated using the Laspeyres formula;

$J_r$ is the index of the number of employees, calculated using the Paasche formula.

Index systems are used to determine the influence of individual factors on the formation of the level of a performance indicator; they allow the value of an unknown to be determined from 2 known index values.

Based on the above system of indices, one can also find the absolute increase in production volume, decomposed into the influence of factors.

1. Total increase in production volume:

$\Delta_{wr} = \sum w_1 r_1 - \sum w_0 r_0$.

2. Increase due to the change in average output:

$\Delta_{wr/w} = \sum w_1 r_1 - \sum w_0 r_1$.

3. Increase due to the change in average headcount:

$\Delta_{wr/r} = \sum w_0 r_1 - \sum w_0 r_0$.

$\Delta_{wr} = \Delta_{wr/w} + \Delta_{wr/r}$.

Example. The following data is known

We can determine how production volume has changed in relative and absolute terms and how individual factors influenced this change.

The volume of production was:

in the base period

$w_0 r_0 = 2000 \times 90 = 180000$,

and in the reporting period

$w_1 r_1 = 2100 \times 100 = 210000$.

Consequently, the volume of production increased by 30,000, or by 16.7%:

$\Delta_{wr} = \sum w_1 r_1 - \sum w_0 r_0 = 210000 - 180000 = 30000$,

or (210000 : 180000) × 100% = 116.7%.

This change in production volume was due to:

1) an increase in the average headcount by 10 people, to 111.1% of the base level:

$r_1 / r_0 = 100 / 90 = 1.111$, or 111.1%.

In absolute terms, this factor increased the volume of production by 20,000:

$w_0 r_1 - w_0 r_0 = w_0 (r_1 - r_0) = 2000 \times (100 - 90) = 20000$.

2) an increase in average output to 105% of the base level, which added 10,000:

$w_1 r_1 / (w_0 r_1) = (2100 \times 100) / (2000 \times 100) = 1.05$, or 105%.

In absolute terms, the increase is:

$w_1 r_1 - w_0 r_1 = (w_1 - w_0) r_1 = (2100 - 2000) \times 100 = 10000$.

Hence, the combined influence of the factors was:

1. In absolute terms

10000 + 20000 = 30000

2. In relative terms

1.111 × 1.05 = 1.167 (116.7%)

Therefore, the increase is 16.7%. Both results agree with those obtained earlier.
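The arithmetic of this example can be rechecked with a short Python sketch. The figures are those used in the example; the variable names are illustrative.

```python
w0, r0 = 2000, 90    # base period: average output, average headcount
w1, r1 = 2100, 100   # reporting period

J_w = (w1 * r1) / (w0 * r1)     # average output index
J_r = (w0 * r1) / (w0 * r0)     # average headcount index
J_wr = (w1 * r1) / (w0 * r0)    # production volume index

delta_total = w1 * r1 - w0 * r0
delta_w = w1 * r1 - w0 * r1     # increase due to average output
delta_r = w0 * r1 - w0 * r0     # increase due to average headcount

print(round(J_w, 3), round(J_r, 3), round(J_wr, 3))
print(delta_w, delta_r, delta_total)
```

The product of the two factor indices reproduces the production volume index, and the two absolute increments add up to the total increment.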

The word “index” in translation means pointer, indicator. In statistics, an index is interpreted as a relative indicator that characterizes a change in a phenomenon in time, space, or compared to a plan. Since the index is a relative value, the names of the indices are consonant with the names of the relative values.

In cases where we analyze changes over time in compared products, we can raise the question of how the components of the index (price, physical volume, structure of production or sales of individual types of products) change under different conditions (in different areas). In this regard, indices of constant composition, variable composition, and structural changes are constructed.

An index of constant (fixed) composition is an index that characterizes the dynamics of an average value for one and the same fixed structure of the population.

The principle of constructing an index of constant composition is to eliminate the impact of changes in the structure of weights on the indexed value by calculating the weighted average level of the indexed indicator with the same weights.

The constant composition index is identical in form to the aggregate index. The aggregate form is the most common.

The index of constant composition is calculated with weights fixed at the level of one period and shows the change in the indexed value only. Indices of constant composition compare indicators calculated on the basis of a constant structure of phenomena.
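As a minimal sketch of this principle, the index can be computed as the ratio of two weighted averages that share the same weights. The function name and all numbers below are hypothetical and serve only to illustrate the calculation.

```python
def fixed_composition_index(x0, x1, weights):
    """Ratio of weighted averages of the indexed value with the same weights."""
    total = sum(weights)
    mean1 = sum(x * f for x, f in zip(x1, weights)) / total
    mean0 = sum(x * f for x, f in zip(x0, weights)) / total
    return mean1 / mean0

# Hypothetical data: two groups, indexed value in the base and reporting periods.
x0 = [10, 20]
x1 = [11, 22]
weights = [3, 1]   # structure fixed at the level of one period
print(fixed_composition_index(x0, x1, weights))
```

Because the weights are identical in the numerator and the denominator, any change in the result reflects only the change in the indexed value, not in the structure.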

When processing large amounts of information, which is especially relevant in modern scientific work, the researcher faces the serious task of grouping the source data correctly. If the data are discrete, then, as we have seen, no problems arise: one only needs to count the frequency of each value. If the characteristic under study is continuous (which is more common in practice), then choosing the optimal number of grouping intervals is by no means a trivial task.

To group continuous random variables, the entire range of variation of the characteristic is divided into a certain number of intervals k.

A grouped interval (continuous) variation series consists of intervals ranked by the value of the characteristic, each listed together with its frequency $m_i$ (the number of observations falling into the i-th interval) or its relative frequency:

Characteristic value intervals
Frequency $m_i$

The histogram and the cumulate (ogive), already discussed in detail, are an excellent means of data visualization, giving a first idea of the structure of the data. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, taking into account only that continuous data completely fill the region of their possible values, taking any value within it.

Fig. 1.15.

That is why the columns of the histogram and of the cumulate must touch each other and leave no regions within the range of possible values into which values of the characteristic cannot fall (i.e., the histogram and the cumulate should not have “holes” along the abscissa axis that contain no values of the variable being studied, as in Fig. 1.16). The height of a column corresponds to the frequency (the number of observations falling within a given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and are usually of equal width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, considered in the course on probability theory. That is why constructing them is so important in the primary statistical processing of quantitative continuous data: their shape suggests the hypothetical distribution law.

The cumulate is the curve of accumulated frequencies (or accumulated relative frequencies) of an interval variation series. It is compared with the graph of the cumulative distribution function F(x), also discussed in the probability theory course.

In essence, the concepts of histogram and cumulate are associated specifically with continuous data and their interval variation series, since these graphs are empirical estimates of the probability density function and the distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. This task is perhaps the most difficult, important and controversial part of the problem.

The number of intervals should not be too small, because the histogram then becomes too smooth (oversmoothed) and loses all the features of variability in the original data: Fig. 1.17 shows the same data as the graphs in Fig. 1.15, used to construct a histogram with a smaller number of intervals (left graph).

At the same time, the number of intervals should not be too large, otherwise we cannot estimate the distribution density of the data along the numerical axis: the histogram will be undersmoothed, with empty intervals and an uneven outline (see Fig. 1.17, right graph).
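The effect of the number of intervals can be tried out on a small sample. The sketch below reuses the 30 worker ages listed earlier in this text (not the data behind the figures) and counts frequencies for several values of k; the helper name is hypothetical.

```python
# Worker ages from the earlier example, used here only as a convenient sample.
ages = [18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27,
        24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29]

def histogram_counts(data, k):
    """Frequencies for k equal-width intervals spanning [min, max]."""
    low, high = min(data), max(data)
    width = (high - low) / k
    counts = [0] * k
    for v in data:
        i = min(int((v - low) / width), k - 1)   # the maximum goes to the last interval
        counts[i] += 1
    return counts

for k in (2, 5, 20):
    counts = histogram_counts(ages, k)
    print(k, counts, "empty intervals:", counts.count(0))
```

With very few intervals the shape of the distribution is hidden, while with many intervals some of them stay empty, which is the pattern described above.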

Fig. 1.17.

How to determine the most preferable number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the original set of values of the characteristic should be divided. This formula has become extremely popular: most statistics textbooks offer it, and many statistical packages use it by default. Whether this is justified in all cases is a serious question.

So, what is the Sturges formula based on?

Consider the binomial distribution)
