Skewness and kurtosis of distribution. Calculating skewness and kurtosis of an empirical distribution in Excel

When analyzing variation series, the shift from the center and the steepness of the distribution are characterized by special indicators. Empirical distributions are, as a rule, shifted from the center of the distribution to the right or left, and are asymmetric. The normal distribution is strictly symmetrical about the arithmetic mean, which follows from the evenness of its density function.

Skewness of distribution arises due to the fact that some factors act more strongly in one direction than in another, or the process of development of the phenomenon is such that some cause dominates. In addition, the nature of some phenomena is such that there is an asymmetrical distribution.

The simplest measures of asymmetry are the differences between the arithmetic mean and the mode, and between the arithmetic mean and the median: (x̄ – Mo) and (x̄ – Me).

To determine the direction and magnitude of the shift (asymmetry) of the distribution, the asymmetry coefficient is calculated, which is the normalized moment of the third order:

As = μ3/σ3, where μ3 is the third-order central moment and σ3 is the standard deviation cubed; μ3 = (m3 – 3m1m2 + 2m1^3)k^3.

With left-sided asymmetry the asymmetry coefficient is negative (As < 0); with right-sided asymmetry it is positive (As > 0).

If the peak of the distribution is shifted to the left and the right branch turns out to be longer than the left, the asymmetry is right-sided; otherwise it is left-sided.

The relationship between the mode, median and arithmetic mean in symmetric and asymmetric series makes it possible to use a simpler indicator as a measure of asymmetry, Pearson's asymmetry coefficient:

Ka = (x̄ – Mo)/σ. If Ka > 0, the asymmetry is right-sided; if Ka < 0, the asymmetry is left-sided; at Ka = 0 the series is considered symmetric.
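As a quick sketch (hypothetical data; Python's standard statistics module), Pearson's coefficient can be computed directly from the definition above:

```python
from statistics import mean, mode, pstdev

def pearson_skew(data):
    """Pearson's asymmetry coefficient: Ka = (mean - mode) / sigma."""
    return (mean(data) - mode(data)) / pstdev(data)

# A small right-skewed series: the mode (2) lies to the left of the mean.
sample = [2, 2, 2, 3, 3, 4, 5, 7, 9]
ka = pearson_skew(sample)
print(ka > 0)  # True -> right-sided asymmetry
```

For a symmetric series the mean coincides with the mode and the coefficient is zero.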

Asymmetry can be determined more accurately using the third-order central moment:

As = μ3/σ3, where μ3 = (m3 – 3m1m2 + 2m1^3)k^3.

If |As| > 0.5, the asymmetry can be considered significant; if |As| < 0.25, the asymmetry can be considered insignificant.

To characterize the degree of deviation of a symmetric distribution from the normal distribution along the ordinate, an indicator of peakedness (the steepness of the distribution) is used, called kurtosis (excess):

Ex = (μ4/σ4) – 3, where μ4 is the fourth-order central moment.

For a normal distribution, Ex = 0, i.e. μ4/σ4 = 3; μ4 = (m4 – 4m3m1 + 6m2m1^2 – 3m1^4)k^4.

High-peak curves have a positive kurtosis, while low-peak curves have a negative kurtosis (Fig. D.2).
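The moment formulas above can be sketched in Python (illustrative data; μ3/σ3 and μ4/σ4 – 3 are computed directly from the raw values rather than via conditional moments):

```python
from statistics import mean

def central_moment(data, k):
    """k-th central moment: average of (x - mean)^k."""
    m = mean(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """As = mu3 / sigma^3."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Ex = mu4 / sigma^4 - 3 (zero for the normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

print(skewness([1, 2, 3, 4, 5]))         # 0.0 for a symmetric series
print(excess_kurtosis([1, 2, 3, 4, 5]))  # negative: flatter than normal
```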

Indicators of kurtosis and skewness are necessary in statistical analysis to determine the heterogeneity of the population, the asymmetry of the distribution, and the proximity of the empirical distribution to the normal law. With significant deviations of the asymmetry and kurtosis indicators from zero, the population cannot be considered homogeneous and the distribution close to normal. Comparison of actual curves with theoretical ones allows one to mathematically substantiate the obtained statistical results, establish the type and nature of the distribution of socio-economic phenomena, and predict the likelihood of the occurrence of the events being studied.

4.7. Justification of the closeness of the empirical (actual) distribution to the theoretical normal distribution. Normal distribution (Gauss-Laplace law) and its characteristics. "The Three Sigma Rule." Goodness-of-fit criteria (using the example of the Pearson or Kolmogorov criterion).

One can notice a certain regularity in the change of frequencies and values of the varying characteristic: as the value of the attribute increases, the frequencies first grow and then, after reaching a certain maximum, decrease. Such regular changes of frequencies in variation series are called distribution patterns.

To identify a distribution pattern, it is necessary that the variation series contain a sufficiently large number of units, and that the series themselves represent qualitatively homogeneous populations.

A distribution polygon constructed from actual data is an empirical (actual) distribution curve; it reflects not only objective (general) but also subjective (random) distribution conditions that are not characteristic of the phenomenon being studied.

In practical work, the distribution law is found by comparing the empirical distribution with one of the theoretical ones and assessing the degree of difference or correspondence between them. A theoretical distribution curve reflects, in its pure form and without the influence of random factors, the general pattern of the frequency distribution (distribution density) as a function of the values of the varying characteristic.

Various types of theoretical distributions are common in statistics: normal, binomial, Poisson, etc. Each of the theoretical distributions has its own specifics and scope.

The normal distribution law is characteristic of the distribution of equally probable events arising from the interaction of many random factors. The law of normal distribution underlies statistical methods for estimating distribution parameters, the representativeness of sample observations, and the measurement of relationships between mass phenomena. To check how well an actual distribution corresponds to the normal one, the frequencies of the actual distribution are compared with the theoretical frequencies characteristic of the normal law. These frequencies are a function of the normalized deviations. Therefore, based on the data of the empirical distribution series, the normalized deviations t are calculated, and then the corresponding theoretical frequencies are determined. In this way the empirical distribution is smoothed, i.e. aligned with the normal curve.

The normal distribution, or the Gauss-Laplace law, is described by the equation

y_t = (1/(σ√(2π))) · e^(–(x – x̄)^2/(2σ^2)),

where y_t is the ordinate of the normal distribution curve, i.e. the frequency (probability) of the value x, and x̄ is the mathematical expectation (average value) of the individual values of x. If the deviations (x – x̄) are measured (expressed) in terms of the standard deviation σ, i.e. as standardized (normalized) deviations t = (x – x̄)/σ, the formula takes the form:

y_t = (1/(σ√(2π))) · e^(–t^2/2).

The normal distribution of socio-economic phenomena in its pure form is rare; however, if the homogeneity of the population is maintained, the actual distributions are often close to normal. The pattern of distribution of the studied quantities is revealed by checking the conformity of the empirical distribution to the theoretical normal law. To do this, the actual distribution is fitted with the normal curve and goodness-of-fit criteria are calculated.

The normal distribution is characterized by two essential parameters that determine the center of grouping of the individual values and the shape of the curve: the arithmetic mean x̄ and the standard deviation σ. Normal distribution curves differ in the position of the distribution center x̄ on the x-axis and in the spread σ around this center (Fig. 4.1 and 4.2). A feature of the normal curve is its symmetry about the center of the distribution: on both sides of its middle, two uniformly decreasing branches are formed that asymptotically approach the abscissa axis. Therefore, in a normal distribution the mean, mode and median coincide: x̄ = Mo = Me.


The normal distribution curve has two inflection points (transitions from convexity to concavity) at t = ±1, i.e. where the variants deviate from the average by (x – x̄) = ±σ. Within x̄ ± σ a normal distribution contains 68.3% of the observations (frequencies of the distribution series), within x̄ ± 2σ – 95.4%, and within x̄ ± 3σ – 99.7%. In practice there are almost no deviations exceeding 3σ; therefore this relationship is called the "three sigma rule".
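The three-sigma shares can be verified with the standard normal CDF, expressed through the error function from Python's math module (a sketch, no external libraries):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

for k in (1, 2, 3):
    share = norm_cdf(k) - norm_cdf(-k)
    print(f"within +/-{k} sigma: {share:.1%}")
# prints 68.3%, 95.4%, 99.7%
```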

To calculate the theoretical frequencies, the formula

f′ = (n·h/σ) · φ(t)

is used, where n is the number of observations and h is the interval width. The magnitude

φ(t) = (1/√(2π)) · e^(–t^2/2)

is a function of t, the density of the standard normal distribution, which is determined from a special table, excerpts from which are given in Table 4.2.
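A sketch of the theoretical-frequency calculation for a grouped series (hypothetical midpoints and frequencies; φ(t) is computed directly instead of being looked up in the table, and the formula f′ = (n·h/σ)·φ(t) is the reconstruction given above):

```python
from math import exp, pi, sqrt

def phi(t):
    """Standard normal density: phi(t) = exp(-t^2/2) / sqrt(2*pi)."""
    return exp(-t * t / 2.0) / sqrt(2.0 * pi)

def theoretical_frequencies(midpoints, freqs, h):
    """f' = (n*h/sigma) * phi(t) for each interval midpoint."""
    n = sum(freqs)
    xbar = sum(x * f for x, f in zip(midpoints, freqs)) / n
    sigma = sqrt(sum(f * (x - xbar) ** 2 for x, f in zip(midpoints, freqs)) / n)
    return [n * h / sigma * phi((x - xbar) / sigma) for x in midpoints]

# Hypothetical grouped series, interval width h = 10.
f_theor = theoretical_frequencies([10, 20, 30, 40, 50], [5, 20, 50, 20, 5], 10)
print([round(f, 1) for f in f_theor])
```

The theoretical frequencies sum to approximately n, so they can be compared interval by interval with the empirical ones.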

Values of the normal distribution density φ(t) (Table 4.2)

The graph in Fig. 4.3 clearly demonstrates the closeness of the empirical (2) and normal (1) distributions.

Fig. 4.3. Distribution of postal service branches by number of workers: 1 – normal; 2 – empirical

To substantiate mathematically the closeness of the empirical distribution to the law of normal distribution, goodness-of-fit criteria are calculated.

The Kolmogorov criterion is a goodness-of-fit criterion that makes it possible to assess the degree of closeness of the empirical distribution to the normal one. A. N. Kolmogorov proposed using the maximum difference between the accumulated frequencies (or relative frequencies) of the two series to determine the correspondence between the empirical and theoretical normal distributions. To test the hypothesis that the empirical distribution follows the normal law, the goodness-of-fit criterion

λ = D/√n

is calculated, where D is the maximum difference between the cumulative (accumulated) empirical and theoretical frequencies and n is the number of units in the population. Using a special table, P(λ) is determined: the probability that, if the characteristic is distributed according to the normal law, the maximum discrepancy between the empirical and theoretical accumulated frequencies due to random causes alone would be no less than the one actually observed. Based on the value of P(λ), conclusions are drawn: if P(λ) is sufficiently large, the hypothesis that the actual distribution corresponds to the normal law can be considered confirmed; if P(λ) is small, the null hypothesis is rejected and the discrepancies between the actual and theoretical distributions are considered significant.
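A sketch of the Kolmogorov statistic for raw (ungrouped) data, assuming the normal parameters are estimated from the sample; P(λ) is approximated by the standard Kolmogorov limit series:

```python
from math import erf, exp, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def kolmogorov_lambda(data):
    """lambda = D * sqrt(n), with D the maximum gap between the empirical and
    theoretical cumulative relative frequencies (equivalent to D/sqrt(n) when
    D is measured in absolute accumulated frequencies)."""
    n = len(data)
    xbar = sum(data) / n
    sigma = sqrt(sum((x - xbar) ** 2 for x in data) / n)
    xs = sorted(data)
    d = max(abs((i + 1) / n - norm_cdf((x - xbar) / sigma))
            for i, x in enumerate(xs))
    return d * sqrt(n)

def p_lambda(lam, terms=100):
    """P(lambda): probability of at least this large a discrepancy by chance."""
    return 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                     for k in range(1, terms + 1))
```

For example, a small roughly bell-shaped sample gives a small λ and hence a large P(λ), so the normality hypothesis is not rejected.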

Probability values P(λ) for the goodness-of-fit criterion λ (Table 4.3)

The Pearson criterion χ2 ("chi-square") is a goodness-of-fit criterion that makes it possible to assess the degree of closeness of the empirical distribution to the normal one:

χ2 = Σ (f_i – f′_i)^2 / f′_i,

where f_i and f′_i are the frequencies of the empirical and theoretical distributions in a given interval. The greater the difference between the observed and theoretical frequencies, the greater χ2. To distinguish significant differences between the frequencies of the empirical and theoretical distributions from differences due to the randomness of the sample, the calculated value χ2_calc is compared with the tabulated χ2_tab for the appropriate number of degrees of freedom and a given significance level α. The significance level is chosen so that P(χ2_calc > χ2_tab) = α. The number of degrees of freedom is h – l, where h is the number of groups and l is the number of conditions that must be satisfied when calculating the theoretical frequencies. To calculate the theoretical frequencies of the normal curve, three parameters (n, x̄, σ) must be known, so the number of degrees of freedom is h – 3. If χ2_calc > χ2_tab, i.e. χ2 falls into the critical region, the discrepancy between the empirical and theoretical frequencies is significant and cannot be explained by random fluctuations of the sample data; in this case the null hypothesis is rejected. If χ2_calc ≤ χ2_tab, i.e. the calculated criterion does not exceed the maximum frequency divergence that can arise by chance, the hypothesis that the distributions correspond is accepted. The Pearson criterion is effective with a considerable number of observations (n ≥ 50); the frequencies of all intervals should number at least five units (with fewer, intervals are combined), and the number of intervals (groups) should be large (h > 5), since the estimate of χ2 depends on the number of degrees of freedom.
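The χ2 computation itself is one line; the frequencies below are hypothetical:

```python
def chi_square(observed, theoretical):
    """chi2 = sum over intervals of (f_i - f'_i)^2 / f'_i."""
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))

# Hypothetical empirical vs. theoretical frequencies for h = 5 intervals.
f_emp = [6, 22, 48, 19, 5]
f_theor = [5.0, 20.0, 50.0, 20.0, 5.0]
chi2 = chi_square(f_emp, f_theor)
print(round(chi2, 2))  # 0.53; compare with the table value for h - 3 = 2 d.o.f.
```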

The Romanovsky criterion is a goodness-of-fit criterion that makes it possible to assess the degree of closeness of the empirical distribution to the normal one. V. I. Romanovsky proposed evaluating this closeness with the ratio:

K_R = (χ2 – r)/√(2r), where r = h – 3 is the number of degrees of freedom and h is the number of groups.

If this ratio exceeds 3 in absolute value, the discrepancy between the frequencies of the empirical and normal distributions cannot be considered random and the hypothesis of a normal distribution law should be rejected. If it does not exceed 3, the hypothesis that the data distribution is normal can be accepted.
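A sketch of the Romanovsky ratio (hypothetical χ2 value; assumes r = h – 3 degrees of freedom as reconstructed above):

```python
from math import sqrt

def romanovsky(chi2, h):
    """K_R = (chi2 - r) / sqrt(2*r), with r = h - 3 degrees of freedom."""
    r = h - 3
    return (chi2 - r) / sqrt(2 * r)

kr = romanovsky(0.53, 5)   # hypothetical chi2 = 0.53 with h = 5 groups
print(abs(kr) <= 3)        # True -> the normality hypothesis may be accepted
```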

To obtain an approximate idea of the shape of the distribution of a random variable, a graph of its distribution series (polygon or histogram), distribution function or density is plotted. In the practice of statistical research one encounters very different distributions. Homogeneous populations are characterized, as a rule, by single-peaked (unimodal) distributions. A multi-peaked distribution indicates heterogeneity of the population being studied; in this case the data must be regrouped in order to identify more homogeneous groups.

Determining the general nature of the distribution of a random variable involves assessing the degree of its homogeneity, as well as calculating the indicators of asymmetry and kurtosis. In a symmetric distribution, in which the mathematical expectation equals the median (M = Me), it can be considered that there is no asymmetry. The more noticeable the asymmetry, the greater the deviation between the characteristics of the distribution center, the mathematical expectation and the median.

The simplest coefficient of asymmetry of the distribution of a random variable is As = (M – Me)/σ, where M is the mathematical expectation, Me is the median, and σ is the standard deviation of the random variable.

If As > 0, the asymmetry is right-sided; if As < 0, it is left-sided. If |As| < 0.25, the asymmetry is considered low; if 0.25 ≤ |As| ≤ 0.5, medium; if |As| > 0.5, high. A geometric illustration of right- and left-sided asymmetry is shown in the figure below, which presents density plots of the corresponding types of continuous random variables.

Figure. Illustration of right- and left-sided asymmetry in density plots of distributions of continuous random variables.

There is another coefficient of asymmetry of the distribution of a random variable. It can be proven that a non-zero central moment of odd order indicates asymmetry in the distribution of the random variable. In the previous indicator we used an expression similar to a first-order moment; here the third-order central moment μ3 is used, and to make the coefficient dimensionless it is divided by the cube of the standard deviation. The resulting asymmetry coefficient is As = μ3/σ^3. As with the first coefficient, As > 0 indicates right-sided asymmetry and As < 0 left-sided.

Kurtosis of a random variable

The kurtosis of the distribution of a random variable characterizes the degree of concentration of its values near the center of the distribution: the higher the concentration, the higher and narrower the density graph. The kurtosis (peakedness) indicator is calculated by the formula Ex = μ4/σ^4 – 3, where μ4 is the fourth-order central moment and σ^4 is the standard deviation raised to the fourth power. Since the powers of the numerator and denominator are the same, kurtosis is a dimensionless quantity. The normal distribution is taken as the standard of zero kurtosis, and it can be proven that for the normal distribution μ4/σ^4 = 3. This is why the number 3 is subtracted from the ratio in the kurtosis formula.

Thus, for a normal distribution the kurtosis is zero: Ex = 0. If the kurtosis is greater than zero (Ex > 0), the distribution is more peaked than normal; if it is less than zero (Ex < 0), the distribution is less peaked than normal. The limiting value of negative kurtosis is –2, while positive kurtosis can be infinitely large. What the density graphs of peaked and flat-topped distributions look like in comparison with the normal distribution is shown in the figure.

Figure. Illustration of peaked and flat-topped density distributions of random variables compared to the normal distribution.

The asymmetry and kurtosis of the distribution of a random variable show how much it deviates from the normal law. For large asymmetries and kurtosis, calculation formulas for normal distribution should not be used. The level of admissibility of asymmetry and kurtosis for the use of normal distribution formulas in the analysis of data for a specific random variable should be determined by the researcher based on his knowledge and experience.

Skewness and kurtosis of the distribution of a random variable.

090309-matmetody.txt

Characteristics of asymmetry.

The main measure of asymmetry is the asymmetry coefficient: the degree to which the frequency distribution graph deviates from a symmetric form relative to the mean value. It is denoted by the letter A with the subscript s and calculated by the formula (Fig. 8). The asymmetry coefficient ranges from minus infinity to plus infinity. Asymmetry is right-sided (positive) when the coefficient is greater than zero (As > 0) and left-sided (negative) when As < 0. With right-sided asymmetry, values below the arithmetic mean occur more often; with left-sided asymmetry, values exceeding the arithmetic mean occur more often. For symmetric distributions the asymmetry coefficient is zero, and the mode, median and arithmetic mean coincide.

Characteristics of kurtosis.

Kurtosis is characterized by the kurtosis coefficient (peakedness), calculated by the formula Ex = μ4/σ^4 – 3.

A peaked distribution is characterized by positive kurtosis, a flat-topped distribution by negative kurtosis, and a medium-peaked (mesokurtic) distribution has zero kurtosis.

Normality can also be checked graphically (Q–Q plots, P–P plots): the quantiles (or cumulative probabilities) of the empirical distribution are plotted against those of the theoretical normal distribution, and departures from a straight line indicate non-normality. Here N is the sample size.
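A minimal sketch of the Q–Q idea (no plotting library; statistics.NormalDist supplies the inverse normal CDF): for normally distributed data the resulting pairs fall close to the line y = x.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(data):
    """(theoretical quantile, empirical quantile) pairs for a normal Q-Q plot,
    using plotting positions (i + 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    nd = NormalDist(mean(xs), pstdev(xs))
    return [(nd.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

pairs = qq_points([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(pairs[4])  # the median pairs with the distribution mean: (5.0, 5)
```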

    Properties of normal distribution of a random variable.

090309-matmetody.txt

Normal distribution.

A normal distribution is characterized by the fact that extreme values of the characteristic are relatively rare, while values close to the arithmetic mean are relatively common. The normal curve has a bell shape. It is a unimodal distribution whose median, mode and arithmetic mean coincide; its skewness and kurtosis coefficients lie in the range from zero to two (acceptable) but ideally equal zero.

Since the second half of the 19th century, measurement and computational methods in psychology have been developed on the basis of the following principle: if the individual variability of a certain property is a consequence of the action of many causes, then the frequency distribution of the whole variety of manifestations of this property in the general population corresponds to the normal distribution curve. This is the law of normal distribution.

The law of normal distribution has a number of very important consequences, which we will refer to more than once. Now we note that if, when studying a certain property, we measured it on a sample of subjects and obtained a distribution that differed from the normal one, this means that either the sample is not representative of the general population, or the measurements were not made on a scale of equal intervals.

Each psychological (or, more broadly, biological) property corresponds to its own distribution in the general population. Most often it is normal and is characterized by its parameters: the mean (M) and the standard deviation (σ). Only these two values distinguish from each other the infinite set of normal curves of identical shape given by equation (5.1). The mean specifies the position of the curve on the number axis and acts as a kind of initial standard measurement value. The standard deviation specifies the width of the curve, depends on the units of measurement, and acts as the measurement scale (Fig. 5.3).

Figure 5.3. A family of normal curves: the 1st distribution differs from the 2nd in standard deviation (σ1 < σ2), the 2nd from the 3rd in arithmetic mean (M2 < M3)

The entire variety of normal distributions can be reduced to a single curve if the z-transformation (formula 4.8) is applied to all possible measurements of the properties. Then each property will have a mean of 0 and a standard deviation of 1. In Fig. 5.4 the normal distribution graph is plotted for M = 0 and σ = 1. This is the unit normal distribution, which is used as the standard. Let us consider its important properties.
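The z-transformation itself is a one-line sketch (illustrative data):

```python
from statistics import mean, pstdev

def z_scores(data):
    """z = (x - M) / sigma: the standardized values have M = 0 and sigma = 1."""
    m, s = mean(data), pstdev(data)
    return [(x - m) / s for x in data]

zs = z_scores([10, 12, 14, 16, 18])
print(round(mean(zs), 10), round(pstdev(zs), 10))  # 0.0 1.0
```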

    The unit of measurement for a unit normal distribution is the standard deviation.

    The curve approaches the Z axis at the edges asymptotically - never touching it.

    The curve is symmetrical about M=0. Its asymmetry and kurtosis are zero.

    The curve has a characteristic bend: the inflection point lies exactly at a distance of one σ from M.

    The area between the curve and the Z axis is 1.

The last property explains the name "unit normal distribution" and is extremely important. Thanks to it, the area under the curve is interpreted as a probability, or relative frequency. Indeed, the entire area under the curve corresponds to the probability that the characteristic will take any value in its whole range of variability (from –∞ to +∞). The area under the unit normal curve to the left or right of the zero point is 0.5; this corresponds to the fact that half of the general population has a characteristic value greater than 0 and half less than 0. The relative frequency of occurrence in the general population of characteristic values in the range from z1 to z2 is equal to the area under the curve lying between the corresponding points. Note again that any normal distribution can be reduced to the unit normal distribution by the z-transformation.

So, the most important common property of different normal distribution curves is the same proportion of the area under the curve between the same two values ​​of the attribute, expressed in units of standard deviation.

It is useful to remember that for any normal distribution there are the following correspondences between ranges of values and the area under the curve: M ± σ contains about 68.26% of the area, M ± 2σ about 95.44%, and M ± 3σ about 99.72%.

The unit normal distribution establishes a clear relationship between the standard deviation and the relative number of cases in the population for any normal distribution. For example, knowing its properties, we can answer the following questions. What proportion of the general population has a property expression from –1σ to +1σ? Or what is the probability that a randomly selected representative of the general population will have a property expression exceeding the average by more than three standard deviations? In the first case the answer is 68.26% of the entire population, since between –1 and +1 lies 0.6826 of the area of the unit normal distribution. In the second case the answer is (100 – 99.72)/2 = 0.14%.

There is a special table that allows one to determine the area under the curve to the right of any positive z (Appendix 1). Using it, one can determine the probability of occurrence of attribute values in any range; this is widely used in interpreting test data.
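The table lookup can be reproduced with the unit normal CDF from the standard library (a sketch):

```python
from statistics import NormalDist

def area_right_of(z):
    """P(Z > z): area under the unit normal curve to the right of z."""
    return 1.0 - NormalDist().cdf(z)

print(round(area_right_of(1.0), 4))  # 0.1587 -> about 15.87% of cases lie above z = 1
```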

Despite the initial postulate that properties in the population have a normal distribution, actual data obtained from a sample are rarely normally distributed. Moreover, many methods have been developed that make it possible to analyze data without any assumption about the nature of their distribution, both in the sample and in the population. These circumstances sometimes lead to the false belief that the normal distribution is an empty mathematical abstraction that has no relation to psychology. However, as we will see later, there are at least three important aspects of the application of the normal distribution:

    Development of test scales.

    Checking the normality of the sample distribution in order to decide on what scale the attribute is measured: metric or ordinal.

    Statistical testing of hypotheses, in particular when determining the risk of making a wrong decision.

    Standard normal distribution. Standardization of distributions.

(For the entire question No. 12 + about standardization, see below)

091208-matmetody.txt

Standardization of psychodiagnostic methods (more on this in question No. 17)

    Population and sample.

091208-matmetody.txt

General populations.

Any psychodiagnostic technique is intended for examining a certain large category of individuals. This set is called the population.

To determine the degree of expression of a particular property in one specific person, one needs to know how this property is distributed across the entire general population. It is almost impossible to survey the whole general population, so a sample is drawn from it: a representative part of the general population. Representativeness is the main requirement for a sample. It cannot be satisfied absolutely exactly; one can only approach the ideal using certain methods, the main ones being (1) randomization and (2) modeling.

1) Random sampling assumes that subjects will be included in it at random. Measures are being taken to ensure that no patterns emerge.

2) When modeling, first those properties that can affect the test results are selected. Usually these are demographic characteristics, within which gradations are distinguished: age intervals, levels of education, etc. Based on these data, a matrix model of the general population is constructed.

Typically, methods are standardized on a sample of 200 to 800 people.

Standardization of psychodiagnostic methods is the procedure for obtaining a scale that allows you to compare an individual test result with the results of a large group.

Research usually begins with some assumption that requires verification using facts. This assumption - a hypothesis - is formulated in relation to the connection of phenomena or properties in a certain set of objects.

To test such assumptions against facts, it is necessary to measure the corresponding properties of their bearers. But it is impossible to measure anxiety in all women and men, just as it is impossible to measure aggressiveness in all adolescents. Therefore, when conducting research, it is limited to only a relatively small group of representatives of the relevant populations of people.

Population is the entire set of objects in relation to which a research hypothesis is formulated.

In the first example, such general populations are all men and all women. In the second - all teenagers who watch television programs containing scenes of violence. The general populations in relation to which the researcher is going to draw conclusions based on the results of the study may be more modest in size.

Thus, the general population, although not an infinite number of people, is, as a rule, a set of potential subjects inaccessible to exhaustive study.

Sample is a group of objects limited in number (in psychology: subjects, respondents) specially selected from the general population in order to study its properties. Accordingly, studying the properties of a general population using a sample is called a sampling study. Almost all psychological studies are sample-based, and their conclusions extend to general populations.

Thus, after a hypothesis has been formulated and the corresponding general populations identified, the researcher faces the problem of organizing a sample. The sample must be such that generalization of the conclusions of the sample study is justified, i.e. their extension to the general population. The main criteria for the validity of research findings are the representativeness of the sample and the statistical reliability of the (empirical) results.

Representativeness of the sample is its ability to represent the phenomena under study sufficiently fully from the point of view of their variability in the general population.

Of course, only the general population can give a complete picture of the phenomenon being studied, in all its range and nuances of variability. Therefore, representativeness is always limited to the extent that the sample is limited. And it is the representativeness of the sample that is the main criterion in determining the boundaries of generalization of research findings. Nevertheless, there are techniques that make it possible to obtain a representative sample that is sufficient for the researcher. (Question #15 is a continuation of this question)

    Basic methods of sampling.

P. 13 (20) (Question #14 is a prelude to this question)

The first and main technique is simple random (randomized) selection. It involves ensuring conditions under which each member of the general population has equal chances with the others of being included in the sample. Random selection ensures that a variety of representatives of the general population can enter the sample; special measures are taken to prevent any pattern from emerging during selection. This allows us to hope that, in the end, the property being studied will be represented in the sample, if not in all, then in its maximum possible diversity.

The second way to ensure representativeness is stratified random selection, or selection based on the properties of the population. It involves a preliminary determination of those qualities that can influence the variability of the property being studied (this could be gender, level of income or education, etc.). Then the percentage ratio of the number of groups (strata) differing in these qualities in the general population is determined and an identical percentage ratio of the corresponding groups in the sample is ensured. Next, subjects are selected into each subgroup of the sample according to the principle of simple random selection.
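A sketch of stratified random selection (hypothetical population records; the share of each stratum in the population is preserved in the sample, and selection within a stratum is simple random):

```python
import random

def stratified_sample(population, key, total_n, seed=0):
    """Draw a sample whose strata shares match those in the population."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = round(total_n * len(members) / len(population))
        sample.extend(rng.sample(members, k))  # simple random selection inside the stratum
    return sample

# Hypothetical population: 60% women, 40% men.
pop = [("f", i) for i in range(60)] + [("m", i) for i in range(40)]
s = stratified_sample(pop, lambda p: p[0], total_n=10)
print(sum(1 for g, _ in s if g == "f"), sum(1 for g, _ in s if g == "m"))  # 6 4
```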

Statistical reliability, or statistical significance, of the results of a study is determined using the methods of statistical inference. We will consider these methods in detail in the second part of this book. For now, we only note that they impose certain requirements on the number of subjects, i.e. the sample size.

Unfortunately, there are no strict guidelines for pre-determining the required sample size. Moreover, the researcher usually receives the answer to the question about the necessary and sufficient number too late - only after analyzing the data of an already surveyed sample. However, the most general recommendations can be formulated:

□ The largest sample size is required when developing a diagnostic technique: from 200 to 1000-2500 people.

□ If two samples are to be compared, their total size should be at least 50 people; the sizes of the compared samples should be approximately equal.

□ If the relationship between any properties is being studied, the sample size should be at least 30-35 people.

□ The greater the variability of the property being studied, the larger the sample size should be. Variability can therefore be reduced by increasing the homogeneity of the sample, for example by gender, age, etc.; naturally, this reduces the possibility of generalizing the conclusions.

Dependent and independent samples. A common research situation is when a property of interest is studied on two or more samples for the purpose of further comparison. These samples may stand in different relations to each other depending on how they are organized. Independent samples are characterized by the fact that the probability of selection of any subject of one sample does not depend on the selection of any of the subjects of the other sample. Conversely, dependent samples are characterized by the fact that each subject of one sample is matched, by a certain criterion, with a subject of the other sample.

In general, dependent samples involve pairwise selection of subjects into compared samples, and independent samples imply an independent selection of subjects.

It should be noted that cases of “partially dependent” (or “partially independent”) samples are unacceptable: this unpredictably violates their representativeness.

In conclusion, we note that two paradigms of psychological research can be distinguished. The so-called R-methodology involves studying the variability of a certain (psychological) property under the influence of some influence, factor, or other property; here the sample is a set of subjects. The other approach, Q-methodology, involves studying the variability of a subject (individual) under the influence of various stimuli (conditions, situations, etc.); it corresponds to the situation when the sample is a set of stimuli.

    Checking the sample for anomalous values.

To test normality, various procedures are used to determine whether the sampling distribution of the measured variable differs from normal. The need for such a comparison arises when we doubt what scale the attribute is represented on - ordinal or metric. And such doubts arise very often, since we, as a rule, do not know in advance on what scale it will be possible to measure the property being studied (excluding, of course, cases of clearly nominative measurement).

The importance of determining the scale on which a trait is measured is hard to overestimate, for at least two reasons: first, the completeness with which the initial empirical information (in particular, about individual differences) is taken into account, and second, the availability of many data analysis methods. If the researcher decides to measure on an ordinal scale, the inevitable subsequent ranking leads to the loss of part of the original information about the differences between subjects, the groups studied, the relationships between characteristics, etc. In addition, metric data allow the use of a significantly wider range of analysis methods and, as a result, make the research conclusions deeper and more meaningful.

The most compelling argument in favor of a characteristic being measured on a metric scale is the correspondence of the sample distribution to the normal one. This is a consequence of the law of normal distribution: if the sample distribution does not differ from normal, this means that the measured property has been reflected on a metric scale (usually an interval scale).

There are many different ways to test for normality, of which we will briefly describe only a few, assuming that the reader will perform these tests using computer programs.

Graphical method (Q-Q plots, P-P plots). Either quantile plots or plots of accumulated frequencies are built. Quantile plots (Q-Q plots) are constructed as follows. First, the empirical values of the characteristic being studied are determined that correspond to the 5th, 10th, ..., 95th percentiles. Theoretical z-scores for each of these percentiles are then determined from the normal distribution table. The two resulting series of numbers specify the coordinates of the points on the plot: the empirical values of the attribute are plotted on the abscissa axis, and the corresponding theoretical values on the ordinate axis. For a normal distribution, all points lie on or near the same straight line; the greater the distance from the points to the straight line, the less the distribution corresponds to normal. Plots of accumulated frequencies (P-P plots) are built in a similar way. Values of accumulated relative frequencies are plotted at equal intervals, for example 0.05, 0.1, ..., 0.95. Next, the empirical values of the characteristic corresponding to each value of the accumulated frequency are determined and converted into z-scores. From the normal distribution table, the theoretical accumulated frequencies (the area under the curve) are determined for each of the calculated z-scores and plotted on the ordinate axis. If the distribution corresponds to normal, the points obtained on the plot lie on the same straight line.
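The quantile comparison described above can be sketched programmatically (a minimal illustration assuming NumPy and SciPy are available; the simulated sample and the 0.98 threshold are arbitrary):

```python
# Q-Q check: for normally distributed data the empirical percentiles and
# the theoretical z-scores are almost perfectly linearly related.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=500)   # simulated metric scores

probs = np.arange(0.05, 1.0, 0.05)            # 5th, 10th, ..., 95th percentile
emp_q = np.percentile(x, probs * 100)         # empirical quantiles (abscissa)
theo_z = stats.norm.ppf(probs)                # theoretical z-scores (ordinate)

# For a normal sample the points (emp_q, theo_z) lie near one straight line,
# so their linear correlation is close to 1.
r = np.corrcoef(emp_q, theo_z)[0, 1]
print(round(r, 3))
```

For a markedly skewed sample the correlation between the empirical quantiles and the theoretical z-scores drops noticeably below 1, which is the numeric counterpart of the points straying from the straight line.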

Criteria for skewness and kurtosis. These criteria determine the permissible degree of deviation of the empirical values of skewness and kurtosis from the zero values that correspond to the normal distribution. The permissible degree of deviation is one that allows these statistics to be considered not significantly different from the normal parameters. The magnitude of the permissible deviations is determined by the so-called standard errors of skewness and kurtosis. For the skewness formula (4.10), the standard error in simplified form is

SE(As) = sqrt(6/N),

where N is the sample size.

Sample values of skewness and kurtosis do not differ significantly from zero if they do not exceed their standard errors in absolute value. This can be considered a sign that the sample distribution corresponds to the normal law. It should be noted that computer programs calculate skewness, kurtosis, and the corresponding standard errors using other, more complex formulas.
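A sketch of this check (assuming SciPy; the simplified formulas sqrt(6/N) for skewness and sqrt(24/N) for kurtosis are common approximations, not the exact formulas statistical packages use):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)               # simulated, roughly normal sample

A = stats.skew(x)                      # sample skewness
E = stats.kurtosis(x)                  # excess kurtosis (0 for the normal law)
n = len(x)
se_A = np.sqrt(6 / n)                  # simplified standard error of skewness
se_E = np.sqrt(24 / n)                 # simplified standard error of kurtosis

# "Does not differ significantly from normal" in the sense of the criterion:
normal_like = abs(A) <= se_A and abs(E) <= se_E
print(A, E, normal_like)
```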

The Kolmogorov-Smirnov statistical test of normality is considered the most suitable for determining the degree to which an empirical distribution corresponds to the normal one. It estimates the probability that a given sample belongs to a population with a normal distribution. If this probability p < 0.05, the empirical distribution differs significantly from normal; if p > 0.05, it is concluded that the empirical distribution approximately corresponds to the normal one.
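A sketch of such a check with SciPy's kstest (the samples are simulated; note that estimating the normal parameters from the same sample strictly calls for the Lilliefors correction, and the plain test is shown here only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(loc=10, scale=2, size=300)
skewed_sample = rng.exponential(scale=2, size=300)

def ks_normality_p(x):
    # Compare the sample with a normal distribution whose parameters are
    # estimated from the sample itself.
    return stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1))).pvalue

p_norm = ks_normality_p(normal_sample)   # expected: p > 0.05
p_skew = ks_normality_p(skewed_sample)   # expected: p < 0.05
print(p_norm > 0.05, p_skew < 0.05)
```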

Reasons for deviation from normality. The general reason for the deviation of the shape of the sample distribution of a characteristic from the normal form is most often a feature of the measurement procedure: the scale used may have uneven sensitivity to the measured property in different parts of the range of its variability.

EXAMPLE. Suppose the severity of a certain ability is determined by the number of tasks completed in the allotted time. If the tasks are simple or the time is too long, then this measurement procedure will have sufficient sensitivity only for the part of the subjects for whom these tasks are sufficiently difficult, and too large a proportion of subjects will solve all or almost all tasks. As a result, we will obtain a distribution with pronounced left-sided asymmetry (the peak shifted to the right, the longer branch to the left). It is, of course, possible to subsequently improve the quality of the measurement through empirical normalization, by adding more complex tasks or reducing the time allotted for the given set of tasks. If we overcomplicate the measurement procedure, the opposite situation will arise: most subjects will solve only a small number of tasks, and the empirical distribution will acquire a right-sided asymmetry.

Thus, deviations from the normal form, such as right- or left-sided asymmetry or too large kurtosis (greater than 0), are associated with the relatively low sensitivity of the measurement procedure in the mode region (the top of the frequency distribution graph).
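The ceiling effect from the example above can be illustrated by simulation (illustrative numbers; assuming NumPy and SciPy): clipping scores at an upper limit that many subjects reach produces a negative, i.e. left-sided, skew.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
ability = rng.normal(loc=20, scale=5, size=1000)  # hypothetical "true" ability
max_tasks = 24                                    # an easy test: many hit the ceiling
scores = np.clip(np.round(ability), 0, max_tasks)

# Piling up at the maximum leaves the longer branch on the left,
# so the skewness coefficient comes out negative.
print(stats.skew(scores))
```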

Consequences of deviation from normality. It should be noted that the task of obtaining an empirical distribution that strictly corresponds to the normal law is not often encountered in research practice. Typically, such cases are limited to the development of a new measurement procedure or test scale, when empirical or nonlinear normalization is used to "correct" the empirical distribution. In most cases, conformity or non-conformity with normality is a property of the measured characteristic, which the researcher must take into account when selecting statistical procedures for data analysis.

In general, if there is a significant deviation of the empirical distribution from the normal one, the assumption that the characteristic is measured on a metric scale should be abandoned. But the question remains open: what is the measure of the significance of this deviation? In addition, different data analysis methods have different sensitivity to deviations from normality. In justifying this point, the principle of R. Fisher, one of the "founding fathers" of modern statistics, is usually cited: "Deviations from normality of this type, unless they are very marked, can only be detected by large samples; in themselves they make little difference to the statistical criteria and other questions." For example, with the small samples typical of psychological research (up to 50 people), the Kolmogorov-Smirnov test is not sensitive enough to detect even deviations from normality that are quite noticeable "by eye". At the same time, some procedures for analyzing metric data tolerate deviations from the normal distribution (some to a greater extent, others to a lesser). In what follows we will, where necessary, stipulate how strict the normality requirement is.

    Basic rules for standardization of psychodiagnostic techniques.


Standardization of psychodiagnostic methods is the procedure for obtaining a scale that allows an individual test result to be compared with the results of a large group.

Test scales are developed in order to evaluate an individual test result by comparing it with the test norms obtained on a standardization sample. The standardization sample is formed specially for developing the test scale: it must be representative of the general population in which the test is planned to be used. Subsequently, when testing, it is assumed that both the person being tested and the standardization sample belong to the same general population.

The starting principle in developing a test scale is the assumption that the property being measured is distributed in the general population in accordance with the normal law. Accordingly, the measurement of this property on the test scale in the standardization sample should also produce a normal distribution. If this is the case, then the test scale is metric, more precisely, equal-interval. If it is not, then the property could be reflected, at best, on an order scale. Naturally, most standard test scales are metric, which makes it possible to interpret test results in more detail, taking into account the properties of the normal distribution, and to apply any methods of statistical analysis correctly. Thus, the main problem of test standardization is to develop a scale in which the distribution of test scores on the standardization sample corresponds to the normal distribution.

Initial test scores are the number of answers to certain test questions, the time taken or the number of problems solved, etc. They are also called primary, or "raw", scores. The result of standardization is test norms: a table for converting "raw" scores into standard test scores.

There are many standard test scales, the main purpose of which is to present individual test results in a form convenient for interpretation. Some of these scales are presented in Fig. 5.5. What they have in common is correspondence to the normal distribution; they differ only in two indicators: the mean value and the scale (the standard deviation σ), which determines the granularity of the scale.

The general sequence of standardization (the development of test norms, i.e. tables for converting "raw" scores into standard test scores) is as follows:

    the general population for which the methodology is being developed is determined, and a representative standardization sample is formed;

    based on the results of applying the primary version of the test, the distribution of "raw" scores is determined;

    the resulting distribution is checked for compliance with the normal law;

    if the distribution of "raw" scores corresponds to normal, linear standardization is carried out;

    if the distribution of "raw" scores does not correspond to normal, two options are possible:

    empirical normalization is carried out before linear standardization;

    nonlinear normalization is carried out.

The distribution of “raw” estimates is checked for compliance with the normal law using special criteria, which we will consider later in this chapter.

Linear standardization consists in determining the boundaries of the intervals of "raw" scores that correspond to the standard test scores. These boundaries are calculated by adding to (or subtracting from) the mean of the "raw" scores the fractions of the standard deviation that correspond to the test scale.
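For instance, for the sten scale (mean 5.5, standard deviation 2), linear standardization can be sketched as follows (a sketch assuming NumPy; the mean and SD of the "raw" scores passed in the example call are hypothetical):

```python
import numpy as np

def raw_to_sten(raw, mean, sd):
    # Linear standardization: sten = 5.5 + 2*z, rounded and truncated to 1..10.
    z = (np.asarray(raw, dtype=float) - mean) / sd
    sten = np.rint(5.5 + 2.0 * z)
    return np.clip(sten, 1, 10).astype(int)

# Hypothetical "raw" score distribution with mean 20 and SD 5:
print(raw_to_sten([5, 20, 35], mean=20, sd=5))
```

A score three standard deviations below the mean lands in sten 1 and one three standard deviations above in sten 10, exactly as adding and subtracting fractions of σ around the mean implies.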

Test norms are a table for converting "raw" scores into stens.

Using such a table of test norms, the individual result (the "raw" score) is converted into stens, which allows one to interpret the severity of the property being measured.

Empirical normalization is used when the distribution of "raw" scores differs from normal. It consists in changing the content of the test tasks. For example, if the "raw" score is the number of problems solved by the test takers in the allotted time, and a distribution with left-sided asymmetry is obtained, this means that too large a proportion of the test takers solve more than half of the tasks. In this case, it is necessary either to add more difficult tasks or to reduce the solution time.

Nonlinear normalization is used if empirical normalization is impossible or undesirable, for example from the point of view of time and resources. In this case, the conversion of "raw" scores into standard ones is carried out by finding the percentile boundaries of groups in the original distribution that correspond to the percentile boundaries of groups in the normal distribution of the standard scale. Each interval of the standard scale is associated with the interval of the "raw" score scale that contains the same proportion of the standardization sample. The values of the proportions are determined by the area under the unit normal curve enclosed between the z-scores corresponding to the given interval of the standard scale.

For example, in order to determine what "raw" score should correspond to the lower boundary of sten 10, you must first find out what z-value this boundary corresponds to (z = 2). Then, using the table of the normal distribution (Appendix 1), determine what proportion of the area under the normal curve lies to the right of this value (0.023). After that, determine which value cuts off the top 2.3% of the "raw" scores of the standardization sample. The value found will correspond to the boundary between the 9th and 10th stens.
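The same percentile logic can be sketched numerically (assuming NumPy; the skewed "raw" distribution here is simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
raw = rng.exponential(scale=10, size=1000)   # a skewed "raw" score distribution

# The boundary between sten 9 and sten 10 corresponds to z = 2, i.e. the top
# 2.3% of the normal curve, so we take the 97.7th percentile of the raw
# scores of the standardization sample.
boundary_9_10 = np.percentile(raw, 97.7)
share_above = np.mean(raw > boundary_9_10)   # should be close to 0.023
print(boundary_9_10, share_above)
```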

The stated fundamentals of psychodiagnostics allow us to formulate mathematically sound requirements for a test. The description of a test procedure must include:

    description of the standardization sample;

    characteristics of the distribution of "raw" scores, indicating the mean and standard deviation;

    name, characteristics of the standard scale;

    test norms - tables for converting “raw” scores into scale scores.

    Z-score scale.


The standardized (or standard) deviation is usually denoted by the letter z; the values obtained in this way are called z-scores.

A special place among normal distributions is occupied by the so-called standard, or unit, normal distribution. This distribution is obtained when the arithmetic mean is 0 and the standard deviation is 1. The standard normal distribution is convenient because any normal distribution can be reduced to it by standardization.

The standardization operation is as follows: the arithmetic mean is subtracted from each individual parameter value. This operation is called centering. And the resulting difference is divided by the standard deviation. This operation is called normalization.
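In code, centering and normalization look like this (a sketch with NumPy; the six values are arbitrary):

```python
import numpy as np

x = np.array([30.0, 28.0, 35.0, 32.0, 34.0, 39.0])

centered = x - x.mean()          # centering: subtract the arithmetic mean
z = centered / x.std()           # normalization: divide by the standard deviation

# After standardization the mean is 0 and the standard deviation is 1.
print(z.mean().round(10), z.std().round(10))
```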



Thus, if we subtract the mean from a particular subject's score and divide the difference by the standard deviation, we can express the individual score as a fraction of the standard deviation. The values obtained in this way are called z-scores. The z-score is the basis of any standard scale. The most attractive property of z-scores is that they characterize the relative position of the subject's result among all the results of the group, regardless of the mean and standard deviation. In addition, z-scores are unit-free. Thanks to these two properties, z-scores can be used to compare results obtained by a variety of methods and on a variety of aspects of the behavior sample.


    Scales derived from the Z-score scale.


The disadvantage of the z-score is that one has to deal with fractional and negative values. Therefore, it is usually converted into so-called standard scales, which are more convenient to use. Traditionally, the following scales are used in diagnostics more often than others:

Stanine scale
Sten scale
T-scale
IQ scale
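All of these scales are simple linear rescalings of z (a sketch; the means and standard deviations in the comments are the conventional values for each scale):

```python
def z_to_scales(z):
    # Conventional parameters of the standard scales:
    #   T-scale: M = 50, SD = 10;   IQ scale: M = 100, SD = 15;
    #   sten:    M = 5.5, SD = 2;   stanine:  M = 5,   SD = 2.
    return {
        'T':       50 + 10 * z,
        'IQ':      100 + 15 * z,
        'sten':    5.5 + 2 * z,
        'stanine': 5 + 2 * z,
    }

# A result one standard deviation above the mean (z = 1):
print(z_to_scales(1.0))
```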


Standardization of a psychological questionnaire

Normalization of testing indicators.

In order for a psychological questionnaire to be used practically, that is, to predict the behavior of a randomly selected subject in new situations on the basis of his completing it (using the validity criteria of the questionnaire), it is necessary to normalize its indicators on a normative sample. Only the use of statistical norms makes it possible to judge whether the severity of a particular psychological quality in a particular subject is increased or decreased. Although norms are important for applied psychology, in psychological research it is easiest to use raw indicators directly.

The performance of a particular subject should be compared with the performance of an adequate normative group. This is accomplished through some transformation that reveals the status of that individual relative to the given group.

Linear and nonlinear transformations of raw scale values. Standard indicators can be obtained by both linear and nonlinear transformation of primary indicators. Linear transformations are obtained by subtracting a constant from the primary indicator and then dividing by another constant; therefore, all the relationships characteristic of primary indicators also hold for linear ones. The most commonly used is the z-score (Formula 3).

But because the distribution of final scores on a given scale is often not normal, percentiles cannot be derived from these standardized indicators, i.e. one cannot estimate what percentage of subjects received the same indicator as the given subject.

If percentile normalization with conversion to stens and linear normalization with conversion to stens give the same sten values, then the distribution is considered normal to within the standard-ten scale.

To achieve comparability of results belonging to distributions of different shapes, a nonlinear transformation can be applied.

Normalized standard scores, obtained using a nonlinear transformation, are standard scores corresponding to a distribution that has been transformed so that it becomes normal. To calculate them, special tables are created for converting raw scores into standard ones. These tables give the percentage of cases for various degrees of deviation (in units of σ) from the mean. Thus, the mean, which corresponds to 50% of the group's results, can be equated to 0; the value one standard deviation below the mean can be equated to -1, with about 16% of the sample falling below it, while about 84% of the sample falls below the value +1.
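These percentages follow directly from the normal distribution function (a quick check with SciPy):

```python
from scipy.stats import norm

# Shares of the normal distribution lying below z = -1, 0, +1:
print(round(norm.cdf(-1), 3), round(norm.cdf(0), 3), round(norm.cdf(1), 3))
# → 0.159 0.5 0.841
```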


  • 2.6 Skewness and kurtosis

    In mathematical statistics, to determine the geometric form of the probability density of a random variable, two numerical characteristics associated with the central moments of the third and fourth orders are used.

    Definition 2.22. The sample skewness coefficient of x1, x2, …, xn is the number As equal to the ratio of the third-order central sample moment to the cube of the standard deviation S:

    As = μ3 / S³.

    Since the third-order central moment can be expressed through the initial moments as μ3 = m3 − 3 m1 m2 + 2 m1³, the skewness coefficient can be written through the initial moments as

    As = (m3 − 3 m1 m2 + 2 m1³) / S³,

    which facilitates practical calculations.

    The corresponding theoretical characteristic is introduced using the theoretical moments.

    Definition 2.23. The skewness coefficient of a random variable X is the number equal to the ratio of the third-order central moment to the cube of the standard deviation:

    As = μ3 / σ³.

    If a random variable X has a distribution symmetric about the mathematical expectation μ, then its theoretical skewness coefficient equals 0; if the probability distribution is asymmetric, the skewness coefficient differs from zero. A positive value of the skewness coefficient indicates that the right branch of the probability density curve is longer than the left; a negative value indicates that the longer part of the curve is on the left. This statement is illustrated by the following figure.

    Figure 2.1 – Distributions with positive and negative skewness

    Example 2.29. Let us find the sample skewness coefficient based on the data from the study of stressful situations in Example 2.28.

    Using the previously calculated values of the central sample moments, we obtain, after rounding, As = 0.07. The non-zero value of the skewness coefficient shows the skewness of the distribution relative to the mean, and its positive sign indicates that the longer branch of the probability density curve is on the right.
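A computation of the sample skewness coefficient through the central moments can be sketched as follows (the ten values are hypothetical, not the data of Example 2.28; scipy.stats.skew with its default biased estimator gives the same value):

```python
import numpy as np
from scipy import stats

x = np.array([3.0, 5.0, 4.0, 7.0, 4.0, 5.0, 6.0, 9.0, 4.0, 5.0])

mu3 = np.mean((x - x.mean()) ** 3)   # third-order central sample moment
S = x.std()                          # (uncorrected) standard deviation
A_moments = mu3 / S ** 3             # As = mu3 / S^3

# The same value via the library function:
A_scipy = stats.skew(x)
print(A_moments, A_scipy)
```

The positive result reflects the long right branch created by the value 9 in this small hypothetical sample.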

    The following characteristic describes the concentration of the values of the random variable X around its modal value.

    Definition 2.24. The sample kurtosis of x1, x2, …, xn is the number e equal to

    e = μ4 / S⁴ − 3,

    where μ4 is the fourth-order central sample moment and S⁴ is the fourth power of the standard deviation S.

    The theoretical concept of kurtosis is an analogue of the sample one.

    Definition 2.25. The kurtosis of a random variable X is the number e equal to

    e = μ4 / σ⁴ − 3,

    where μ4 is the theoretical fourth-order central moment and σ⁴ is the fourth power of the standard deviation.

    The kurtosis value e characterizes the relative steepness of the top of the distribution density curve near the maximum point. If the kurtosis is positive, the corresponding distribution curve has a sharper peak; a distribution with negative kurtosis has a smoother, flatter top. The following figure illustrates the possible cases.
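The sign convention can be illustrated by simulation (assuming NumPy/SciPy; scipy.stats.kurtosis returns the excess kurtosis e, i.e. 0 for the normal law):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
peaked = rng.laplace(size=5000)        # sharper peak than the normal curve
flat = rng.uniform(-1, 1, size=5000)   # flatter than the normal curve

# Positive e for the peaked distribution, negative e for the flat one.
print(stats.kurtosis(peaked), stats.kurtosis(flat))
```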

    Figure 2.2 – Distributions with positive, zero and negative kurtosis values

    Skewness is calculated by the SKEW function. Its argument is the range of cells with the data, for example =SKEW(A1:A100), if the data are contained in cells A1 through A100.

    Kurtosis is calculated by the KURT function, whose argument is the numeric data, usually specified as a range of cells, for example =KURT(A1:A100).

    §2.3. Analysis Tool Descriptive Statistics

    In Excel it is possible to calculate all the point characteristics of a sample at once using the Descriptive Statistics analysis tool, which is contained in the Analysis ToolPak.

    Descriptive Statistics creates a table of basic statistical characteristics for the data set: mean, standard error, variance, standard deviation, mode, median, range of variation, maximum and minimum values, skewness, kurtosis, count, sum of all elements, and confidence interval (confidence level). The Descriptive Statistics tool significantly simplifies statistical analysis, since there is no need to call a separate function for each statistical characteristic.

    To run Descriptive Statistics:

    1) in the Tools menu, select the Data Analysis command;

    2) in the Analysis Tools list of the Data Analysis dialog box, select the Descriptive Statistics tool and click OK.

    In the Descriptive Statistics window you must:

    · in the Input group, specify in the Input Range field the range of cells containing the data;

    · if the first row of the input range contains a column header, check the Labels in first row box;

    · in the Output options group, check the Summary statistics box if you need the complete list of characteristics;

    · check the Confidence Level for Mean box and specify the confidence level in % if you need a confidence interval calculated (the default is 95%). Click OK.

    As a result, a table with the calculated values of the statistical characteristics listed above will appear. Immediately, without deselecting this table, run the command Format > Column > AutoFit Selection.

    The Descriptive Statistics dialog box looks as follows:

    Practical tasks

    2.1. Calculation of basic point statistics using standard Excel functions

    The voltage in a section of a circuit was measured 25 times with the same voltmeter. As a result of the experiments, the following voltage values in volts were obtained:

    32, 32, 35, 37, 35, 38, 32, 33, 34, 37, 32, 32, 35,

    34, 32, 34, 35, 39, 34, 38, 36, 30, 37, 28, 30.

    Find the mean, the sample and corrected variances, the standard deviation, the range of variation, the mode, and the median. Test the deviation from the normal distribution by calculating the skewness and kurtosis.

    To complete this task, complete the following steps.

    1. Type the results of the experiment in column A.

    2. In cell B1 type “Average”, in B2 – “Sample variance”, in B3 – “Standard deviation”, in B4 – “Corrected variance”, in B5 – “Corrected standard deviation”, in B6 – “Maximum”, in B7 – “Minimum”, in B8 – “Range of variation”, in B9 – “Mode”, in B10 – “Median”, in B11 – “Asymmetry”, in B12 – “Kurtosis”.

    3. Adjust the width of this column using AutoFit Selection.

    4. Select cell C1 and click the "=" button in the formula bar. Using the Function Wizard, find the AVERAGE function in the Statistical category, then select the range of data cells and click OK.

    5. Select cell C2 and click the "=" button in the formula bar. Using the Function Wizard, find the VARP function (the uncorrected sample variance) in the Statistical category, then select the range of data cells and click OK.

    6. Do the same steps yourself to calculate the remaining characteristics.

    7. To calculate the range of variation in cell C8, enter the formula: =C6-C7.

    8. Add one line in front of your table, in which type the headings of the corresponding columns: “Name of characteristics” and “Numerical values”.
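The same characteristics can be cross-checked outside Excel, for example with Python's standard statistics module (a sketch; here "sample variance" is the uncorrected /n form and "corrected variance" the /(n-1) form, matching the task's terminology):

```python
import statistics as st

voltage = [32, 32, 35, 37, 35, 38, 32, 33, 34, 37, 32, 32, 35,
           34, 32, 34, 35, 39, 34, 38, 36, 30, 37, 28, 30]

mean = st.mean(voltage)
sample_var = st.pvariance(voltage)     # uncorrected ("sample") variance, /n
corrected_var = st.variance(voltage)   # corrected variance, /(n-1)
mode = st.mode(voltage)
median = st.median(voltage)
rng_var = max(voltage) - min(voltage)  # range of variation

print(mean, mode, median, rng_var)
```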
