The goodness-of-fit criterion. Pearson's goodness-of-fit test

Statistical hypotheses. Goodness-of-fit criteria.

The null (basic) hypothesis is the hypothesis put forward about the form of an unknown distribution, or about the parameters of known distributions. A competing (alternative) hypothesis is one that contradicts the null hypothesis.

For example, if the null hypothesis is that the random variable X is distributed according to a certain law, then a competing hypothesis might be that X is distributed according to a different law.

A statistical criterion (or simply a criterion) is a random variable K that serves to test the null hypothesis.

After a particular criterion has been selected, the set of all its possible values is divided into two disjoint subsets: one of them contains the criterion values at which the null hypothesis is rejected, and the other those at which it is accepted.

The critical region is the set of criterion values at which the null hypothesis is rejected. The acceptance region is the set of criterion values at which the hypothesis is accepted. Critical points are the points separating the critical region from the acceptance region.

For our example, if the value of the criterion calculated from the sample falls into the acceptance region, the hypothesis that the random variable is distributed according to the assumed law is accepted. If the calculated value falls into the critical region, the hypothesis about the distribution of the random variable is legitimately rejected.

In the case of the chi-square distribution, the critical region is determined by the inequality χ²_obs > χ²_cr, and the region where the null hypothesis is accepted by the inequality χ²_obs < χ²_cr.

2.6.3. Pearson's goodness-of-fit test.

One of the tasks of animal science and veterinary genetics is the breeding of new breeds and varieties with required traits: for example, increased immunity, resistance to disease, or a changed fur color.

In practice, when analyzing results, it very often turns out that the actual data more or less correspond to some theoretical distribution law. A need arises to assess the degree of correspondence between the actual (empirical) data and the theoretical (hypothetical) data. To do this, a null hypothesis is put forward: the resulting population is distributed according to law "A". The hypothesis about the expected distribution law is tested using a specially selected random variable, the goodness-of-fit criterion.

A goodness-of-fit criterion is a criterion for testing a hypothesis about the assumed law of an unknown distribution.

There are several goodness-of-fit criteria: Pearson's, Kolmogorov's, Smirnov's, and others. Pearson's goodness-of-fit test is the most commonly used.

Let us consider the application of the Pearson criterion using the example of testing the hypothesis that the population follows the normal distribution law. For this purpose, we will compare empirical frequencies with theoretical frequencies calculated under the assumption of a normal distribution.

There is usually some difference between theoretical and empirical frequencies. For example:

Empirical frequencies 7 15 41 93 113 84 25 13 5

Theoretical frequencies 5 13 36 89 114 91 29 14 6

Let's consider two cases:

The discrepancy between the theoretical and empirical frequencies is random (insignificant), i.e. it is possible to assume that the empirical frequencies are distributed according to the normal law;

The discrepancy between the theoretical and empirical frequencies is not accidental (significant), i.e. the theoretical frequencies were calculated from an incorrect hypothesis of a normal population distribution.

Using the Pearson goodness-of-fit test, one can determine whether the discrepancy between the theoretical and empirical frequencies is accidental or not, i.e. with a given confidence probability determine whether the population is distributed according to the normal law.

So, let an empirical distribution be obtained from a sample of size n:

Variants: x_1, x_2, …, x_s

Empirical frequencies: n_1, n_2, …, n_s

Let the theoretical frequencies be calculated under the assumption of a normal distribution. At the significance level α, it is necessary to test the null hypothesis: the population is normally distributed.

As the criterion for testing the null hypothesis, we take the random variable

χ² = Σ (n_i − n_i′)² / n_i′ (*)

where n_i are the empirical frequencies and n_i′ the theoretical ones.

This quantity is random, since in different experiments it takes different, previously unknown values. It is clear that the less the empirical and theoretical frequencies differ, the smaller the value of the criterion; hence, to a certain extent, it characterizes the closeness of the empirical and theoretical distributions.

It has been proven that, as n → ∞, the distribution law of the random variable (*), regardless of which distribution law the general population is subject to, tends to the χ² law with k degrees of freedom. Therefore, the random variable (*) is denoted χ², and the criterion itself is called the "chi-square" goodness-of-fit test.

Let us denote the value of the criterion calculated from observational data by χ²_obs. The tabulated critical value of the criterion for a given significance level α and number of degrees of freedom k is denoted χ²_cr(α, k). The number of degrees of freedom is determined from the equality k = s − 1 − r, where s is the number of groups (partial intervals) of the sample, or classes, and r is the number of parameters of the assumed distribution. The normal distribution has two parameters, the expected value and the standard deviation; therefore, the number of degrees of freedom for a normal distribution is found from the equality k = s − 2 − 1 = s − 3.

If the inequality χ²_obs < χ²_cr holds for the calculated and tabulated values, the null hypothesis of a normal population distribution is accepted. If χ²_obs > χ²_cr, the null hypothesis is rejected and the alternative hypothesis is accepted (the population is not normally distributed).
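As an illustration, this decision rule can be sketched in Python for the empirical and theoretical frequencies tabulated above; the critical point is taken from scipy rather than a printed table, and α = 0.05 is an assumption made for the example:

```python
# Hedged sketch: the chi-square decision rule applied to the empirical and
# theoretical frequencies shown above; alpha = 0.05 is an assumption.
from scipy.stats import chi2

empirical   = [7, 15, 41, 93, 113, 84, 25, 13, 5]
theoretical = [5, 13, 36, 89, 114, 91, 29, 14, 6]

# observed statistic: sum of (n_i - n_i')^2 / n_i'
chi2_obs = sum((n - nt) ** 2 / nt for n, nt in zip(empirical, theoretical))

# 9 classes, the normal law has r = 2 parameters, so k = 9 - 2 - 1 = 6
k = len(empirical) - 2 - 1
chi2_cr = chi2.ppf(1 - 0.05, k)      # critical point chi2_cr(alpha, k)

accept = chi2_obs < chi2_cr          # True -> no grounds to reject H0
print(round(chi2_obs, 2), round(chi2_cr, 2), accept)
```

For these frequencies the statistic comes out near 3.32 against a critical point of about 12.59, so the discrepancy would be treated as random.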

Comment. When using Pearson's goodness-of-fit test, the sample size must be at least 30, and each group must contain at least 5 observations. Groups containing fewer than 5 frequencies are combined with neighboring groups.

In the general case, the number of degrees of freedom for the chi-square distribution is the total number of quantities from which the corresponding indicators are calculated, minus the number of conditions connecting those quantities, i.e. reducing the possibility of variation among them. In the simplest cases, the number of degrees of freedom equals the number of classes reduced by one. For example, dihybrid segregation gives 4 classes, but only the first class is unconstrained; the subsequent ones are related to the previous ones. Therefore, for dihybrid segregation, the number of degrees of freedom is k = 4 − 1 = 3.



Example 1. Determine the degree of agreement of the actual distribution of groups by the number of cows with tuberculosis with the theoretically expected one, calculated under the assumption of a normal distribution. The source data are summarized in the table:

Solution.

For the significance level α and the number of degrees of freedom k, from the table of critical points of the χ² distribution (see Appendix 4) we find the value χ²_cr. Since χ²_obs < χ²_cr, we conclude that the difference between the theoretical and actual frequencies is random. Thus, the actual distribution of groups by the number of cows with tuberculosis corresponds to the theoretically expected one.

Example 2. The theoretical phenotype distribution of individuals obtained in the second generation of a dihybrid cross of rabbits is, according to Mendel's law, 9 : 3 : 3 : 1. It is required to assess the correspondence of the empirical distribution of rabbits obtained by crossing black individuals with normal hair and downy albino animals. In the second generation, 120 descendants were obtained, including 45 black shorthaired, 30 black downy, 25 white shorthaired, and 20 white downy rabbits.

Solution. The theoretically expected segregation in the offspring should correspond to the ratio of the four phenotypes (9 : 3 : 3 : 1). Let us calculate the theoretical frequencies (numbers of animals) for each class:

9 + 3 + 3 + 1 = 16, which means we can expect 120 · 9/16 = 67.5 black shorthaired; 120 · 3/16 = 22.5 black downy; 120 · 3/16 = 22.5 white shorthaired; 120 · 1/16 = 7.5 white downy.

The empirical (actual) distribution of phenotypes was as follows: 45; 30; 25; 20.

Let's summarize all this data in the following table:

Using the Pearson goodness-of-fit test, we calculate the value:

The number of degrees of freedom for a dihybrid cross is k = 4 − 1 = 3. For the chosen significance level we find the tabulated value χ²_cr. Since χ²_obs > χ²_cr, we conclude that the difference between the theoretical and actual frequencies is not random. Consequently, the resulting group of rabbits deviates in its phenotype distribution from Mendel's law for a dihybrid cross, reflecting the influence of certain factors that change the type of phenotypic segregation in the second generation of crossbreeds.
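The computation in Example 2 can be reproduced with a short script; the significance level α = 0.05 is assumed, since the text does not state it explicitly:

```python
# Sketch of the Example 2 computation (dihybrid 9 : 3 : 3 : 1 cross);
# alpha = 0.05 is assumed, since the text does not state it explicitly.
from scipy.stats import chi2

observed = [45, 30, 25, 20]                    # black short, black downy, white short, white downy
n = sum(observed)                              # 120 offspring
ratio = [9, 3, 3, 1]
expected = [n * r / sum(ratio) for r in ratio] # 67.5, 22.5, 22.5, 7.5

chi2_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
k = len(observed) - 1                          # 3 degrees of freedom
chi2_cr = chi2.ppf(0.95, k)                    # about 7.81

print(round(chi2_obs, 2), chi2_obs > chi2_cr)  # True -> reject the hypothesis
```

The statistic comes out near 31.1 against a critical value of about 7.81, which matches the conclusion that the segregation deviates from 9 : 3 : 3 : 1.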

Pearson's chi-square goodness-of-fit test can also be used to compare two homogeneous empirical distributions with each other, i.e. ones that have the same class boundaries. The null hypothesis is that the two unknown distribution functions are equal. In such cases the chi-square criterion is determined by the formula

χ² = n1·n2 · Σ (n_i/n1 − m_i/n2)² / (n_i + m_i) (**)

where n1 and n2 are the sizes of the distributions being compared, and n_i and m_i are the frequencies of the corresponding classes.
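A sketch of this two-sample computation, assuming the standard form of formula (**), χ² = n1·n2 · Σ (n_i/n1 − m_i/n2)²/(n_i + m_i). The class frequencies below are hypothetical, chosen only so that the sample totals (76 and 54) mirror Example 3; the result is cross-checked against the equivalent 2 × L contingency-table statistic:

```python
# Hedged sketch of the two-sample statistic, assuming the standard form of
# formula (**). The frequencies are hypothetical; only the totals mirror
# Example 3 below.
from scipy.stats import chi2_contingency

n_i = [10, 20, 30, 16]   # hypothetical frequencies, first sample (n1 = 76)
m_i = [12, 14, 18, 10]   # hypothetical frequencies, second sample (n2 = 54)
n1, n2 = sum(n_i), sum(m_i)

chi2_obs = n1 * n2 * sum(
    (a / n1 - b / n2) ** 2 / (a + b) for a, b in zip(n_i, m_i)
)

# cross-check: the same statistic via the 2 x L contingency table
chi2_tab = chi2_contingency([n_i, m_i], correction=False)[0]
print(abs(chi2_obs - chi2_tab) < 1e-9)   # the two forms agree exactly
```

The agreement of the two computations reflects the algebraic identity between formula (**) and the Pearson statistic of the corresponding contingency table.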

Consider a comparison of two empirical distributions using the following example.

Example 3. The length of cuckoo eggs was measured in two territorial zones. In the first zone, a sample of 76 eggs was examined (n1 = 76); in the second, of 54 (n2 = 54). The following results were obtained:

Length (mm): …
Frequencies n_i: …
Frequencies m_i: …

At the significance level α, we must test the null hypothesis that both samples of eggs belong to the same cuckoo population.

Definition. A criterion for testing a hypothesis about the assumed law of an unknown distribution is called a goodness-of-fit criterion.

There are several goodness-of-fit tests: $\chi ^2$ (chi-square) by K. Pearson, Kolmogorov, Smirnov, etc.

Theoretical and empirical frequencies usually differ. The discrepancy may not be accidental, meaning it is explained by an incorrectly chosen hypothesis. Pearson's criterion answers this question; like any criterion, however, it does not prove anything, but only establishes agreement or disagreement of the hypothesis with the observational data at the accepted significance level.

Definition. A sufficiently small probability at which an event can be considered practically impossible is called the significance level.

In practice, significance levels are usually taken between 0.01 and 0.05; $\alpha = 0.05$ is the $5\%$ significance level.

As a criterion for testing the hypothesis, we take the value \begin{equation} \label{eq1} \chi^2 = \sum{\frac{(n_i - n_i')^2}{n_i'}} \qquad (1) \end{equation}

here $n_i$ are the empirical frequencies obtained from the sample, and $n_i'$ the theoretically found frequencies.

It has been proven that as $n\to \infty$ the distribution law of the random variable (1), regardless of the law by which the population is distributed, tends to the $\chi^2$ (chi-square) law with $k$ degrees of freedom.

Definition. The number of degrees of freedom is found from the equality $k = S - 1 - r$, where $S$ is the number of interval groups and $r$ is the number of parameters estimated from the data.

1) uniform distribution: $r=2, k=S-3 $

2) normal distribution: $r=2, k=S-3 $

3) exponential distribution: $r=1, k=S-2$.

Rule. Testing the hypothesis using Pearson's test.

  1. To test the hypothesis, calculate the theoretical frequencies and find $\chi_{obs}^2 = \sum{\frac{(n_i - n_i')^2}{n_i'}}$.
  2. Using the table of critical points of the $\chi^2$ distribution, for the given significance level $\alpha$ and number of degrees of freedom $k$, find $\chi_{cr}^2(\alpha, k)$.
  3. If $\chi_{obs}^2 < \chi_{cr}^2$, there are no grounds to reject the hypothesis; if this condition does not hold, the hypothesis is rejected.

Comment. To check the calculations, use the formula for $\chi^2$ in the form $\chi_{obs}^2 = \sum{\frac{n_i^2}{n_i'}} - n$.
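The control formula is algebraically identical to the main one whenever the theoretical frequencies sum to $n$; a quick numerical check, using the rabbit frequencies from the dihybrid-cross example earlier in the text:

```python
# The control formula is an algebraic identity with the main one whenever
# the theoretical frequencies sum to n. Frequencies from the rabbit example.
n_i       = [45, 30, 25, 20]
n_i_prime = [67.5, 22.5, 22.5, 7.5]   # theoretical frequencies, sum = 120 = n
n = sum(n_i)

main    = sum((a - b) ** 2 / b for a, b in zip(n_i, n_i_prime))
control = sum(a * a / b for a, b in zip(n_i, n_i_prime)) - n
print(abs(main - control) < 1e-9)     # identical up to rounding error
```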

Testing the hypothesis of uniform distribution

The density function of the uniform distribution of the quantity $X$ has the form $f(x)=\frac{1}{b-a}$, $x\in \left[ a,b \right]$.

In order to test the hypothesis that a continuous random variable is distributed according to a uniform law at the significance level $\alpha $, it is required:

1) Find the sample mean $\overline{x}_b$ and $\sigma_b = \sqrt{D_b}$ from the given empirical distribution. Take as estimates of the parameters $a$ and $b$ the quantities

$a = \overline{x}_b - \sqrt{3}\,\sigma_b$, $b = \overline{x}_b + \sqrt{3}\,\sigma_b$

2) Find the probability of the random variable $X$ falling into the partial intervals $(x_i, x_{i+1})$ using the formula $P_i = P(x_i < X < x_{i+1}) = \frac{x_{i+1} - x_i}{b - a}$.

3) Find the theoretical (leveling) frequencies using the formula $n_i" =np_i $.

4) Taking the number of degrees of freedom $k = S - 3$ and the significance level $\alpha = 0.05$, find from the $\chi^2$ tables the critical value $\chi_{cr}^2(\alpha, k)$.

5) Using the formula $\chi_{obs}^2 = \sum{\frac{(n_i - n_i')^2}{n_i'}}$, where $n_i$ are the empirical frequencies, find the observed value $\chi_{obs}^2$.

6) If $\chi_{obs}^2 < \chi_{cr}^2$, there are no grounds to reject the hypothesis.
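The six steps above can be sketched as follows; the sample here is synthetic, generated only to exercise the procedure, and the interval count and α = 0.05 follow the rule:

```python
# Minimal sketch of steps 1-6 on a synthetic sample (the data, the number
# of intervals, and alpha = 0.05 are assumptions made for illustration).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
sample = rng.uniform(1, 25, size=100)        # hypothetical observations
n = sample.size

# 1) sample mean, standard deviation, and the estimates of a and b
x_bar = sample.mean()
sigma = sample.std()
a = x_bar - np.sqrt(3) * sigma
b = x_bar + np.sqrt(3) * sigma

# 2)-3) probabilities and theoretical frequencies over 6 equal intervals
edges = np.linspace(a, b, 7)
p = np.diff(edges) / (b - a)                 # each P_i = 1/6 here
n_theor = n * p

# empirical counts over the same intervals
n_emp, _ = np.histogram(np.clip(sample, a, b), bins=edges)

# 5)-6) observed statistic versus the critical point, k = S - 3
chi2_obs = float(((n_emp - n_theor) ** 2 / n_theor).sum())
chi2_cr = chi2.ppf(0.95, len(p) - 3)
print(round(chi2_obs, 3), chi2_obs < chi2_cr)
```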

Let's test the hypothesis using our example.

1) $\overline{x}_b = 13.00$, $\sigma_b = \sqrt{D_b} = 6.51$

2) $a=13.00-\sqrt 3 \cdot 6.51=13.00-1.732\cdot 6.51=1.72468$

$b=13.00+1.732\cdot 6.51=24.27532$

$b-a=24.27532-1.72468=22.55064$

3) $P_i = P(x_i < X < x_{i+1}) = \frac{x_{i+1} - x_i}{b-a}$; since every partial interval has length 4,

$P_i = \frac{4}{22.55064} \approx 0.17738$

$P_2 = P(3 < X < 7) \approx 0.17738$

$P_3 = P(7 < X < 11) \approx 0.17738$

$P_4 = P(11 < X < 15) \approx 0.17738$

$P_5 = P(15 < X < 19) \approx 0.17738$

$P_6 = P(19 < X < 23) \approx 0.17738$

In a uniform distribution, if the interval lengths are the same, then the $P_i$ are the same.

4) Find $n_i" =np_i $.

5) Find $\sum{\frac{(n_i - n_i')^2}{n_i'}}$, i.e. the observed value $\chi_{obs}^2$.

Let us enter all the obtained values into the table

\begin{array}{|l|l|l|l|l|l|l|} \hline i& n_i & n_i' =np_i & n_i -n_i' & (n_i -n_i')^2& \frac{(n_i -n_i')^2}{n_i'} & Control~\frac{n_i^2}{n_i'} \\ \hline 1& 1& 4.43438& -3.43438& 11.7950& 2.659898& 0.22551 \\ \hline 2& 6& 4.43438& 1.56562& 2.45117& 0.552765& 8.11838 \\ \hline 3& 3& 4.43438& -1.43438& 2.05744& 0.471463& 2.0296 \\ \hline 4& 3& 4.43438& -1.43438& 2.05744& 0.471463& 2.0296 \\ \hline 5& 6& 4.43438& 1.56562& 2.45117& 0.552765& 8.11838 \\ \hline 6& 6& 4.43438& 1.56562& 2.45117& 0.552765& 8.11838 \\ \hline & & & & & \sum =\chi_{obs}^2 =3.261119& \chi_{obs}^2 =\sum{\frac{n_i^2}{n_i'}} -n =3.63985 \\ \hline \end{array}

$\chi_{cr}^2(0.05;\,3) = 7.8$

$\chi_{obs}^2 < \chi_{cr}^2$: $3.26 < 7.8$

Conclusion: there are no grounds to reject the hypothesis.

GOAL OF THE WORK

The purpose of this laboratory work is:

· construction, from the results of the experiment, of distribution laws for the random scatter of the parameters of non-wirewound resistors;

· testing the hypothesis about the normal law of distribution of deviations of element parameters;

· experimental study of changes in the parameters of non-wirewound resistors under the action of temperature.

WORK DURATION

Laboratory work is carried out during a 4-hour lesson, including 1 hour for a colloquium to assess students' knowledge of the theoretical part.

THEORETICAL PART

Radio-electronic equipment is constantly under the influence of external and internal disturbing random factors, under whose action the parameters of the device's elements change. Changes in the parameters of elements (resistors, capacitors, semiconductor devices, integrated circuits, etc.) are associated with various physical processes occurring in materials as a result of external influences and aging. In addition, the parameters of RES elements have a production scatter, which is the result of random factors during their manufacture. Equipment designed from such elements responds to all these variations by changes in its output parameters. To predict the reliability of RES, it is necessary to establish the distribution laws of the random scatter of element parameters, determined by their production and by disturbing external conditions (in particular, ambient temperature).

In laboratory work, using goodness-of-fit tests (Pearson or Kolmogorov), the hypothesis about the normal distribution law of the random variable X - the scatter of the parameters of the elements - is tested.

AGREEMENT CRITERIA USED TO TEST STATISTICAL HYPOTHESES

Goodness-of-fit criteria allow us to assess the probability that the sample obtained in an experiment does not contradict an a priori chosen distribution law of the random variable under consideration. The solution of this problem rests on a fundamental proposition of mathematical statistics: the empirical (statistical) distribution function converges in probability to the a priori (theoretical) distribution function as the sample size increases without limit, provided the sample belongs to the a priori distribution in question. For a finite sample size, the empirical and a priori distribution functions will, generally speaking, differ from each other. Therefore, for the sample X_1, X_2, …, X_n of the random variable X, a certain numerical measure of discrepancy (goodness-of-fit statistic) ρ(F_n, F) between the empirical distribution function and the a priori one is introduced

F_n(x) = l/n, l = 1, 2, …, n, (1)

where X_1, X_2, …, X_n is the sample of experimental data, l is the number of sample values not exceeding x, and F(x) is the a priori distribution function.

The rule for testing the hypothesis of agreement between the a priori and empirical distributions is formulated as follows: if ρ(F_n, F) > C, then the hypothesis that the a priori distribution to which the sample X_1, X_2, …, X_n belongs is equal to F(x) must be rejected. To determine the threshold value C, a certain acceptable probability α of rejecting the hypothesis that the sample belongs to the distribution F is established. The probability α is called the significance level of the goodness-of-fit criterion. Then

i.e. C, the threshold value of the criterion, equals the α-percentage point of the distribution function of the discrepancy measure.

The event ρ(F_n, F) > C can also occur when the hypothesis put forward about the distribution law is true. However, if α is small enough, the possibility of such situations can practically be neglected. Commonly specified values are α = 0.05 and α = 0.01.

If the distribution law of the discrepancy measure ρ(F_n, F) does not depend on F, then the rule for rejecting the hypothesis of agreement between F_n and F

ρ(F_n, F) > C (4)

does not depend on the a priori distribution. Such criteria are called nonparametric (see Section 3.1.2).

The hypothesis about the nature of the distribution can also be tested with a goodness-of-fit criterion in a different sequence: from the obtained value ρ_n, determine the probability α_n = P{ρ ≥ ρ_n}. If the resulting value α_n < α, the deviations are significant; if α_n ≥ α, the deviations are not significant. Values of α_n very close to 1 (suspiciously good agreement) may indicate poor sample quality (for example, elements giving large deviations from the average were discarded from the original sample without justification).

The goodness-of-fit criteria used in statistics differ from one another in the measure of discrepancy they adopt between the statistical and theoretical distribution laws. Some of them are discussed below.

3.1.1. The χ² goodness-of-fit criterion

When using the χ² goodness-of-fit criterion (Pearson's criterion), the measure of discrepancy between the empirical and a priori distributions is determined as follows.

The range of possible values on which F(x), the a priori distribution function, is defined is divided into a finite number of non-overlapping intervals Δ_i, i = 1, 2, …, L.

Let us introduce the notation: p_i is the a priori probability that a sample value falls into the interval Δ_i. It is obvious that Σ p_i = 1. Let m_i elements of the observed sample X_1, X_2, …, X_n belong to the interval Δ_i. It is clear that Σ m_i = n.

Let us take as a measure of the discrepancy between the empirical and a priori distributions the value

χ² = Σ_{i=1}^{L} (m_i − np_i)² / (np_i), (5)

where m_i is the experimentally observed number of values of the random variable x in the i-th interval,

L is the number of intervals into which all experimental values of x are divided,

n is the sample size,

p_i is the probability of the random variable x falling into the i-th interval, calculated for the theoretical distribution law (the product np_i gives the number of hits in the i-th interval under the theoretical law).

As Pearson proved, as n → ∞ the distribution law of quantity (5) tends to the χ² distribution with S = L − 1 degrees of freedom, provided the hypothesis about the distribution is true.

If a complex hypothesis is being tested, namely that the sample belongs to a distribution with an unknown (scalar or vector) parameter, then an estimate of the unknown parameter is determined from the experiment (from the resulting sample). In this case S, the number of degrees of freedom of the χ² distribution, equals L − r − 1, where r is the number of estimated distribution parameters.

The rule for testing whether a sample belongs to a distribution can be formulated as follows: for sufficiently large n (n > 50) and a given significance level α, the hypothesis is rejected if χ² exceeds the α-percentage point of the χ² distribution with S degrees of freedom.

Kolmogorov criterion

Let us take as a measure of the discrepancy between the a priori and empirical distributions the statistic

D_n = sup_x |F_n(x) − F(x)|, (7)

i.e. the least upper bound of the modulus of the difference over all obtained values of X.

The distribution of this statistic (a random variable) for any n does not depend on F, provided only that the sample X_1, X_2, …, X_n on which it is constructed belongs to F and the latter is a continuous function. However, the exact expression for the distribution function at a finite n is very cumbersome. A. N. Kolmogorov found a fairly simple asymptotic expression (as n → ∞) for the function:

K(z) = P{√n · D_n < z} → Σ_{j=−∞}^{+∞} (−1)^j e^{−2j²z²}, z > 0. (8) Thus, for large sample sizes (n > 50), using (8) we get
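A sketch of the Kolmogorov procedure for a simple hypothesis: the statistic D_n is computed from a sample against a fully specified continuous F, and the asymptotic probability is obtained from scipy, which implements the alternating series corresponding to (8). The uniform sample and its size are illustrative assumptions:

```python
# Sketch: D_n for a sample against a fully specified continuous F, with the
# asymptotic tail probability from scipy (illustrative uniform sample).
import numpy as np
from scipy.special import kolmogorov   # survival function of Kolmogorov's law

rng = np.random.default_rng(1)
sample = np.sort(rng.uniform(0, 1, size=200))
n = sample.size

F = sample                              # F(x) = x on [0, 1] at the sample points
i = np.arange(1, n + 1)
D = max((i / n - F).max(), (F - (i - 1) / n).max())   # sup |F_n(x) - F(x)|

lam = np.sqrt(n) * D
P = kolmogorov(lam)    # asymptotic P{sqrt(n) * D_n >= lam}
print(round(float(D), 4), round(float(P), 4))
```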

Theoretical and empirical frequencies. Checking for normal distribution

When analyzing variation series, it is of great importance how closely the empirical distribution of the trait corresponds to the normal one. To do this, the frequencies of the actual distribution must be compared with the theoretical frequencies characteristic of a normal distribution. That is, from the actual data one must calculate the theoretical frequencies of the normal distribution curve, which are a function of the normalized deviations.

In other words, the empirical distribution curve needs to be aligned with the normal distribution curve.

Objective characteristics of the correspondence between theoretical and empirical frequencies can be obtained using special statistical indicators called goodness-of-fit criteria.

A goodness-of-fit criterion is a criterion that allows one to determine whether the discrepancy between the empirical and theoretical distributions is random or significant, i.e. whether the observational data agree with the statistical hypothesis put forward. The distribution that the population has by virtue of the hypothesis put forward is called theoretical.

There is a need to establish a criterion (rule) that would allow one to judge whether the discrepancy between the empirical and theoretical distributions is random or significant. If the discrepancy turns out to be random, the observational data (sample) are considered consistent with the hypothesis put forward about the distribution law of the general population, and the hypothesis is accepted; if the discrepancy turns out to be significant, the observational data do not agree with the hypothesis and it is rejected.

Empirical and theoretical frequencies typically differ, for one of two reasons:

    the discrepancy is random and due to a limited number of observations;

    the discrepancy is not accidental and is explained by the fact that the statistical hypothesis that the population is normally distributed is erroneous.

Thus, goodness-of-fit criteria make it possible to reject or confirm the hypothesis, put forward when smoothing the empirical series, about the nature of the distribution.

Empirical frequencies are obtained as a result of observation; theoretical frequencies are calculated using formulas.

For the normal distribution law they can be found as follows:

f′_i = (Σƒ_i · h / σ) · φ(t_i), where:

    Σƒ_i is the sum of the empirical frequencies;

    h is the difference between two neighboring variants;

    σ is the sample standard deviation;

    t is the normalized (standardized) deviation;

    φ(t) is the probability density function of the normal distribution (found from the table of values of the local Laplace function for the corresponding value of t).

There are several goodness-of-fit tests, the most common being the chi-square (Pearson) test, the Kolmogorov test, and the Romanovsky test.

The Pearson goodness-of-fit test χ² is one of the main ones. It can be represented as the sum of the ratios of the squared differences between theoretical (f_T) and empirical (f) frequencies to the theoretical frequencies, χ² = Σ (f_i − f_T)² / f_T, where:

    k is the number of groups into which the empirical distribution is divided,

    f i – observed frequency of the trait in the i-th group,

    f T – theoretical frequency.

For the χ² distribution, tables have been compiled that give the critical value of the χ² goodness-of-fit criterion for the selected significance level α and degrees of freedom df (or ν). The significance level α is the probability of erroneously rejecting the hypothesis put forward, i.e. the probability that a correct hypothesis will be rejected; P is the probability of accepting a correct hypothesis. In statistics, three levels of significance are most often used:

α = 0.10, then P = 0.90 (in 10 cases out of 100 a correct hypothesis may be rejected);

α = 0.05, then P = 0.95 (in 5 cases out of 100);

α = 0.01, then P = 0.99 (in 1 case out of 100).

The number of degrees of freedom df is defined as the number of groups in the distribution series minus the number of constraints: df = k − z. The number of constraints is the number of indicators of the empirical series used in calculating the theoretical frequencies, i.e. indicators connecting the empirical and theoretical frequencies. For example, when fitting the normal (bell) curve there are three constraints, so the number of degrees of freedom is df = k − 3. To assess significance, the calculated value is compared with the tabulated value χ²_tab.

If the theoretical and empirical distributions coincide completely, χ² = 0; otherwise χ² > 0. If χ²_calc > χ²_tab, then for the given significance level and number of degrees of freedom we reject the hypothesis that the discrepancies are insignificant (random). If χ²_calc < χ²_tab, we accept the hypothesis, and with probability P = (1 − α) we can assert that the discrepancy between the theoretical and empirical frequencies is random. Hence there are grounds to assert that the empirical distribution follows the normal distribution. Pearson's goodness-of-fit test is used when the population size is large enough (N > 50) and the frequency of each group is at least 5.
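The whole χ² procedure is also available as `scipy.stats.chisquare`; its `ddof` argument passes the number of estimated parameters so that df = k − 1 − ddof. The frequencies below are the ones tabulated earlier in the text; rescaling the theoretical frequencies to the same total is required by the function:

```python
# chi-square fit via scipy.stats.chisquare; ddof = number of estimated
# parameters (2 for a normal fit), so df = k - 1 - 2.
from scipy.stats import chisquare

f   = [7, 15, 41, 93, 113, 84, 25, 13, 5]      # empirical frequencies
f_T = [5, 13, 36, 89, 114, 91, 29, 14, 6]      # theoretical frequencies

# chisquare requires matching totals; rescale the theoretical row
scale = sum(f) / sum(f_T)
f_T = [x * scale for x in f_T]

stat, p = chisquare(f, f_exp=f_T, ddof=2)
print(round(stat, 2), p > 0.05)                # large p -> discrepancy random
```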

Kolmogorov goodness-of-fit test is based on determining the maximum discrepancy between the accumulated empirical and theoretical frequencies:

where D and d are, respectively, the maximum difference between the accumulated frequencies and between the accumulated relative frequencies of the empirical and theoretical distributions. Using the distribution table of the Kolmogorov statistic, a probability P(λ) between 0 and 1 is determined: P(λ) = 1 means complete coincidence of the frequencies, P(λ) = 0 complete discrepancy. If the probability P corresponding to the found value of λ is substantial, the discrepancies between the theoretical and empirical distributions can be considered insignificant, that is, random. The main condition for using the Kolmogorov criterion is a sufficiently large number of observations.

Kolmogorov goodness-of-fit test

Let us consider how the Kolmogorov criterion (λ) is applied when testing the hypothesis of a normal distribution of the general population. Fitting the actual distribution to the normal curve consists of several steps:

    Compare actual and theoretical frequencies.

    Based on actual data, the theoretical frequencies of the normal distribution curve, which is a function of the normalized deviation, are determined.

    They check to what extent the distribution of the characteristic corresponds to normal.

For the IV column of the table:

In MS Excel, the normalized deviation (t) is calculated using the STANDARDIZE function. Select a range of free cells equal in number to the variants (spreadsheet rows). Without removing the selection, call the STANDARDIZE function. In the dialog box that appears, specify the cells containing, respectively, the observed values (X_i), the mean (X̄) and the standard deviation σ. The operation must be completed by pressing Ctrl+Shift+Enter simultaneously.
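Outside of a spreadsheet, the same standardization step is a one-liner; a minimal Python equivalent with hypothetical observations:

```python
# Python equivalent of the spreadsheet standardization step (hypothetical
# data): t_i = (x_i - mean) / sigma for each observation.
import statistics

x = [12.0, 14.5, 9.8, 11.2, 15.1]      # hypothetical observed values
mean = statistics.fmean(x)
sigma = statistics.pstdev(x)           # population standard deviation

t = [(xi - mean) / sigma for xi in x]
print([round(ti, 3) for ti in t])
```

By construction the standardized values have mean 0 and standard deviation 1.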

For the V column of the table:

The probability density function of the normal distribution φ(t) is found from the table of values ​​of the local Laplace function for the corresponding value of the normalized deviation (t)

For column VI of the table:

The Kolmogorov goodness-of-fit statistic (λ) is determined by dividing the modulus of the maximum difference between the empirical and theoretical cumulative frequencies by the square root of the number of observations:

Using a special probability table for the agreement criterion λ, we determine that the value λ = 0.59 corresponds to a probability of 0.88.
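The quoted table value can be checked against the limiting Kolmogorov distribution; scipy's `kolmogorov` gives the probability that the normalized discrepancy is at least λ:

```python
# Checking the quoted table value: in Kolmogorov's limiting distribution,
# lambda = 0.59 corresponds to a probability of about 0.88.
from scipy.special import kolmogorov

P = kolmogorov(0.59)   # P{sqrt(n) * D_n >= 0.59}, asymptotically
print(round(P, 2))
```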

Distribution of empirical and theoretical frequencies, probability density of theoretical distribution

When applying goodness-of-fit tests to check whether the observed (empirical) distribution corresponds to the theoretical one, one should distinguish between testing simple and complex hypotheses.

The one-sample Kolmogorov-Smirnov normality test is based on maximum difference between the cumulative empirical distribution of the sample and the estimated (theoretical) cumulative distribution. If the Kolmogorov-Smirnov D statistic is significant, then the hypothesis that the corresponding distribution is normal should be rejected.

The hypothesis being tested is usually called the null hypothesis H_0; the rule by which the hypothesis is accepted or rejected is called a statistical criterion. Statistical criteria used to test hypotheses about the form of distribution laws are called goodness-of-fit criteria. That is, goodness-of-fit criteria establish when the discrepancies actually obtained between the assumed theoretical and the experimental distributions are insignificant (random), and when they are significant (non-random).

Let us consider a random variable that characterizes the measure of discrepancy between the assumed theoretical and the experimental distribution of the attribute. From the existing experimental distribution, we can determine the value a that this random variable has taken; if its distribution law is known, it is not difficult to find the probability that the random variable takes a value no less than a. If the value a was obtained by observing the random variable x, i.e. when the characteristic under consideration is distributed according to the assumed theoretical law, this probability should not be small. If the probability turns out to be small, this is explained by the fact that the value actually obtained came not from the random variable x but from some other variable with a different distribution law, i.e. the characteristic being studied is not distributed according to the expected law. Thus, when the probability is not small, the discrepancy between the empirical and theoretical distributions should be considered insignificant (random), and the experimental and theoretical distributions consistent with each other.

If the probability is small, the discrepancies between the experimental and theoretical distributions are significant and cannot be explained by chance, and the hypothesis that the characteristic is distributed according to the assumed theoretical law should be considered not confirmed: it does not agree with the experimental data. In that case one should, after carefully re-examining the experimental data, try to propose a new distribution law for the characteristic, one that reflects the features of the experimental distribution better and more fully. Probabilities not exceeding 0.1 are conventionally considered small.
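As an illustration of this decision rule, here is a hedged Python sketch that computes the tail probability P(K ≥ a) for a discrepancy measure with a χ²-type distribution; the closed form used is valid only for an even number of degrees of freedom, and the numeric inputs are hypothetical:

```python
import math

def chi2_sf(x, k):
    """Tail probability P(chi2_k > x) for EVEN degrees of freedom k,
    using the closed form exp(-x/2) * sum_{j < k/2} (x/2)^j / j!."""
    assert k % 2 == 0, "this closed form holds only for even k"
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(k // 2))

# If the tail probability is small (here, below 0.1), the discrepancy
# is treated as significant and the hypothesized law is rejected.
p = chi2_sf(15.0, 6)
print(p < 0.1)
```

For odd degrees of freedom one would fall back on tables or an incomplete-gamma routine.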

Pearson's goodness-of-fit test, or the χ² criterion.

Suppose the analysis of experimental data has led to the selection of a certain distribution law as the assumed one for the characteristic under consideration, and that its parameters have been found from the experimental data obtained in n observations (if they were not known beforehand). Denote by n_i the empirical frequencies of the random variable X.

Denote by n·P_i the theoretical frequencies, i.e. the products of the number of observations n and the probabilities P_i computed from the assumed theoretical distribution. As the measure of discrepancy between the theoretical and empirical frequency series, the χ² goodness-of-fit criterion takes the quantity

χ² = Σ (n_i − n·P_i)² / (n·P_i),

where the sum runs over all groups (intervals) of the empirical distribution.
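This formula transcribes directly into Python; the observed die-roll counts and the fair-die probabilities below are a hypothetical example, not data from this text:

```python
def chi_square_statistic(empirical, probabilities):
    """Pearson chi-square: sum over groups of (n_i - n*P_i)^2 / (n*P_i)."""
    n = sum(empirical)  # total number of observations
    return sum((n_i - n * p) ** 2 / (n * p)
               for n_i, p in zip(empirical, probabilities))

# Hypothetical example: 60 die rolls tested against a fair-die
# hypothesis, so every theoretical probability P_i = 1/6
observed = [8, 12, 9, 11, 6, 14]
print(round(chi_square_statistic(observed, [1 / 6] * 6), 2))
```

Note that the statistic is 0 exactly when every empirical frequency equals its theoretical counterpart.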

The quantity χ² is said to follow the χ² distribution, or Pearson distribution. It equals 0 only when all the empirical and theoretical frequencies coincide; in all other cases it differs from 0, and the greater the discrepancy between these frequencies, the larger it is. It has been proven that as n → ∞ this statistic has a Pearson distribution with degrees of freedom

k = m − s − 1,

where m is the number of intervals (groups) of the empirical variation series,

and s is the number of parameters of the theoretical distribution estimated from the experimental data (for example, for a normal distribution the number of parameters estimated from the sample is 2).

The scheme for applying the χ² criterion is as follows:

1. Based on the experimental data, select a distribution law for the characteristic as the assumed one and find its parameters.

2. Using the resulting distribution, determine the theoretical frequencies corresponding to the experimental frequencies.

3. Combine small experimental frequencies, if any, with neighboring ones, then compute the value of χ² from the formula.

4. Determine the number of degrees of freedom k.

5. From the tables in the appendix, for the chosen significance level α, find the critical value of χ² for k degrees of freedom.

6. Formulate the conclusion, guided by the general principle of applying goodness-of-fit criteria: if the probability is greater than 0.01, the observed discrepancies between the theoretical and experimental frequencies are considered insignificant.
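The steps above can be sketched end to end for a normality hypothesis. This is a minimal illustration, not the textbook's worked example: the grouped sample, the interval edges, and the tabulated critical value 5.991 (for k = 5 − 2 − 1 = 2 degrees of freedom at α = 0.05) are all assumptions made for the sketch:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma) via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def chi2_goodness_of_fit(counts, edges, crit):
    """Steps 1-6 for a normality hypothesis: estimate mu and sigma from
    the grouped data (interval midpoints), build theoretical frequencies
    n*P_i, form chi2 and compare with the tabulated critical value."""
    n = sum(counts)
    mids = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    # step 1: parameters estimated from the grouped sample (s = 2)
    mu = sum(m * c for m, c in zip(mids, counts)) / n
    sigma = math.sqrt(sum(c * (m - mu) ** 2 for m, c in zip(mids, counts)) / n)
    # step 2: theoretical probabilities of each interval under N(mu, sigma)
    probs = [normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
             for a, b in zip(edges, edges[1:])]
    # step 3: the chi-square measure of discrepancy
    chi2 = sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))
    # steps 5-6: compare with the critical value; True means "accept H0"
    return chi2, chi2 <= crit

counts = [6, 22, 41, 24, 7]               # hypothetical grouped sample
edges = [90, 96, 102, 108, 114, 120]      # 5 intervals
chi2, ok = chi2_goodness_of_fit(counts, edges, crit=5.991)
```

Here `ok` is True when the observed χ² does not exceed the critical value, i.e. when the discrepancies are insignificant at the chosen level.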

If the actually observed value of χ² is greater than the critical value, H0 is rejected; otherwise it is concluded that the hypothesis does not contradict the experimental data. The χ² criterion gives satisfactory results when there is a sufficient number of observations n_i in each grouping interval.

Note: if the number of observations in some interval is less than 5, it makes sense to combine neighboring intervals so that in each combined interval n_i is no less than 5. In that case, when calculating the number of degrees of freedom k, m is taken to be the correspondingly reduced number of intervals.

The following distribution of 100 workshop workers according to output in the reporting year (as a percentage of the previous year) was obtained:


