Statistical reliability. Assessing the reliability of the results of a statistical study

Hypotheses are tested using statistical analysis. Statistical significance is assessed with the P-value, which is the probability of obtaining a result at least as extreme as the one observed, assuming that a given statement (the null hypothesis) is true. If the P-value is less than the chosen level of statistical significance (usually 0.05), the experimenter can reject the null hypothesis and proceed to consider the alternative hypothesis. Using Student's t-test, you can calculate the P-value and determine significance for two data sets.

Steps

Part 1

Setting up the experiment

  1. Define your hypothesis. The first step in assessing statistical significance is to choose the question you want to answer and formulate a hypothesis. A hypothesis is a statement about the experimental data, their distribution and properties. For any experiment there is both a null and an alternative hypothesis. Generally speaking, you will be comparing two sets of data to determine whether they are similar or different.

    • The null hypothesis (H0) typically states that there is no difference between the two sets of data. For example: students who read the material before class do not receive higher grades.
    • The alternative hypothesis (Ha) is the opposite of the null hypothesis and is the statement that needs to be supported by the experimental data. For example: students who read the material before class do get higher grades.
  2. Set the significance level to determine how strong the evidence must be before a result can be considered significant. The significance level (also called the α-level) is the threshold you define for statistical significance. If the P-value is less than or equal to the significance level, the data are considered statistically significant.

    • As a rule, the significance level (the value of α) is taken to be 0.05, in which case the probability of detecting a purely random difference between the data sets is only 5%.
    • The lower the significance level (and, accordingly, the lower the required P-value), the more reliable the results.
    • If you want more reliable results, lower the significance level to 0.01. Lower significance levels are typically used in manufacturing, when it is necessary to detect defects in products: high reliability is required to be sure that all parts work as expected.
    • For most hypothesis-testing experiments, a significance level of 0.05 is sufficient.
  3. Decide whether to use a one-tailed or a two-tailed test. One of the assumptions of Student's t-test is that the data are normally distributed. The normal distribution is a bell-shaped curve with most of the observations in the middle. Student's t-test is a mathematical procedure for determining whether your data fall far enough into one or both “tails” of the distribution to be considered unusual under the null hypothesis.

    • If you are not sure whether your data will fall above or below the control-group values, use a two-tailed test. This allows you to detect significance in either direction.
    • If you know in which direction the data are expected to deviate, use a one-tailed test. In the example above, we expect students' grades to increase, so a one-tailed test can be used.
  4. Determine the sample size using a power analysis. The statistical power of a study is the probability that, with a given sample size, the expected effect will be detected. A common power threshold is 80% (equivalently, a type II error rate β of 0.2). Analyzing statistical power without any prior data can be challenging, because it requires some idea of the expected means in each group and their standard deviations. Use a power-analysis calculator to determine the optimal sample size for your data (see the sketch after this list).

    • Typically, researchers conduct a small pilot study that provides data for statistical power analysis and determines the sample size needed for a larger, more complete study.
    • If you are unable to conduct a pilot study, try to estimate possible averages based on the literature and other people's results. This may help you determine the optimal sample size.
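
For illustration, here is a minimal sketch of such a power analysis in Python. It assumes the statsmodels library, and the effect size of 0.5 (Cohen's d) is a placeholder, not a value from this text; in practice you would estimate it from a pilot study or from the literature.

```python
# A sketch of an a-priori power analysis for a two-sample t-test.
# The effect size is an assumed placeholder; estimate it from pilot
# data or from published means and standard deviations.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,       # assumed standardized difference between groups
    alpha=0.05,            # significance level
    power=0.80,            # desired statistical power (1 - beta)
    alternative="larger",  # one-tailed, as in the grades example above
)
print(f"Required sample size per group: {n_per_group:.0f}")
```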

Part 2

Calculate the standard deviation
    1. Write down the formula for the standard deviation. The standard deviation shows how much spread there is in the data and allows you to judge how close together the values obtained from a sample are. At first glance the formula seems complicated, but the explanations below will help you understand it. The formula is: s = √( ∑(xᵢ – µ)² / (N – 1) ).

      • s is the standard deviation;
      • the sign ∑ indicates that the quantities that follow are summed over all values in the sample;
      • xᵢ is the i-th value, that is, an individual result;
      • µ is the mean value for the group;
      • N is the number of data points in the sample.
    2. Find the mean of each group. To calculate the standard deviation, you must first find the mean for each study group. The mean is denoted by the Greek letter µ (mu). To find it, simply add up all the values and divide by the number of data points (the sample size).

      • For example, to find the average grade for a group of students who study before class, consider a small data set. For simplicity, we use a set of five points: 90, 91, 85, 83 and 94.
      • Let's add all the values together: 90 + 91 + 85 + 83 + 94 = 443.
      • Let's divide the sum by the number of values, N = 5: 443/5 = 88.6.
      • Thus, the average for this group is 88.6.
    3. Subtract the mean from each value. The next step is to compute the differences (xᵢ – µ). To do this, subtract the mean you just found from each individual value. In our example, we need to find five differences:

      • (90 – 88.6), (91 – 88.6), (85 – 88.6), (83 – 88.6) and (94 – 88.6).
      • As a result, we get the following values: 1.4, 2.4, -3.6, -5.6 and 5.4.
    4. Square each value obtained and add them together. Each of the quantities just found should be squared. This step will remove all negative values. If after this step you still have negative numbers, then you forgot to square them.

      • For our example, we get 1.96, 5.76, 12.96, 31.36 and 29.16.
      • We add up the resulting values: 1.96 + 5.76 + 12.96 + 31.36 + 29.16 = 81.2.
    5. Divide by the sample size minus 1. In the formula, the sum is divided by N – 1 because we are working with a sample of students rather than the entire population.

      • Subtract: N – 1 = 5 – 1 = 4
      • Divide: 81.2/4 = 20.3
    6. Take the square root. After dividing by the sample size minus one, take the square root of the result. This is the last step in calculating the standard deviation. Statistical software will perform all these calculations for you once the raw data are entered (see the sketch after this step).

      • In our example, the standard deviation of the grades of those students who read the material before class is s =√20.3 = 4.51.
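
As a quick check of the arithmetic above, here is a minimal Python sketch using only the standard library; the data are the five grades from the example.

```python
# Verify the worked example: statistics.stdev() divides by N - 1,
# matching the formula s = sqrt(sum((x_i - mu)^2) / (N - 1)).
import statistics

grades = [90, 91, 85, 83, 94]
mu = statistics.mean(grades)   # 88.6
s = statistics.stdev(grades)   # ~4.51
print(f"mean = {mu}, s = {s:.2f}")
```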

Part 3

Determine significance
      1. Calculate the variance between the two groups of data. Up to this point we have looked at only one group of data. To compare two groups, you obviously need data from both. Calculate the standard deviation of the second group, and then find the variance of the difference between the two experimental groups using the formula: sd = √(s₁²/N₁ + s₂²/N₂). A worked sketch follows below.
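
The walkthrough breaks off at this point, so, for completeness, here is a hedged sketch of how the two-group comparison would typically be finished in Python. It uses SciPy's independent-samples t-test (the Welch variant, which does not assume equal variances); the control group's grades are invented placeholders, since the text gives only one data set.

```python
# Compare the two groups with an independent-samples t-test.
# group_b is hypothetical placeholder data.
from scipy import stats

group_a = [90, 91, 85, 83, 94]  # students who read before class (from the text)
group_b = [84, 88, 82, 80, 86]  # assumed control group (placeholder)

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # significant if p <= 0.05
```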

What do you think makes your “other half” special and meaningful? Is it their personality, or the feelings you have for this person? Or perhaps the simple fact that, as studies show, the hypothesis that your attraction is random has a probability of less than 5%? If we took that last statement at face value, successful dating sites could not exist in principle.

When you conduct split testing or any other analysis of your website, misunderstanding “statistical significance” can lead to misinterpretation of the results and, therefore, incorrect actions in the conversion optimization process. This is true for the thousands of other statistical tests performed every day in every existing industry.

To understand what “statistical significance” is, you need to dive into the history of the term, learn its true meaning, and understand how this “new” old understanding will help you correctly interpret the results of your research.

A little history

Although humanity has used statistics to solve problems for many centuries, the modern understanding of statistical significance, hypothesis testing, randomization and even design of experiments (DOE) began to take shape only at the beginning of the 20th century, and it is inextricably linked with the name of Sir Ronald Fisher (1890-1962).

Ronald Fisher was an evolutionary biologist and statistician who had a special passion for the study of evolution and natural selection in the animal and plant kingdoms. During his illustrious career, he developed and popularized many useful statistical tools that we still use today.

Fisher used the techniques he developed to explain processes in biology such as dominance, mutations and genetic deviations. We can use the same tools today to optimize and improve the content of web resources. The fact that these analysis tools can be used to work with objects that did not even exist at the time of their creation seems quite surprising. It is equally surprising that people used to perform complex calculations without calculators or computers.

To describe the results of a statistical experiment as having a high probability of being true, Fisher used the word “significance.”

One of Fisher's most curious ideas is the “sexy son” hypothesis. According to this theory, females prefer promiscuous males because their sons will inherit the same predisposition and produce more offspring (note that this is just a theory).

But no one, even brilliant scientists, is immune from making mistakes. Fisher's flaws still plague specialists to this day. But remember the words of Albert Einstein: “Whoever has never made a mistake has never created anything new.”

Before moving on to the next point, remember: statistical significance is when the difference in test results is so large that the difference cannot be explained by random factors.

What is your hypothesis?

To understand what “statistical significance” means, you first need to understand what “hypothesis testing” is, since the two terms are closely intertwined.
A hypothesis is just a theory. Once you have developed a theory, you will need to establish a process for collecting enough evidence and actually collecting that evidence. There are two types of hypotheses.

Apples or oranges - which is better?

Null hypothesis

As a rule, this is where many people run into difficulties. Keep in mind that a null hypothesis is not something to be proven, the way you might prove that a change to a website will increase conversions; it is the reverse. The null hypothesis states that if you make a change to the site, nothing will happen. The researcher's goal is to refute this theory, not to prove it.

If we look at the experience of solving crimes, where investigators also form hypotheses as to who the criminal is, the null hypothesis takes the form of the so-called presumption of innocence, the concept according to which the accused is presumed innocent until proven guilty in a court of law.

If the null hypothesis is that two objects are equal in their properties, and you are trying to prove that one is better (for example, A is better than B), you need to reject the null hypothesis in favor of the alternative. For example, you are comparing one or another conversion optimization tool. In the null hypothesis, they both have the same effect (or no effect) on the target. In the alternative, the effect of one of them is better.

Your alternative hypothesis may contain a numerical value, such as B – A > 20%. In this case, the null and alternative hypotheses can take the following form: H0: B – A ≤ 20%; Ha: B – A > 20%.

Another name for an alternative hypothesis is a research hypothesis because the researcher is always interested in proving this particular hypothesis.

Statistical significance and p value

Let's return again to Ronald Fisher and his concept of statistical significance.

Now that you have a null hypothesis and an alternative, how can you prove one and disprove the other?

Since statistics, by its very nature, studies a specific sample rather than the whole population, you can never be 100% sure of the results obtained. A good example: election results often differ from preliminary polls and even from exit polls.

Dr. Fisher wanted a dividing line that would tell you whether your experiment was a success. This is how the significance level appeared: the threshold we adopt to separate what we consider “significant” from what we do not. If p, the significance value, is 0.05 or less, the results are considered statistically significant.

Don't worry, it's actually not as confusing as it seems.

Gaussian probability distribution. At the edges are the less probable values of the variable; in the center, the most probable. The p-value (green shaded area) is the probability of the observed outcome occurring by chance.

The normal probability distribution (Gaussian distribution) is a representation of all possible values of a certain variable on a graph (in the figure above) and of their frequencies. If you conduct your research correctly and then plot all your answers on a graph, you will get exactly this distribution. According to the normal distribution, a large percentage of answers will be similar, and the remaining options will lie at the edges of the graph (the so-called “tails”). This distribution of values is often found in nature, which is why it is called “normal”.

Using an equation based on your sample and test results, you can calculate a “test statistic” that indicates how far your results deviate from what the null hypothesis predicts, and therefore how plausible the null hypothesis remains.

To help you get your head around it, use online calculators to calculate statistical significance:

One example of such calculators

Strictly speaking, the letter "p" represents the probability of obtaining results at least as extreme as yours if the null hypothesis is true (it is often loosely described as the probability that the null hypothesis is true). If the number is small, it indicates a difference between the test groups, whereas the null hypothesis says they are the same. Graphically, a small p-value means your test statistic falls closer to one of the tails of the bell-shaped distribution.

Dr. Fisher decided to set the significance threshold at p ≤ 0.05. However, this statement is controversial, since it leads to two difficulties:

1. First, the fact that you have shown the null hypothesis to be implausible does not mean that you have proven the alternative hypothesis. Significance on its own proves neither A nor B outright.

2. Secondly, a p-value of 0.049 still means a 4.9% chance of obtaining such results when the null hypothesis is in fact true. Your test result could therefore look convincing and still be wrong.

Whether or not you rely on the p-value alone, you still need to weigh, case by case, whether the remaining probability that the null hypothesis holds is large enough to stop you from making the changes you planned and tested.

The most common scenario for conducting a statistical test today is to set a significance threshold of p ≤ 0.05 before running the test itself. Just be sure to look closely at the p-value when checking your results.
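
In the split-testing context this article discusses, such a test is often a comparison of two conversion rates. A minimal sketch, assuming statsmodels' two-proportion z-test and invented placeholder counts:

```python
# Two-proportion z-test for an A/B split test.
# Visitor and conversion counts are hypothetical placeholders.
from statsmodels.stats.proportion import proportions_ztest

conversions = [180, 220]  # conversions on variants A and B
visitors = [2000, 2000]   # visitors shown each variant

z_stat, p_value = proportions_ztest(conversions, visitors)
alpha = 0.05              # threshold chosen before running the test
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Cannot reject H0")
```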

Errors 1 and 2

Enough time has passed that the errors that can occur when using statistical significance have been given names of their own.

Type 1 Errors

As mentioned above, a significance threshold of 0.05 means accepting a 5% chance of rejecting a null hypothesis that is actually true. When that happens, you are making a type 1 error (a false positive): the results say your new website increased conversions, but there is a 5% chance that it did not.

Type 2 Errors

This error is the opposite of a type 1 error: you accept the null hypothesis when it is false. For example, the test results tell you that the changes made to the site brought no improvement, when in fact they did. As a result, you miss an opportunity to improve your performance.

This error is common in tests with an insufficient sample size, so remember: the larger the sample, the more reliable the result.

Conclusion

Perhaps no term is as popular among researchers as statistical significance, and misreading it can have consequences ranging from missed gains in conversion rates to the collapse of a company.

And since marketers use this term when optimizing their resources, you need to know what it really means. Test conditions may vary, but sample size and success criteria are always important. Remember this.

Task 3. Five preschoolers are given a test. The time taken to solve each task is recorded. Will statistically significant differences be found between the times taken to solve the first three test tasks?

(The task's data table, headed “No. of subjects”, is not reproduced in this copy.)

Reference material

This assignment is based on the theory of analysis of variance. In general, the task of analysis of variance is to identify the factors that have a significant effect on the result of an experiment. Analysis of variance can also be used to compare the means of more than two samples; one-way analysis of variance serves this purpose.

To solve such problems, the following convention is adopted: if the variance of the obtained values of the optimization parameter under the influence of a factor differs from the variance of the results in the absence of that factor, the factor is considered significant.

As the formulation of the problem shows, methods of testing statistical hypotheses are used here, namely the comparison of two empirical variances. Analysis of variance is therefore based on testing variances with Fisher's F-test. In this task, we must check whether the differences between the times taken by each of the six preschoolers to solve the first three test tasks are statistically significant.

The null (main) hypothesis is the hypothesis put forward, H0. Its essence comes down to the assumption that the difference between the compared parameters is zero (hence the name) and that any observed differences are random.

The competing (alternative) hypothesis, H1, contradicts the null hypothesis.

Solution:

Using one-way analysis of variance at a significance level of α = 0.05, we test the null hypothesis H0 that there are no statistically significant differences between the times taken by the six preschoolers to solve the first three test tasks.

Let's look at the table of the task's source data, from which we find the average time to solve each of the three test tasks.

(The source-data table is not reproduced in this copy. Its columns were: No. of subjects; factor levels; time to solve the first, second and third test tasks, in seconds; group average.)

Finding the overall average (the mean of all p × q observations):

To account for the significance of time differences in each test, the total sample variance is divided into two parts: the first is called the factorial (between-groups) part, the second the residual (within-groups) part.

Let's calculate the total sum of squared deviations from the overall average using the formula

S_total = ∑∑ (x_ij − x̄)², or equivalently S_total = ∑∑ x_ij² − (∑∑ x_ij)² / (pq),

where p is the number of time measurements (test tasks), q is the number of test takers, and the double sum runs over all observations. To do this, let's build a table of squares.

(The table of squares is likewise not reproduced in this copy; it contained the same columns with the squared values.)
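
Because the numerical tables are missing from this copy, here is a minimal sketch of the same one-way analysis of variance in Python; the solution times are invented placeholders, not the task's data.

```python
# One-way ANOVA across the three test tasks, as described above.
# Times (in seconds) are hypothetical placeholders.
from scipy import stats

task1 = [28, 32, 30, 25, 34, 29]  # six subjects' times on task 1
task2 = [31, 35, 29, 28, 36, 33]  # times on task 2
task3 = [27, 30, 26, 24, 31, 28]  # times on task 3

f_stat, p_value = stats.f_oneway(task1, task2, task3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# If p > 0.05, retain H0: no significant differences between the tasks.
```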

In any scientific or practical experiment (or survey), researchers can study not the entire population but only a certain sample. For example, even if we are studying a relatively small group of people, such as those suffering from a particular disease, it is still very unlikely that we have the resources, or the need, to test every patient. Instead, a sample from the population is tested, because it is more convenient and less time-consuming. If so, how do we know that the results obtained from the sample are representative of the whole group? Or, to use professional terminology, can we be sure that our research correctly describes the entire population from which the sample was taken?

To answer this question, we need to determine the statistical significance of the test results. Statistical significance (significance level, abbreviated Sig.), or p-level, is here understood as the probability that a given result correctly represents the population from which the sample was drawn. Note that this is only a probability: it is impossible to say with absolute certainty that a given study correctly describes the entire population. At best, the significance level lets us conclude that this is very likely. Thus, the next question inevitably arises: how high must that probability be before a result can be considered a correct characterization of the population?

For example, at what probability value are you willing to say that such chances are enough to take a risk? What if the odds are 10 out of 100 or 50 out of 100? What if this probability is higher? What about odds like 90 out of 100, 95 out of 100, or 98 out of 100? For a situation involving risk, this choice is quite problematic, because it depends on the personal characteristics of the person.

In psychology it is traditionally believed that 95 or more chances out of 100 mean that the probability of the results being correct is high enough for them to be generalized to the entire population. This figure was established in the course of scientific and practical activity: there is no law according to which it must be chosen as the guideline (and indeed, other sciences sometimes choose other values for the significance level).

In psychology this probability is handled in a somewhat unusual way. Instead of the probability that the sample represents the population, one works with the probability that the sample does not represent the population; in other words, the probability that the observed relationship or differences are random rather than a property of the population. So, instead of saying that there is a 95 in 100 chance that the results of a study are correct, psychologists say that there is a 5 in 100 chance that the results are wrong (just as a 40 in 100 chance that the results are correct means a 60 in 100 chance that they are not). The probability value is sometimes expressed as a percentage, but more often it is written as a decimal fraction: 10 chances out of 100 become 0.1, 5 out of 100 become 0.05, and 1 out of 100 becomes 0.01. With this form of notation, the cutoff value is 0.05: for a result to be considered correct, its significance level must be below this number (remember, this is the probability that the result incorrectly describes the population). To settle the terminology, this “probability of the result being incorrect” (more properly called the significance level) is usually denoted by the Latin letter p. Descriptions of experimental results usually include a summary statement such as “the results were significant at p < 0.05” (i.e., less than 5%).

Thus, the significance level (p) indicates the probability that the results do not represent the population. Traditionally in psychology, results are considered to reliably reflect the overall picture if p is less than 0.05 (i.e., 5%). However, this is a probabilistic statement, not an unconditional guarantee, and in some cases the conclusion will be wrong. In fact, we can calculate how often this might happen from the magnitude of the significance level: at a significance level of 0.05, the results will be incorrect in 5 cases out of 100. At first glance this does not seem very frequent, but 5 chances out of 100 is the same as 1 out of 20; in other words, in one out of every twenty cases the result will be incorrect. Such odds are not especially favorable, and researchers should beware of committing errors of the first kind: the error that occurs when researchers think they have found a real result when in fact there is none. The opposite error, when researchers believe they have not found a result when in fact there is one, is called an error of the second kind.

These errors arise because the possibility that the statistical analysis is mistaken can never be ruled out. The probability of error depends on the level of statistical significance of the results. We have already noted that for a result to be considered correct, the significance level must be below 0.05. Of course, some results reach a lower level, and it is not uncommon to find results as low as 0.001 (a value of 0.001 indicates that the results have a 1 in 1000 chance of being wrong). The smaller the p-value, the stronger our confidence in the correctness of the results.

Table 7.2 shows the traditional interpretation of significance levels, the statistical inference they allow, and the rationale for deciding whether a relationship (or difference) is present.

Table 7.2

Traditional interpretation of significance levels used in psychology

Based on the experience of practical research, it is recommended that, in order to avoid errors of the first and second kinds as far as possible, important conclusions about the presence of differences (relationships) be drawn with reference to the achieved p-level of significance.

A statistical test is a tool for determining the level of statistical significance: a decision rule that ensures that a true hypothesis is accepted, and a false one rejected, with high probability.

The term “statistical criterion” also denotes both the method for calculating a certain number and that number itself. All criteria serve one main purpose: to determine the significance level of the data they analyze (that is, the likelihood that the data reflect a true effect that correctly represents the population from which the sample was drawn).

Some tests can only be used for normally distributed data (and if the trait is measured on an interval scale) - these tests are usually called parametric. Using other criteria, you can analyze data with almost any distribution law - they are called nonparametric.

Parametric criteria are criteria that include distribution parameters in the calculation formula, i.e. means and variances (Student's t-test, Fisher's F-test, etc.).

Nonparametric criteria are criteria whose calculation formulas do not include distribution parameters and which operate on frequencies or ranks (Rosenbaum's Q test, the Mann-Whitney U test, and others).

For example, when we say that the significance of the differences was determined by the Student's t-test, we mean that the Student's t-test method was used to calculate the empirical value, which is then compared with the tabulated (critical) value.

By comparing the empirical (calculated) value of the criterion with the critical (tabulated) value, we can judge whether our hypothesis is confirmed or refuted. In most cases, for differences to be recognized as significant, the empirical value must exceed the critical one, although there are criteria (for example, the Mann-Whitney test or the sign test) for which the opposite rule applies.

In some cases, the criterion's calculation formula includes the number of observations in the sample, denoted n. Using a special table, we determine which level of statistical significance a given empirical value corresponds to. In most cases, the same empirical value of a criterion may be significant or insignificant depending on the number of observations in the sample (n) or on the so-called number of degrees of freedom, denoted ν or df (sometimes d).

Knowing n or the number of degrees of freedom, we can use special tables (the main ones are given in Appendix 5) to determine the critical values of the criterion and compare the obtained empirical value with them. This is usually written as: “for n = 22 the critical value of the criterion is t = 2.07” or “for ν = 2 the critical value of Student's test is t = 4.30”, and so on (see the sketch below for how such critical values are computed in software).
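
As an illustration, the tabulated critical values cited above can be reproduced in software; a minimal sketch using SciPy's t-distribution quantile function for a two-tailed test at α = 0.05:

```python
# Reproduce the critical values of Student's t-test quoted above.
from scipy import stats

alpha = 0.05
for df in (22, 2):
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
    print(f"df = {df}: critical t = {t_crit:.2f}")
# df = 22: critical t = 2.07
# df = 2:  critical t = 4.30
```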

Typically, preference is still given to parametric criteria, and we adhere to this position. They are considered more reliable and can provide more information and deeper analysis. As for the complexity of the mathematical calculations, with computer programs that complexity disappears (though some new, quite surmountable, difficulties appear).

  • In this textbook we do not consider in detail statistical hypotheses (null, H0, and alternative, H1) or the statistical decisions made on their basis, since psychology students study this separately in the discipline “Mathematical Methods in Psychology”. Note also that in a research report (coursework, thesis, publication) statistical hypotheses and statistical decisions are, as a rule, not stated. Usually, when describing results, one indicates the criterion used, the necessary descriptive statistics (means, sigmas, correlation coefficients, etc.), the empirical values of the criteria, the degrees of freedom, and necessarily the p-level of significance. A substantive conclusion about the tested hypothesis is then formulated, indicating (usually in the form of an inequality) the significance level achieved or not achieved.

The level of significance in statistics is an important indicator that reflects the degree of confidence in the accuracy and truth of the obtained (predicted) data. The concept is widely used in various fields: from conducting sociological research to statistical testing of scientific hypotheses.

Definition

The level of statistical significance (or of a statistically significant result) reflects the probability that the studied indicators arose by chance. The overall statistical significance of a phenomenon is expressed by the p-value (p-level). In any experiment or observation there is a possibility that the data obtained arose through sampling error; this is especially true for sociology.

That is, a statistically significant value is one whose probability of random occurrence is extremely small. “Extreme” in this context refers to the degree to which a statistic deviates from the null hypothesis (the hypothesis tested for consistency with the sample data). In scientific practice, the significance level is selected before data collection and, as a rule, is set at 0.05 (5%). For systems where precise values are critically important, the figure may be 0.01 (1%) or less.

Background

The concept of significance level was introduced by the British statistician and geneticist Ronald Fisher in 1925, when he was developing a technique for testing statistical hypotheses. When analyzing any process, there is a certain probability of certain phenomena. Difficulties arise when working with small (or not obvious) percentages of probabilities that fall under the concept of “measurement error.”

When working with statistical data that is not specific enough to test them, scientists are faced with the problem of the null hypothesis, which “prevents” operating with small quantities. Fisher proposed for such systems to determine the probability of events at 5% (0.05) as a convenient sampling cut that allows one to reject the null hypothesis in calculations.

Introduction of fixed odds

In 1933, Jerzy Neyman and Egon Pearson recommended in their works that a certain significance level be established in advance (before data collection). The use of these rules is clearly visible during elections. Suppose there are two candidates, one very popular and the other little known. It is obvious that the first candidate will win and that the chances of the second tend to zero. They tend to zero, but are not equal to it: there is always the possibility of force majeure, sensational information, or unexpected decisions that can change the predicted outcome.

Neyman and Pearson agreed that Fisher's significance level of 0.05 (denoted by α) was the most appropriate. However, Fisher himself in 1956 opposed fixing this value, believing that the level of α should be set according to the specific circumstances. In particle physics, for example, it is 0.01.

p-level value

The term p-value was first used by Brownlee in 1960. The p-level (p-value) is an indicator inversely related to the reliability of the results: the higher the p-value, the lower the confidence in the relationship between variables found in the sample.

This value reflects the likelihood of error associated with interpreting the results. Assume p-level = 0.05 (1/20). This indicates a five percent probability that the relationship between variables found in the sample is merely a random feature of that sample. That is, if no such relationship exists in the population, then in repeated similar experiments, on average one study in twenty would show the same or a stronger relationship between the variables. The p-level is often seen as a “margin” for the error rate.

Note that the p-value may not reflect the real relationship between variables, but only shows a certain average value within the assumptions made. In particular, the final analysis of the data will also depend on the chosen value of this coefficient: at p-level = 0.05 there will be one set of results, and at a coefficient of 0.01 another.

Testing statistical hypotheses

The level of statistical significance is especially important when testing hypotheses. For example, when calculating a two-tailed test, the rejection region is divided equally at both ends of the sampling distribution (relative to the zero coordinate) and the truth of the resulting data is calculated.

Suppose, when monitoring a certain process (phenomenon), it turns out that new statistical information indicates small changes relative to previous values. At the same time, the discrepancies in the results are small, not obvious, but important for the study. The specialist is faced with a dilemma: are changes really occurring or are these sampling errors (measurement inaccuracy)?

In this case, the null hypothesis is either retained or rejected (everything is attributed to error, or the change in the system is recognized as a fact). The decision rests on the ratio of the overall statistical significance (p-value) to the significance level (α): if p-level < α, the null hypothesis is rejected; the smaller the p-value, the more significant the test statistic.

Values ​​used

The level of significance depends on the material being analyzed. In practice, the following fixed values are used:

  • α = 0.1 (or 10%);
  • α = 0.05 (or 5%);
  • α = 0.01 (or 1%);
  • α = 0.001 (or 0.1%).

The more accurate the calculations are required, the lower the α coefficient is used. Naturally, statistical forecasts in physics, chemistry, pharmaceuticals, and genetics require greater accuracy than in political science and sociology.

Significance thresholds in specific areas

In high-precision fields such as particle physics and manufacturing, statistical significance is often expressed in numbers of standard deviations (denoted by sigma, σ) of a normal probability distribution (Gaussian distribution). σ is a statistical indicator of the dispersion of a quantity's values around its mathematical expectation, and it is used when plotting the probabilities of events.

The required σ varies greatly by field of knowledge. For example, the criterion for claiming the existence of the Higgs boson was set at σ = 5, which corresponds to a p-value of about 1 in 3.5 million. In genome studies, a significance level of 5 × 10⁻⁸ is common for the field (see the sketch below for the σ-to-p conversion).
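
A minimal sketch of that σ-to-p conversion, assuming the usual one-sided convention (the tail probability of a standard normal distribution beyond z standard deviations):

```python
# Convert sigma thresholds to p-values: P(Z > z) for a standard normal.
from scipy import stats

for z in (1, 2, 3, 5):
    p = stats.norm.sf(z)  # survival function, the upper-tail probability
    print(f"{z} sigma -> p = {p:.3g} (about 1 in {1 / p:,.0f})")
# 5 sigma -> p = 2.87e-07, i.e. about 1 in 3.5 million
```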

Efficiency

It must be borne in mind that the coefficients α and p-value are not exact characteristics. Whatever the significance level of the phenomenon under study, it is not an unconditional basis for accepting the hypothesis. The smaller the value of α, the greater the confidence that an established effect is real; at the same time, a stricter threshold increases the risk of missing a true effect, which reduces the statistical power of the study.

Researchers who focus solely on statistically significant results may reach erroneous conclusions, and their work is difficult to double-check because it rests on assumptions (which, in effect, the α and p-values are). Therefore, along with statistical significance, it is always recommended to calculate another indicator: the effect size, a quantitative measure of the strength of an effect.


