Random variable and general sampling forecasting scheme

Unlike the intuitive forecasting methods discussed in the previous chapter, the methods studied from here on are formal. They are called formal because they rest on sufficiently formalized (precisely described) actions that must be performed to obtain a forecast.

The problem of forecasting from a sample arises in two situations. In the first, data are available on several objects whose properties are similar to those of the forecasting object, and the state of the forecasting object must be predicted from this information. In the second, data are available on the past states of the forecasting object itself, and its future state must be predicted from them. In the first case the data are called cross-sectional data; in the second, time-series data.

In the vast majority of cases, the quantities used in economics fluctuate continuously. For example, even with constant production volumes and resource prices, costs fluctuate because of inevitable deviations in production processes; revenue and profit fluctuate from period to period, as do all macroeconomic parameters. All of these are random, or stochastic, quantities (from the Greek stochastikos, which relates to guessing). From the standpoint of forecasting, a random variable is a quantity whose next value cannot be predicted exactly.

Random variables are usually divided into two types:

continuous random variable – a random variable that can take any value within a given range;

discrete random variable – a random variable that can take only values from a fixed set of permissible values (the number of children in a family is a non-negative integer, the inflation rate is usually reported to the nearest tenth of a percent, etc.). Almost any quantity rounded to a certain accuracy is discrete.

All possible values of a random variable, both past and future, both known and unknown, are called the general population of the random variable. Any part of this population is called a sample. In forecasting, the general population is never fully known, because otherwise there would be no forecasting task. Only a sample from the general population is known, i.e. the values $x_1, x_2, \ldots, x_n$ of the general population, from which its subsequent values $x_{n+1}, x_{n+2}, \ldots$ must be predicted. Since, by definition, a random variable takes a different value each time it appears, it is necessary not only to predict its next value but also to indicate the limits within which its fluctuations should be expected. In other words, in addition to the forecast itself, it is necessary to assess its accuracy.

Thus, in its mathematical formulation, the problem of forecasting from a sample is as follows. We have a sample $x_1, x_2, \ldots, x_n$ of size $n$ from the general population. From this sample it is necessary to estimate the parameters of the general population and, on their basis, to forecast the next value of the random variable and to indicate some measure of the accuracy of the forecast obtained.
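As a minimal sketch of this formulation, the snippet below estimates two parameters of the general population from a sample and turns them into a forecast with a crude accuracy measure. Using the sample mean as the forecast and the sample standard deviation as the accuracy measure is an assumption made here purely for illustration; the function name `forecast_from_sample` and the data are hypothetical.

```python
import statistics


def forecast_from_sample(sample):
    """Return (point forecast, spread) estimated from a sample.

    Assumption for illustration: the forecast is the sample mean and
    the accuracy measure is the sample standard deviation.
    """
    forecast = statistics.mean(sample)
    spread = statistics.stdev(sample)  # divides by n - 1
    return forecast, spread


# Made-up illustrative data, e.g. monthly costs.
sample = [102.0, 98.0, 101.0, 99.0, 100.0]
forecast, spread = forecast_from_sample(sample)
print(f"forecast = {forecast:.2f}, typical fluctuation = +/- {spread:.2f}")
```

Other estimators (a median, a trend, an interval based on a chosen distribution) would fit the same scheme; only the two lines that compute `forecast` and `spread` would change.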

Preliminary data analysis.

For convenience in the subsequent analysis, the sample values are usually sorted in ascending order; in spreadsheets this operation is almost instantaneous. If the sample is large and the random variable is discrete, the sample may contain many repeated values, and it is then more convenient to represent the sample as two rows of numbers:

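$x_1, x_2, \ldots, x_k$

and

$n_1, n_2, \ldots, n_k$,

where: $x_i$ are the distinct values of the random variable;

$n_i$ is the number of repetitions of the i-th value.

The second option for representing a sample of a discrete random variable is to calculate the probability (relative frequency) of the appearance of the i-th value of the random variable using the formula:

$p_i = n_i / n$, where $n = n_1 + n_2 + \ldots + n_k$ is the sample size,

and to represent the sample as the pairs $x_i$, $p_i$.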
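Both representations of a discrete sample described above (values with their repetition counts, and values with their relative frequencies) can be sketched in a few lines. The function name `frequency_table` and the data are hypothetical; the relative frequency is taken as the count divided by the sample size.

```python
from collections import Counter


def frequency_table(sample):
    """Return a sorted list of (value, count, relative frequency)."""
    counts = Counter(sample)
    n = len(sample)
    return [(x, counts[x], counts[x] / n) for x in sorted(counts)]


# Made-up sample of a discrete random variable,
# e.g. number of children per family.
sample = [2, 0, 1, 2, 3, 1, 2, 0, 1, 2]
for value, count, freq in frequency_table(sample):
    print(value, count, freq)
```

The relative frequencies always sum to one, which is a convenient check on the table.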

The main part of the preliminary data analysis is the construction of a histogram of the random variable from the sample data. A histogram is a bar chart whose horizontal axis is divided into intervals of the random variable, usually of equal width, and whose vertical axis shows the number of sample values that fall into each interval.
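A spreadsheet does this counting automatically, but the logic is simple enough to sketch by hand. The example below assumes equal-width intervals spanning the sample from its minimum to its maximum, with the maximum value assigned to the last interval; the function name `histogram_counts` and the data are hypothetical.

```python
def histogram_counts(sample, bins):
    """Count sample values in `bins` equal-width intervals.

    Returns (edges, counts). The largest value is assigned to the
    last interval so that every sample value is counted once.
    """
    lo, hi = min(sample), max(sample)
    if hi == lo:
        raise ValueError("all sample values are equal; no intervals to build")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
edges, counts = histogram_counts(sample, bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f} | " + "#" * count)
```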

If the resulting histogram has more than one peak (Fig. 5, a), this is a signal that the initial data are not a sample of one random variable but a mixture of two samples of two different random variables. For example, instead of cross-sectional data on objects of the same class, the data may describe objects belonging to two different classes, or the data on the past states of the forecasting object may refer to two different states: before some structural change and after it. In all such cases a significant error is introduced into the forecast, because the forecasting object belongs to only one class, or is in only one state (the one after the change), rather than both. Therefore, if the histogram has more than one peak, the original data should be carefully analyzed in order to remove the data that are not related to the forecasting object.
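The check for more than one peak (vertex) can be roughed out numerically by counting local maxima in the list of interval counts. This is only a crude aid and not a statistical test: on small samples random fluctuations easily produce spurious peaks, so the result should be read together with the picture. The function name `count_peaks` and the counts are hypothetical.

```python
def count_peaks(counts):
    """Count local maxima in a list of histogram interval counts."""
    # Collapse runs of equal neighbouring counts so a flat top counts once.
    collapsed = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    # Pad with zeros so that peaks in the edge intervals are detected too.
    padded = [0] + collapsed + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


print(count_peaks([1, 4, 6, 3, 1]))        # single-peaked histogram
print(count_peaks([2, 5, 3, 1, 4, 6, 2]))  # two separate peaks
```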

Outliers on the histogram (Fig. 5, b) also deserve careful attention, especially if they lie at some distance from the main body of the histogram. The data corresponding to outliers are worth studying in detail, since they usually signal failures in the process under study or other deviations from the usual course of affairs, including cases of abuse, theft, etc.
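To list candidate outliers for closer inspection, one common rule of thumb (not prescribed by the text, and used here only as an assumption) flags values lying more than 1.5 interquartile ranges beyond the first or third quartile. The function name `flag_outliers`, the multiplier `k` and the data are hypothetical.

```python
import statistics


def flag_outliers(sample, k=1.5):
    """Return sample values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in sample if x < low or x > high]


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
print(flag_outliers(sample))
```

Flagged values are candidates for investigation, not for automatic deletion: as noted above, they often carry the most interesting information about the process.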

Finally, the appearance of the histogram allows an approximate judgment about the nature of the distribution of the random variable. If the histogram resembles a symmetrical single-peaked figure, further forecasting work can be carried out under the assumption that the random variable has a normal distribution. This distribution is the simplest to work with, because it has been thoroughly studied theoretically and a wide variety of techniques and processing methods have been developed for it. If this is not the case (Fig. 5, c), some other special distribution has to be used, which usually complicates the analysis.

Fig. 5. Histograms of random variables.
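As a rough numerical companion to judging symmetry by eye, the sample skewness coefficient can be computed: it is close to zero for a symmetrical sample and positive when the histogram has a long right tail. This is a supplementary indicator added here as an assumption, not a test of normality, and on small samples it is unreliable. The function name `sample_skewness` and the data are hypothetical.

```python
import statistics


def sample_skewness(sample):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    n = len(sample)
    mean = statistics.fmean(sample)
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    if m2 == 0:
        raise ValueError("all sample values are equal; skewness is undefined")
    return m3 / m2 ** 1.5


print(sample_skewness([1, 2, 3, 4, 5]))      # symmetrical sample
print(sample_skewness([1, 1, 2, 2, 3, 10]))  # long right tail
```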

It should be noted that visual analysis of the initial data from the appearance of the histogram is approximate: firstly, the appearance of the histogram can change significantly when the number of intervals changes, and secondly, it provides no numerical criteria for confirming or rejecting a particular assumption. There are more reliable statistical methods for testing the assumptions discussed above, but they require special processing methods and large samples (usually more than 50 to 100 points), which are rare in forecasting practice. Visual analysis makes it possible, at low cost, either to obtain a result or to identify the cases in which special statistical processing is required. Therefore, given how easy histograms are to build in spreadsheets, constructing one should be regarded as a mandatory step in building a forecast.
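The dependence on the number of intervals is easy to see by counting the same sample into different numbers of equal-width intervals. The sketch below is self-contained and assumes intervals spanning the sample from its minimum to its maximum; the function name `interval_counts` and the data are hypothetical.

```python
def interval_counts(sample, bins):
    """Counts of sample values in `bins` equal-width intervals."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # The largest value goes into the last interval.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
for bins in (2, 4, 8):
    print(bins, interval_counts(sample, bins))
```

With two intervals the isolated value at the right edge is hidden inside a wide bar, while with eight it stands apart from the main body, so it is worth trying several interval counts before drawing conclusions.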