Sampling — Postmypost

Sampling

Nikiforov Alexander
Friend of clients

What is Sampling?

Sampling is the process of analyzing a small portion of a dataset in order to estimate the characteristics and parameters of the whole. The term comes from the English word “sample,” meaning a specimen or example. On the Russian-speaking internet the word is spelled in two ways: "семплирование" and "сэмплирование." The first spelling is considered correct, but both are used in everyday speech.

In mathematics, sampling covers a set of methods for forming a sample, that is, selecting a small part of the data from a large volume of information. The principle can be illustrated with an example: to find out what a pizza tastes like, you do not need to eat the whole pizza. Trying one slice is enough, provided the slice is representative of the rest. In the same way, conclusions about the characteristics of a large body of data are drawn by examining only part of it.

When is Data Sampling Used?

Data sampling is an important element of various analytical tools. For instance, Google Analytics and Yandex.Metrica use sampling when processing large volumes of information and preparing web analytics reports, especially when the number of sessions exceeds a set limit.

Consider an example: if 100 users visit a website and 11 of them arrive through a link from social media, the program can easily track every action and generate a report. When 10,000,000 users visit the site, however, processing every action for each report is slow and requires significant computational resources. To reduce the workload, the program may analyze a 10% sample, that is, 1,000,000 users, and extrapolate the results to the entire audience.
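
The extrapolation step in this example can be sketched in Python. This is only an illustration of the arithmetic, not a description of how any analytics service is implemented. The simulated visitor list (scaled down to 1,000,000 visitors so it runs quickly), the 11% share of social media visitors, and the function name `estimate_total` are all assumptions made for the demonstration.

```python
import random

def estimate_total(population, sample_fraction, predicate, seed=0):
    """Estimate how many items satisfy `predicate` using a simple random sample."""
    rng = random.Random(seed)
    sample_size = int(len(population) * sample_fraction)
    # Draw the sample without replacement.
    sample = rng.sample(population, sample_size)
    hits = sum(1 for item in sample if predicate(item))
    # Scale the count observed in the sample up to the full population.
    return hits * len(population) / sample_size

# Hypothetical traffic: True means the visitor came from social media.
traffic_rng = random.Random(42)
visitors = [traffic_rng.random() < 0.11 for _ in range(1_000_000)]

true_total = sum(visitors)                             # full count, no sampling
estimate = estimate_total(visitors, 0.10, lambda v: v)  # 10% sample, extrapolated

print("true total:", true_total)
print("estimate from 10% sample:", round(estimate))
```

The estimate is close to the true total but rarely identical to it, which is the trade-off sampling makes: less data to process in exchange for a small error.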

Sampling in Google Analytics

Google Analytics does not apply sampling to standard reports. Full data is available in sections such as “Audience,” “Acquisition,” “Behavior,” and “Conversions.” However, sampling may occur in the following cases:

  • When processing ad hoc (custom) queries where the volume of data exceeds the limit of 500,000 sessions (or 100,000,000 sessions for Google Analytics 360).
  • When modifying reports on multi-channel funnels that track a user’s journey from first contact with the company to purchase.
  • When adding parameters and filters, where the maximum sample size is 1,000,000 sessions.

To understand which data the service uses to build a report, simply pay attention to the color of the shield icon. A green icon indicates full data, while an orange icon signifies the use of sampling.

Sampling in Yandex.Metrica

Yandex.Metrica also applies sampling when compiling analytical reports. The limit is 500,000 sessions in the standard version, while the "Metrica Pro" service has no such restriction. Unlike Google Analytics, Yandex.Metrica does not sample reports in the "Direct" category. To find out whether a specific report is sampled, check the value of the "Accuracy" metric. If it equals 100%, the data is complete; otherwise, the report is based on a sample.

Disadvantages of Data Sampling

The main disadvantage of sampling is that not all of the data is analyzed, so important information can be lost. When working with a sample, there is a risk of missing details or trends that would have been visible in the full dataset. For example, to learn every color and size in a box of balls, you would have to examine each ball. If there are too many balls and only some of them are inspected, the rarer colors may never turn up. Sampling reduces analysis time and server load at the cost of accuracy, and it cannot always be avoided entirely.
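
The risk in the box-of-balls example can be quantified. The sketch below assumes a simple random sample drawn without replacement and computes the probability that a rare color does not appear in it at all. The box size, the number of rare balls, and the function name `prob_color_missed` are hypothetical values chosen for illustration.

```python
from math import comb

def prob_color_missed(total, rare, sample_size):
    """Probability that a random sample (without replacement) contains
    none of the `rare` items out of `total`."""
    # Ways to pick the sample from non-rare items only,
    # divided by all possible ways to pick the sample.
    return comb(total - rare, sample_size) / comb(total, sample_size)

# Hypothetical box: 10,000 balls, only 10 of them green.
small_sample = prob_color_missed(10_000, 10, 100)    # inspect 1% of the box
large_sample = prob_color_missed(10_000, 10, 5_000)  # inspect 50% of the box

print(f"chance of missing green with a 1% sample: {small_sample:.3f}")
print(f"chance of missing green with a 50% sample: {large_sample:.5f}")
```

Under these assumptions a small sample will usually miss the rare color, while a much larger one almost never does. This is why segments with little traffic are the ones most affected by sampling.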

How to Avoid Data Sampling in Reports?

To minimize the impact of sampling and improve the accuracy of reports, the following steps can be taken:

  • Reduce the analysis period by creating a report for a shorter time frame.
  • Increase the volume and accuracy of the sample by using appropriate settings in Google Analytics and Yandex.Metrica.
  • Utilize additional tools such as "Metrica Pro" or Google Analytics 360, as well as BI systems and alternative services.
  • Create a separate account for each website to simplify data management and avoid overload.
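
The first step, shortening the analysis period, is often applied by splitting a long date range into shorter windows, requesting a report for each window, and adding up the results. The sketch below shows that idea under stated assumptions: `fetch_sessions` is a hypothetical placeholder that returns fake numbers, not a real Google Analytics or Yandex.Metrica call, and the 31-day window is an arbitrary choice.

```python
from datetime import date, timedelta

def split_date_range(start, end, max_days):
    """Yield (chunk_start, chunk_end) pairs that cover start..end inclusive,
    each spanning at most `max_days` days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=max_days - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)

def fetch_sessions(start, end):
    # Hypothetical stand-in for a reporting API request.
    # It pretends the site receives exactly 1,000 sessions per day.
    return ((end - start).days + 1) * 1000

chunks = list(split_date_range(date(2024, 1, 1), date(2024, 3, 31), 31))
total = sum(fetch_sessions(s, e) for s, e in chunks)

for s, e in chunks:
    print(s, "to", e)
print("total sessions:", total)
```

Summing per-window results is only valid for additive metrics such as sessions or page views. Metrics such as unique users cannot be added across windows, because the same person may appear in more than one of them.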