Understanding Data Noise

Analysis of interactions with a site commonly involves data sets that include 'noise' that may affect results. Noise is data that does not typically reflect the main trends, making those trends more difficult to identify.
Providing useful customer analysis requires an understanding of this noise and the ways in which it can be cleansed to reveal informative data segments. This document describes the role of noise in visitor analysis and the process for identifying useful data sets.
Introduction
Reporting on customer interactions with a site should enable clients to gain insight into the research-to-buy cycle for their customers.
However, it is important to understand that this information can be affected by both (i) cookie deletion rates and (ii) noisy data sets. Both factors are managed by understanding how they affect the data and, therefore, which data to treat as significant. As with any statistical analysis, choosing the appropriate data segments is crucial to obtaining actionable data.
Cookie Deletion
Identifying visitors relies upon cookies. Over time, the likelihood that users will delete cookies increases. Once a cookie is deleted, the visitor can no longer be identified and is treated as a new visitor. As such, it is important to choose a data range that keeps this effect to a minimum.
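The effect of the reporting window on cookie loss can be illustrated with a minimal sketch. It assumes, purely for illustration, that a fixed fraction of visitors delete their cookies each month; the 10% rate and the function name surviving_fraction are hypothetical and are not figures from this document.

```python
def surviving_fraction(monthly_deletion_rate, months):
    """Fraction of cookies still present after `months`, assuming a
    constant (hypothetical) monthly deletion rate."""
    return (1 - monthly_deletion_rate) ** months


# Hypothetical rate: 10% of visitors delete their cookies each month.
for months in (1, 3, 12):
    share = surviving_fraction(0.10, months)
    print(f"{months:>2} months: {share:.1%} of visitors still identifiable")
```

Under this assumption, the longer the reporting window, the larger the share of returning visitors who would be miscounted as new, which is why a shorter range keeps the effect small.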
Noisy Data Sets
Data sets containing cookie information can be extremely long-tailed. This means that the majority of the traffic is represented by a small segment, with the rest of the data accounted for by individuals or very small groups.
The choice of measurement for such data sets is thus crucial – attempting to use the mean average would skew the information in favour of individual anomalies and would fail to represent the behaviour of the majority of the traffic. For example, one conversion occurring after 500 days could shift the mean average of a conversion latency report significantly. This would not reflect the majority of behaviour and would be less useful to the client than a figure that was less sensitive to random anomalies.
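The sensitivity of the mean to a single anomaly can be shown with a small sketch. The latency values below are invented for illustration, and the median is used here as one example of a measure that is less sensitive to outliers; the document itself does not name a specific alternative.

```python
from statistics import mean, median

# Hypothetical conversion latencies in days: nine typical conversions
# plus one anomalous conversion after 500 days.
typical = [1, 1, 2, 2, 3, 3, 4, 5, 7]
with_outlier = typical + [500]

# The mean moves sharply when the single outlier is added.
mean_typical = mean(typical)
mean_outlier = mean(with_outlier)

# The median stays with the bulk of the traffic.
median_typical = median(typical)
median_outlier = median(with_outlier)

print(f"mean without outlier:   {mean_typical:.1f} days")
print(f"mean with outlier:      {mean_outlier:.1f} days")
print(f"median without outlier: {median_typical} days")
print(f"median with outlier:    {median_outlier} days")
```

In this made-up sample, one 500-day conversion lifts the mean from roughly 3 days to over 50, while the median remains at 3 days, so the median continues to describe the behaviour of most visitors.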

