Notes on Know Your Data: The Stats behind the Alerts

I came across this amazing video from Dave McAllister at the “WTF is SRE 2023” conference on the statistics behind alerts.


Thanks to Dave for his clear explanation of the various concepts.

Here are my notes on the video:

  • Mean vs Median vs Mode
    • Mean - Measure of central tendency; the average value
    • Median - In a sorted dataset, the middle value
    • Mode - The most frequently occurring value in a dataset
  • Arithmetic mean - Not the 50% mark. Useful for comparing against previous conditions. When working with time series, the arithmetic mean needs to be recalculated consistently as new data arrives.
    • Moving/Block average
  • Geometric mean - For things growing exponentially: multiply all n values together and take the nth root.
    • Number of deploys/unit, MTTR, Throughput calculations
  • Harmonic mean - Performance when multiple different systems are involved.
    • Great for latency/throughput
    • Great for complex environments.
    • Divide n by the sum of reciprocals: n / (1/x_1 + 1/x_2 + … + 1/x_n)
    • Robust to large outliers; it weights the lowest values most heavily.
    • Throughput when a single system is involved rather than multiple systems
  • For positive values: arithmetic mean ≥ geometric mean ≥ harmonic mean (see the means sketch after these notes)
  • Harmonic and geometric means are only defined for datasets of positive (non-zero) values
  • Median
    • Always the 50% point of the data (for a normal curve it coincides with the mean)
    • The mean is skewed by outliers and doesn’t recover quickly from spikes; the median is robust to them (see the median sketch after these notes)
    • Response time monitoring, anomaly detection, capacity planning
  • Mode
    • Most frequently occurring value
    • Log Analysis or Security monitoring.
  • Probability - the possibility of an event happening
  • Statistics - a summary of information about what has already happened.
  • Distributions
    • Normal - Data distributed symmetrically around the mean
      • Bell curve, not percent based
      • Lead time measurement, anomaly analysis, SLO/SLI calculation
    • Poisson - Used to model the occurrence of rare events
    • Beta - Used to track success/failure of binomial events
    • Exponential - Time between async events
      • Models the rate, i.e. the time between unrelated, independent events (see the exponential sketch after these notes)
      • Network performance, user requests, messaging service, system failures
    • Weibull - Likelihood of failure
      • defined by a shape and a scale parameter
    • Log normal - Values produced by the combined effect of many small events
  • Descriptive vs Inferential statistics
    • Descriptive statistics use the whole dataset to draw conclusions.
      • Used for visualization, can define+extract trends
    • Inferential uses sampled data to draw conclusions.
      • Used for predictions or hypothesis testing; can also be used for visualization.
      • Leads to sampling.
  • Monitoring is becoming a data problem. Observability signals (metrics, traces, logs) add to the amount of data being ingested, which creates the need for sampling.
  • Sampling - a necessary evil: it can give false indications and changes the analysis from descriptive to inferential statistics (see the sampling sketch after these notes)
    • Random sampling
    • Stratified sampling
    • Cluster sampling
    • Systematic sampling
    • Purposive sampling
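
A minimal Python sketch of the three means from the notes above (the latency values are made up for illustration), showing how each one treats an outlier spike:

```python
import math

def arithmetic_mean(xs):
    # Sum of the values divided by their count.
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # nth root of the product; only defined for positive values.
    return math.prod(xs) ** (1 / len(xs))

def harmonic_mean(xs):
    # n divided by the sum of reciprocals; only defined for positive values.
    return len(xs) / sum(1 / x for x in xs)

latencies_ms = [12, 15, 14, 90, 13]  # hypothetical sample with one spike

print(arithmetic_mean(latencies_ms))  # ~28.8 — pulled up by the 90 ms outlier
print(geometric_mean(latencies_ms))   # ~19.7 — dampens the spike
print(harmonic_mean(latencies_ms))    # ~16.2 — weighted toward the smallest values
```

The arithmetic ≥ geometric ≥ harmonic ordering holds for any positive dataset; Python’s standard `statistics` module also ships `geometric_mean` and `harmonic_mean` if you’d rather not roll your own.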
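A small median sketch (again with invented response times), showing why the median resists spikes that drag the mean around:

```python
import statistics

response_times_ms = [110, 120, 115, 118, 2500, 119, 121]  # one very slow request

print(statistics.mean(response_times_ms))    # ~457.6 — dragged far above typical behaviour
print(statistics.median(response_times_ms))  # 119 — still the 50% point of the sorted values
```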
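For the exponential distribution, a quick simulation of inter-arrival times between independent events; the request rate here is a made-up number:

```python
import random

rate_per_second = 5.0  # hypothetical: ~5 independent requests per second

# expovariate(rate) draws one exponentially distributed gap between events.
gaps = [random.expovariate(rate_per_second) for _ in range(10_000)]

mean_gap = sum(gaps) / len(gaps)
print(f"mean gap ≈ {mean_gap:.3f}s, expected 1/rate = {1 / rate_per_second:.3f}s")

# Counting how many events land in each fixed time window instead gives a
# Poisson-distributed count — the two distributions describe the same process.
```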
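Finally, a sketch contrasting three of the sampling strategies listed above on hypothetical trace records (the `service` field is an invented stratum):

```python
import random

# Hypothetical trace records; 'service' acts as the stratum for stratified sampling.
records = [{"id": i, "service": random.choice(["api", "db", "cache"])}
           for i in range(1_000)]
sample_size = 99

# Random sampling: every record has an equal chance of selection.
random_sample = random.sample(records, sample_size)

# Systematic sampling: every k-th record after a random starting offset.
k = len(records) // sample_size
start = random.randrange(k)
systematic_sample = records[start::k][:sample_size]

# Stratified sampling: sample within each stratum so low-volume services
# are not drowned out by high-volume ones.
stratified_sample = []
for service in ("api", "db", "cache"):
    stratum = [r for r in records if r["service"] == service]
    stratified_sample.extend(random.sample(stratum, min(33, len(stratum))))
```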