# Statistics Formulas

Mean

The mean (average) is calculated by summing up all the values in a dataset and then dividing the sum by the total number of values. It represents the central tendency of the data.

Formula: Mean = (Σx) / n

Where:

• Mean is the average
• Σx is the sum of all values in the dataset
• n is the total number of values in the dataset
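As a quick illustration, the mean formula above can be sketched in plain Python. The function name and the sample values are made up for this example.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from a real dataset
print(mean(data))  # 5.0
```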

Median

The median is the middle value in a dataset when the values are arranged in ascending order.

If there is an even number of values, the median is the average of the two middle values.

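Formula (Odd number of values): Median = Value at position (n + 1) / 2

Formula (Even number of values): Median = (Value at position n/2 + Value at position (n/2 + 1)) / 2

Positions are counted from 1 in the sorted data, and n is the total number of values.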
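Here is a small sketch of both median cases in Python. Note that Python lists are indexed from 0, so the positions in the formulas shift by one. The function name and inputs are only for illustration.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```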

Minimum

The minimum is the smallest value in a dataset.

Formula: Minimum = Smallest Value

Maximum

The maximum is the largest value in a dataset.

Formula: Maximum = Largest Value

Range

The range is the difference between the maximum and minimum values in a dataset. It provides a measure of the spread or variability in the data.

Formula: Range = Maximum - Minimum

Midrange

The midrange is the average of the maximum and minimum values in a dataset.

Formula: Midrange = (Maximum + Minimum) / 2
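The minimum, maximum, range, and midrange can all be sketched together, since the last two are built from the first two. The variable names and values below are illustrative.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

minimum = min(data)                 # smallest value: 2
maximum = max(data)                 # largest value: 9
data_range = maximum - minimum      # spread between the extremes: 7
midrange = (maximum + minimum) / 2  # halfway between the extremes: 5.5

print(minimum, maximum, data_range, midrange)  # 2 9 7 5.5
```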

Count

The count represents the total number of values in a dataset.

Sum

The sum is the total of all values in a dataset.

Formula: Sum = Σx

Where:

• Σx means adding up every value x in the dataset

Percentile

A percentile represents the value below which a given percentage of the data falls. It is often used to identify specific data points in a distribution.
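There are several conventions for computing a percentile, and the definition above does not prescribe one. The sketch below assumes linear interpolation between the two nearest sorted values, which is one common choice. The function name and data are hypothetical.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    if not values:
        raise ValueError("the percentile of an empty dataset is undefined")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Fractional 0-based position of the percentile within the sorted data.
    position = (len(ordered) - 1) * p / 100
    lower = int(position)  # index of the value at or below the position
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [15, 20, 35, 40, 50]  # illustrative values
print(percentile(data, 25))  # 20.0
print(percentile(data, 50))  # 35.0
```

Other conventions (for example, nearest-rank) can give slightly different answers on small datasets.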

Quartile

Quartiles are the three values (Q1, Q2, and Q3) that divide a sorted dataset into four parts, each containing about 25% of the data. Q2 is the median. Quartiles are often used to assess the spread of data.
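As a rough sketch, Python's standard library can compute the three quartile cut points with `statistics.quantiles`. The choice of `method="inclusive"` is an assumption made for this example; the default `"exclusive"` method gives different values on the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative values

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)  # 2.75 4.5 6.25
```

Here the middle cut point (4.5) matches the median of the data.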

Sum of Squares

The sum of squares is the sum of the squares of the differences between each data point and the mean. It is a key component in calculating variance and standard deviation.

Formula: Sum of Squares = Σ(x - Mean)²

Where:

• Σ represents the summation symbol
• x is each data point
• Mean is the mean (average) of the dataset
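The sum of squares formula can be sketched in a few lines of Python. The function name and data are illustrative.

```python
def sum_of_squares(values):
    """Sum of squared deviations of each value from the mean."""
    if not values:
        raise ValueError("the sum of squares of an empty dataset is undefined")
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean is 5
print(sum_of_squares(data))  # 32.0
```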

Standard Deviation

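The standard deviation measures the amount of variation or dispersion in a dataset. It indicates how spread out the data points are from the mean, and it is the square root of the variance. The formula below is the sample standard deviation, which divides by (n - 1). For an entire population, divide by n instead.

Formula (Sample Standard Deviation): Standard Deviation = √(Σ(x - Mean)² / (n - 1))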

Where:

• √ represents the square root
• Σ represents the summation symbol
• x is each data point
• Mean is the mean (average) of the dataset
• n is the total number of values in the dataset
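To make the (n - 1) formula concrete, here is a minimal hand-written version next to the standard library's `statistics.stdev`, which uses the same divisor. The function name `sample_std` and the data are made up for this example.

```python
import math
import statistics


def sample_std(values):
    """Standard deviation with the (n - 1) divisor, following the formula above."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(round(sample_std(data), 4))        # 2.1381
print(round(statistics.stdev(data), 4))  # 2.1381, same (n - 1) divisor
print(statistics.pstdev(data))           # 2.0, divides by n instead
```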

Variance

The variance is a measure of the spread or dispersion of a dataset. It is the average of the squared differences between each data point and the mean.

Formula (Population Variance): Variance (σ²) = Σ(x - Mean)² / N

Where:

• Σ represents the summation symbol
• x is each data point
• Mean is the mean (average) of the dataset
• N is the total number of values in the population

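Note: When working with a sample rather than the whole population, use the sample variance formula, which divides by (n - 1), where n is the sample size, instead of N. This adjustment (known as Bessel's correction) compensates for the tendency of a sample to underestimate the population variance, because deviations are measured from the sample mean rather than the true population mean.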
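The difference between the two divisors is easy to see in code. This is a minimal sketch; the `sample` flag is a hypothetical parameter name chosen for this example.

```python
def variance(values, sample=False):
    """Population variance by default; pass sample=True to divide by (n - 1)."""
    n = len(values)
    divisor = n - 1 if sample else n
    if divisor < 1:
        raise ValueError("not enough values for this variance")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / divisor


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(variance(data))                         # 4.0
print(round(variance(data, sample=True), 4))  # 4.5714
```

The sample variance is always a little larger than the population variance computed from the same numbers, and the gap shrinks as the dataset grows.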

Z-Score

The Z-score measures how many standard deviations a data point lies above or below the mean. It is used to standardize data, so that values from datasets with different scales can be compared. The data does not need to follow a normal distribution for the Z-score to be calculated.

Formula: Z-Score = (x - Mean) / Standard Deviation

Where:

• x is the data point
• Mean is the mean (average) of the dataset
• Standard Deviation is the standard deviation of the dataset
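Here is a small sketch that standardizes a whole dataset. It assumes the population standard deviation (`statistics.pstdev`); swap in `statistics.stdev` if the data is a sample. The function name is illustrative.

```python
import statistics


def z_scores(values):
    """Z-score of every value, using the population standard deviation (an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / std for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values: mean 5, std 2
print(z_scores(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```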

Interquartile Range (IQR)

The interquartile range is the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile) of a dataset. It measures the spread of the middle 50% of the data.

Formula: IQR = Q3 - Q1

Where:

• Q1 is the first quartile (25th percentile)
• Q3 is the third quartile (75th percentile)
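A minimal sketch of the IQR, built on `statistics.quantiles` from the standard library. The `"inclusive"` method is an assumption for this example; other quartile conventions can give a slightly different IQR.

```python
import statistics


def iqr(values):
    """Interquartile range: Q3 minus Q1 (inclusive quartile method assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative values
print(iqr(data))  # 3.5
```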

Coefficient of Variation (CV)

The coefficient of variation is a relative measure of variability that expresses the standard deviation as a percentage of the mean. This makes it useful for comparing variability between datasets with different means or units. It is meaningful only when the mean is not zero.

Formula: CV = (Standard Deviation / Mean) * 100%
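The CV formula can be sketched as below. This version assumes the sample standard deviation (`statistics.stdev`); the function name and data are illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (sample std assumed)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
cv = coefficient_of_variation(data)
print(f"{cv:.2f}%")  # 42.76%
```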

Skewness

Skewness measures the asymmetry of a distribution about its mean. Its sign indicates whether the longer tail lies to the right or to the left.

A positive skew indicates that the right tail of the distribution is longer (right-skewed), meaning the most extreme values lie above the mean.

A negative skew indicates that the left tail of the distribution is longer (left-skewed), meaning the most extreme values lie below the mean.
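No formula is given for skewness here, so the sketch below assumes one common moment-based definition: the average cubed deviation divided by the average squared deviation raised to the power 1.5. Statistics libraries often apply additional bias corrections, so their results may differ slightly. The function name and data are illustrative.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (one common definition, assumed here)."""
    n = len(values)
    if n == 0:
        raise ValueError("the skewness of an empty dataset is undefined")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # average squared deviation
    m3 = sum((x - mean) ** 3 for x in values) / n  # average cubed deviation
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


print(skewness([1, 2, 3, 4, 5]))        # 0.0 (symmetric data)
print(skewness([1, 1, 2, 2, 10]) > 0)   # True (long right tail)
print(skewness([-10, -2, -2, -1, -1]) < 0)  # True (long left tail)
```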

Kurtosis

Kurtosis measures the "tailedness" of a distribution, that is, how much of its variability comes from extreme values (outliers). It is usually reported as excess kurtosis, which is the kurtosis minus 3, so that a normal distribution scores 0.

A positive excess kurtosis (leptokurtic) indicates heavier tails than a normal distribution, meaning extreme values are more likely.

A negative excess kurtosis (platykurtic) indicates lighter tails than a normal distribution, meaning extreme values are less likely.
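No formula is given for kurtosis here, so this sketch assumes the common moment-based definition of excess kurtosis: the average fourth-power deviation divided by the squared average squared deviation, minus 3 (so that a normal distribution scores 0). Library implementations may add bias corrections. The function name and data are illustrative.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (one common definition, assumed here)."""
    n = len(values)
    if n == 0:
        raise ValueError("the kurtosis of an empty dataset is undefined")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # average squared deviation
    m4 = sum((x - mean) ** 4 for x in values) / n  # average fourth-power deviation
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                            # evenly spread, no outliers
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]  # mostly identical, two outliers

print(excess_kurtosis(flat) < 0)          # True (platykurtic)
print(excess_kurtosis(heavy_tailed) > 0)  # True (leptokurtic)
```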

Covariance

Covariance measures the degree to which two variables change together. It indicates whether the variables have a positive or negative linear relationship.

Formula (Sample Covariance): Cov(X, Y) = Σ((X - Mean(X)) * (Y - Mean(Y))) / (n - 1)

Where:

• Σ represents the summation symbol
• X and Y are variables
• Mean(X) and Mean(Y) are the means of X and Y, respectively
• n is the total number of observations

If the covariance is positive, it indicates a positive relationship (X tends to increase when Y increases).

If the covariance is negative, it indicates a negative relationship (X tends to decrease when Y increases).
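The covariance formula above, with its (n - 1) divisor, can be sketched as follows. The function name and the paired values are made up for this example.

```python
def covariance(xs, ys):
    """Sample covariance of paired observations, dividing by (n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two paired observations are needed")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]
print(covariance(xs, [2, 4, 6, 8, 10]))  # 5.0  (Y rises with X)
print(covariance(xs, [10, 8, 6, 4, 2]))  # -5.0 (Y falls as X rises)
```

The size of the covariance depends on the units of X and Y, so only its sign is easy to interpret on its own.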

Correlation Coefficient (Pearson's r)

The correlation coefficient measures the strength and direction of the linear relationship between two variables. It is a normalized version of covariance that ranges from -1 to 1.

Formula: r = Cov(X, Y) / (Standard Deviation(X) * Standard Deviation(Y))

Where:

• Cov(X, Y) is the covariance between X and Y
• Standard Deviation(X) and Standard Deviation(Y) are the standard deviations of X and Y, respectively, calculated with the same divisor as the covariance (n - 1 for a sample)

If |r| is close to 1, it indicates a strong linear relationship, with positive r indicating a positive correlation and negative r indicating a negative correlation. If |r| is close to 0, it indicates a weak or no linear relationship.
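Here is a minimal sketch of Pearson's r. Because the covariance and both standard deviations share the same divisor, it cancels out, so the code works directly with sums of deviations. The function name and data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two paired observations are needed")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # 1.0 (perfect positive linear relationship)
print(pearson_r(xs, [1, 3, 2, 5, 4]))   # 0.8 (strong but imperfect)
```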
