
Maths

Statistics is a branch of mathematics dealing with data collection, analysis, interpretation, and presentation. It provides tools for making informed decisions based on data.

  • Descriptive Statistics: Summarizes data using measures like mean, median, mode, and standard deviation.
  • Inferential Statistics: Makes predictions or inferences about a population based on a sample of data.
  • Population 📊

    • Complete dataset
    • Example: All students in a university
    • N = Total size
  • Sample 🔍

    • Subset of population
    • Example: 100 randomly selected students
    • n = Sample size

Key Point: Sample should be representative of the population

  • Measures of Central Tendency are used to describe the central or typical value of a dataset. A short code sketch follows this list.

  • Mean:

    • Average of all values in a dataset
    • Sample Formula: x̄ = (∑x) / n
    • Population Formula: μ = (∑x) / N
    • Example: For [2,4,6,8], Mean = (2+4+6+8)/4 = 5
    • Use when:
      • Data is normally distributed
      • Need a value affected by all data points
    • Limitations:
      • Sensitive to outliers
      • May not represent central value in skewed data
  • Median:

    • Middle value when data is ordered
    • Formula:
      • Odd n: Value at position (n+1)/2
      • Even n: Average of values at n/2 and (n/2)+1
    • Example: For [1,3,5,7,9], Median = 5
    • Use when:
      • Data has outliers
      • Distribution is skewed
    • Limitations:
      • Not influenced by all values
      • Changes less smoothly than mean
  • Mode:

    • Most frequently occurring value
    • Formula: Value with highest frequency
    • Example: For [1,2,2,3,4], Mode = 2
    • Use when:
      • Working with categorical data
      • Need most common value
    • Limitations:
      • Can have multiple modes
      • May not exist if all values occur once
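
As a quick check of the three measures above, here is a minimal sketch (not part of the original notes) using Python's built-in statistics module on the example datasets from the list:

import statistics

values = [2, 4, 6, 8]                      # mean example from above
print(statistics.mean(values))             # 5
print(statistics.median([1, 3, 5, 7, 9]))  # 5
print(statistics.mode([1, 2, 2, 3, 4]))    # 2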

Measures of dispersion describe how spread out data points are from the center; a short code sketch follows the list below. They’re crucial for:

  • Understanding data variability
  • Assessing data reliability
  • Comparing datasets
  • Variance (σ²)

    • Average squared deviation from mean
    • Formula: σ² = Σ(x - μ)²/n
    • Use when:
      • Detailed spread analysis needed
      • Computing statistical tests
    • Limitation: Units are squared
  • Standard Deviation (σ)

    • Square root of variance
    • Formula: σ = √(Σ(x - μ)²/n)
    • Use when:
      • Need spread in original units
      • Analyzing normal distributions
      • Building ML models
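
A minimal sketch, assuming NumPy, that computes both measures for a small illustrative dataset (population formulas, i.e. dividing by n):

import numpy as np

data = np.array([2, 4, 6, 8])
variance = np.var(data)   # σ² = Σ(x - μ)²/n = 5.0
std_dev = np.std(data)    # σ = √variance ≈ 2.236
print(variance, std_dev)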

Variance measures the average squared distance of data points from their mean, indicating data spread and variability.

  • Definition: Average of squared deviations from mean
  • Population Formula: σ² = Σ(x - μ)²/N
  • Sample Formula: s² = Σ(x - x̄)²/(n-1)

The use of (n-1) instead of n in sample variance is called Bessel’s Correction. Here’s why it matters:

  1. Degrees of Freedom

    • When calculating sample variance, we lose one degree of freedom
    • This happens because we already used one piece of information (sample mean)
    • n-1 accounts for this lost degree of freedom
  2. Bias Correction

    • Sample variance with n tends to underestimate population variance
    • Using n-1 makes the estimator unbiased
    • Formula: s² = Σ(x - x̄)²/(n-1)
  3. Practical Impact

    • More noticeable in small samples
    • Example:
      • n=5: 20% difference
      • n=100: 1% difference
    • Critical for accurate statistical inference
Real-World Example of n-1 in Sample Variance

Imagine a battery manufacturing plant:

Population Data:

  • All 1000 batteries produced in a day
  • True population mean (μ) = 1.5V
  • Population readings vary between 1.3V to 1.7V
  • True population variance = 0.024V²

Sample Test:

  • We test only 5 batteries: [1.3V, 1.4V, 1.5V, 1.6V, 1.7V]
  • Sample mean (x̄) = 1.5V

Variance Calculations:

# Using n (biased)
Variance_n = Σ(x - x̄)² / 5
           = [(1.3-1.5)² + (1.4-1.5)² + (1.5-1.5)² + (1.6-1.5)² + (1.7-1.5)²] / 5
           = 0.02V²  # Underestimates true variance (0.024V²)
# Using n-1 (unbiased)
Variance_n1 = Σ(x - x̄)² / 4
            = 0.10 / 4 = 0.025V²  # Closer to true variance (0.024V²)

Why This Matters:

  • Using n: 0.02V² (off by 0.004V²)
  • Using n-1: 0.025V² (off by 0.001V²)
  • n-1 gives estimate closer to true population variance

This example shows how n-1 provides a better estimate of the true population variance. Note that in real situations, we usually don’t know the true population variance - that’s why we need good estimation methods.

Key Point: Use n-1 for sample variance to get an unbiased estimate of population variance
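
A short sketch, assuming NumPy, that reproduces the battery example with the ddof argument (ddof=0 divides by n, ddof=1 applies Bessel's correction):

import numpy as np

sample = [1.3, 1.4, 1.5, 1.6, 1.7]   # the five tested batteries
biased = np.var(sample)              # divides by n   -> 0.02
unbiased = np.var(sample, ddof=1)    # divides by n-1 -> 0.025
print(f"Biased (n):     {biased:.3f} V²")
print(f"Unbiased (n-1): {unbiased:.3f} V²")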

# Sample of 100 people's weights
mean = 70kg
variance = 25kg² # Standard deviation ≈ 5kg
# Practical Use
size_range = mean ± (2 × √variance)
# = 70 ± 10kg
# = 60kg to 80kg
# Example with weight data
mean = 70kg
std_dev = 5kg
# Coverage ranges
range = 70 ± 5kg = 65-75kg (covers 68%)
range = 70 ± 10kg = 60-80kg (covers 95%) # Most commonly used
range = 70 ± 15kg = 55-85kg (covers 99.7%)

Variables are characteristics that can be measured or categorized. They come in different types:

  • Nominal

    • Categories with no order
    • Example: Colors (red, blue), Gender (male, female)
    • Analysis: Mode, frequency
  • Ordinal

    • Categories with order
    • Example: Education (high school, bachelor’s, master’s)
    • Analysis: Median, percentiles
  • Discrete

    • Countable values
    • Example: Number of children, Test score
    • Analysis: Mean, standard deviation
  • Continuous

    • Infinite possible values
    • Example: Height, Weight, Time
    • Analysis: Mean, standard deviation, correlation
  • Independent Variable (X)

    • Manipulated/controlled variable
    • Example: Study hours
  • Dependent Variable (Y)

    • Outcome variable
    • Example: Test score

Note: Variable type determines which statistical methods to use

A random variable is a function that assigns numerical values to outcomes of a random experiment. A short simulation sketch follows the list below.

  1. Discrete Random Variables

    • Takes countable/finite values
    • Examples:
      • Number of heads in coin flips
      • Count of defective items
    • Properties:
      • Probability Mass Function (PMF)
      • Cumulative Distribution Function (CDF)
  2. Continuous Random Variables

    • Takes infinite possible values
    • Examples:
      • Height of a person
      • Time to complete a task
    • Properties:
      • Probability Density Function (PDF)
      • Cumulative Distribution Function (CDF)
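
A small simulation sketch contrasting the two types (the coin-flip and uniform examples are assumptions chosen only for illustration):

import numpy as np

# Discrete random variable: number of heads in 2 fair coin flips
rng = np.random.default_rng(0)
heads = rng.binomial(n=2, p=0.5, size=10_000)
values, counts = np.unique(heads, return_counts=True)
pmf = counts / counts.sum()
print(dict(zip(values.tolist(), pmf.round(3))))   # roughly {0: 0.25, 1: 0.5, 2: 0.25}

# Continuous random variable: any exact value has probability ~0,
# so we look at intervals instead (share of samples falling in [0.4, 0.6])
uniform_samples = rng.uniform(0, 1, size=10_000)
print(np.mean((uniform_samples >= 0.4) & (uniform_samples <= 0.6)))  # ≈ 0.2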

A histogram is a graphical representation of the distribution of data. It shows the frequency of each data point in a dataset.

  • Definition: Bar chart showing frequency of data points
  • Purpose: Visualize data distribution
  • Components:
    • X-axis: Data range (bins)
    • Y-axis: Frequency (count or percentage)
    • Bars: Represents frequency of data in each bin
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
plt.figure(figsize=(8, 4))
plt.hist(data, bins=5, edgecolor='black')
plt.title('Simple Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

Histogram Output

Percentiles and quartiles are measures that divide a dataset into equal portions.

  • Divides data into 100 equal parts
  • Pth percentile: Value below which P% of observations fall
  • Common uses:
    • 50th percentile = median
    • Used in standardized testing (e.g., “90th percentile”)
  • Divides data into 4 equal parts
  • Q1 (25th percentile): First quartile
  • Q2 (50th percentile): Median
  • Q3 (75th percentile): Third quartile
  • IQR (Interquartile Range) = Q3 - Q1
import numpy as np
data = [2, 4, 6, 8, 10, 12, 14, 16]
# Quartiles
Q1 = np.percentile(data, 25) # = 5.5
Q2 = np.percentile(data, 50) # = 9.0
Q3 = np.percentile(data, 75) # = 12.5
IQR = Q3 - Q1 # = 7.0
# Any percentile
p90 = np.percentile(data, 90) # 14.6

The 5 number summary is a quick way to describe the distribution of a dataset. It consists of the minimum, first quartile, median, third quartile, and maximum.

  • Minimum: Smallest value in the dataset
  • First Quartile (Q1): 25th percentile
  • Median: 50th percentile
  • Third Quartile (Q3): 75th percentile
  • Maximum: Largest value in the dataset
import numpy as np
data = [2, 4, 6, 8, 10, 12, 14, 16]
# 5 Number Summary
summary = np.percentile(data, [0, 25, 50, 75, 100])
print(summary) # [ 2.   5.5  9.  12.5 16. ]

Outliers are values that fall outside the range of the 5 number summary.

  • Lower Outlier: Below (Q1 - 1.5 * IQR)
  • Upper Outlier: Above (Q3 + 1.5 * IQR)
  • Interquartile Range (IQR) = Q3 - Q1
import numpy as np
# Sample dataset
data = [1, 2, 2, 3, 4, 10, 20, 25, 30]
# Calculate 5 number summary
min_val = np.min(data)
q1 = np.percentile(data, 25)
median = np.percentile(data, 50)
q3 = np.percentile(data, 75)
max_val = np.max(data)
# Calculate IQR and outlier bounds
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Find outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(f"5 Number Summary:")
print(f"Min: {min_val}")
print(f"Q1: {q1}")
print(f"Median: {median}")
print(f"Q3: {q3}")
print(f"Max: {max_val}")
print(f"Outliers: {outliers}")

Covariance measures how two variables change together. It indicates the direction of the linear relationship between variables.

  • Population Covariance: σxy = Σ((x - μx)(y - μy))/N
  • Sample Covariance: sxy = Σ((x - x̄)(y - ȳ))/(n-1)
  • Positive covariance: Variables tend to move in same direction
  • Negative covariance: Variables tend to move in opposite directions
  • Zero covariance: No linear relationship
import numpy as np
# Height (cm) and Weight (kg) data
height = np.array([170, 175, 160, 180, 165, 172])
weight = np.array([65, 70, 55, 80, 60, 68])
# Calculate means
height_mean = np.mean(height)
weight_mean = np.mean(weight)
# Calculate covariance manually
n = len(height)
covariance = sum((height - height_mean) * (weight - weight_mean)) / (n-1)
# Using NumPy
cov_matrix = np.cov(height, weight)
covariance_np = cov_matrix[0,1]
print(f"Manual Covariance: {covariance:.2f}")
print(f"NumPy Covariance: {covariance_np:.2f}")
# Output:
# Manual Covariance: 60.67
# NumPy Covariance: 60.67
  1. Covariance range: -∞ to +∞
  2. Scale-dependent (affected by units)
  3. Used in:
    • Principal Component Analysis
    • Portfolio optimization
    • Feature selection in ML
  1. Not standardized (hard to compare)
  2. Units are product of both variables

For standardized measurement, use correlation instead.

Correlation measures the strength and direction of the linear relationship between two variables. Unlike covariance, it’s standardized between -1 and +1.

The most common correlation measure is Pearson’s r:

  • Population Correlation: ρxy = Σ((x - μx)(y - μy))/(σxσy)
  • Sample Correlation: r = Σ((x - x̄)(y - ȳ))/√[Σ(x - x̄)²Σ(y - ȳ)²] = Cov(x,y)/(σxσy)
  • r = 1: Perfect positive correlation
  • r = -1: Perfect negative correlation
  • r = 0: No linear correlation
  • |r| > 0.7: Strong correlation
  • 0.3 < |r| < 0.7: Moderate correlation
  • |r| < 0.3: Weak correlation
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.array([1, 2, 3, 4, 5])
y1 = np.array([2, 4, 6, 8, 10]) # Perfect positive
y2 = np.array([10, 8, 6, 4, 2]) # Perfect negative
y3 = np.array([2, 5, 4, 5, 3]) # Weak correlation
def plot_correlation(x, y, title):
    plt.scatter(x, y)
    plt.title(f'{title} (r = {np.corrcoef(x, y)[0, 1]:.2f})')
    plt.xlabel('X')
    plt.ylabel('Y')
# Create subplots
plt.figure(figsize=(15, 5))
plt.subplot(131)
plot_correlation(x, y1, "Perfect Positive")
plt.subplot(132)
plot_correlation(x, y2, "Perfect Negative")
plt.subplot(133)
plot_correlation(x, y3, "Weak")
plt.tight_layout()
plt.show()
# Calculate correlations
print(f"Positive correlation: {np.corrcoef(x,y1)[0,1]:.2f}") # 1.00
print(f"Negative correlation: {np.corrcoef(x,y2)[0,1]:.2f}") # -1.00
print(f"Weak correlation: {np.corrcoef(x,y3)[0,1]:.2f}") # 0.24

Correlation

  1. Scale-independent (standardized)
  2. Always between -1 and +1
  3. No units
  4. Symmetric: corr(x,y) = corr(y,x)
  • Feature selection in ML
  • Financial portfolio analysis
  • Scientific research
  • Quality control
  1. Only measures linear relationships
  2. Sensitive to outliers
  3. Correlation ≠ causation
  4. Requires numeric data
import pandas as pd
# Student data
data = {
'study_hours': [2, 3, 3, 4, 4, 5, 5, 6],
'test_score': [65, 70, 75, 80, 85, 85, 90, 95]
}
df = pd.DataFrame(data)
# Calculate correlation
correlation = df['study_hours'].corr(df['test_score'])
print(f"Correlation between study hours and test scores: {correlation:.2f}")
# Output: Correlation between study hours and test scores: 0.97
  1. High correlation doesn’t imply causation
  2. Always visualize data - don’t rely solely on correlation coefficient
  3. Consider non-linear relationships
  4. Check for outliers that might affect correlation

Spearman correlation measures monotonic relationships between variables (whether they move in the same direction, regardless of the rate of change).

  1. First convert values to ranks

    • rank(x): Rank values of first variable
    • rank(y): Rank values of second variable
  2. Calculate differences d = rank(x) - rank(y)

  3. Square the differences d² = (rank(x) - rank(y))²

  4. Sum all squared differences Σd² = sum of all d²

  5. Final formula: ρ = 1 - (6 * Σd²)/(n(n² - 1))

Example with numbers: x = [1,2,3], y = [2,1,3]

ranks_x = [1,2,3], ranks_y = [2,1,3]

d = [1-2, 2-1, 3-3] = [-1,1,0]
d² = [1,1,0]
Σd² = 2, n = 3

ρ = 1 - (6 * 2)/(3(9-1)) = 1 - 12/24 = 0.5
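
A short sketch, assuming SciPy, that applies the rank-difference formula to this worked example and cross-checks it against scipy.stats.spearmanr (no ties, so the shortcut formula applies):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3])
y = np.array([2, 1, 3])

rank_x = stats.rankdata(x)   # [1, 2, 3]
rank_y = stats.rankdata(y)   # [2, 1, 3]
d = rank_x - rank_y
n = len(x)
rho_manual = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))
rho_scipy = stats.spearmanr(x, y).correlation

print(rho_manual)  # 0.5
print(rho_scipy)   # 0.5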

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Online Store Example
# X: Product Price
# Y: Number of Sales
price = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
sales = np.array([100, 90, 80, 65, 45, 30, 20, 15, 12, 10])
# Calculate Correlations
pearson = stats.pearsonr(price, sales)[0]
spearman = stats.spearmanr(price, sales)[0]
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(price, sales)
plt.title('Price vs Sales: Non-linear Relationship')
plt.xlabel('Price ($)')
plt.ylabel('Number of Sales')
# Add correlation values
plt.text(20, 30, f'Pearson r: {pearson:.2f}')
plt.text(20, 20, f'Spearman r: {spearman:.2f}')
plt.grid(True)
plt.show()
print(f"Pearson Correlation: {pearson:.2f}") # -0.89
print(f"Spearman Correlation: {spearman:.2f}") # -1.00

Spearman Correlation

Probability is a measure of the likelihood of an event occurring. It’s a fundamental concept in statistics and data science.

Events that cannot occur at the same time.

Formula: P(A or B) = P(A) + P(B)

Example:

  • Rolling a die:
    • P(getting 1 or 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3
    • Events are mutually exclusive since you can’t roll 1 and 2 simultaneously

Events that can occur at the same time.

Formula: P(A or B) = P(A) + P(B) - P(A and B)

Example:

  • Drawing a card:
    • P(getting King or Heart) = P(King) + P(Heart) - P(King of Hearts)
    • = 4/52 + 13/52 - 1/52 = 16/52
    • Events overlap since the King of Hearts is possible (simulated in the sketch below)
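
A quick Monte Carlo sketch of the card example (the deck encoding below is an assumption for illustration: rank 12 = king, suit 0 = hearts):

import numpy as np

rng = np.random.default_rng(1)
n_draws = 100_000
ranks = rng.integers(0, 13, n_draws)   # 0..12
suits = rng.integers(0, 4, n_draws)    # 0..3

p_king_or_heart = np.mean((ranks == 12) | (suits == 0))
print(f"Simulated P(King or Heart): {p_king_or_heart:.3f}")  # ≈ 16/52 ≈ 0.308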

The multiplicative rule calculates probability of multiple events occurring together.

Events where occurrence of one doesn’t affect the other.

Formula: P(A and B) = P(A) × P(B)

Example:

  • Flipping a coin twice:
    • P(2 heads) = P(head1) × P(head2) = 1/2 × 1/2 = 1/4

Events where occurrence of one affects the other.

Formula: P(A and B) = P(A) × P(B|A)

Example:

  • Drawing 2 cards without replacement:
    • P(2 aces) = P(ace1) × P(ace2|ace1)
    • = 4/52 × 3/51 = 1/221 (simulated in the sketch below)
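
A similar Monte Carlo sketch for the dependent-event example (again, the deck encoding is an assumption: cards 0-3 are the four aces):

import numpy as np

rng = np.random.default_rng(2)
trials = 100_000
first = rng.integers(0, 52, trials)
second = rng.integers(0, 51, trials)
second = second + (second >= first)   # skip the first card, so the draw is without replacement

p_two_aces = np.mean((first < 4) & (second < 4))
print(f"Simulated P(two aces): {p_two_aces:.5f}")  # ≈ 1/221 ≈ 0.00452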

A PMF describes the probability distribution of a discrete random variable.

  • Maps each value of discrete random variable to its probability
  • P(X = x) gives probability of X taking value x
  • Sum of all probabilities must equal 1
  • 0 ≤ P(X = x) ≤ 1
  • ∑P(X = x) = 1
  • Only for discrete variables
import numpy as np
import matplotlib.pyplot as plt
# Fair die PMF
outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.array([1/6] * 6)
# Loaded die PMF (favors 6)
loaded_prob = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
# Plot
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.bar(outcomes, probabilities)
plt.title('Fair Die PMF')
plt.xlabel('Outcome')
plt.ylabel('Probability')
plt.subplot(1, 2, 2)
plt.bar(outcomes, loaded_prob)
plt.title('Loaded Die PMF')
plt.xlabel('Outcome')
plt.show()
# Calculate probability of rolling even numbers
fair_even = sum(probabilities[1::2]) # 0.5
loaded_even = sum(loaded_prob[1::2]) # 0.7
print(f"P(Even) Fair Die: {fair_even}") # 0.5
print(f"P(Even) Loaded Die: {loaded_even}") # 0.7

PMF

PMF helps in making probability-based decisions in discrete scenarios like manufacturing defects, customer counts, or game outcomes.

CDF (Cumulative Distribution Function) For Discrete Variables


CDF gives the probability that a random variable X is less than or equal to a value x.

F(x) = P(X ≤ x) = ∑ P(X = t) for all t ≤ x

import numpy as np
import matplotlib.pyplot as plt
# Define probabilities
outcomes = np.arange(1, 7) # [1,2,3,4,5,6]
fair_prob = np.ones(6) / 6 # [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
loaded_prob = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
# Calculate CDFs
fair_cdf = np.cumsum(fair_prob)
loaded_cdf = np.cumsum(loaded_prob)
# Create plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
# Plot fair die CDF
ax1.step(outcomes, fair_cdf, where='post')
ax1.set(title='Fair Die CDF', xlabel='Outcome', ylabel='Cumulative Probability')
ax1.grid(True)
# Plot loaded die CDF
ax2.step(outcomes, loaded_cdf, where='post')
ax2.set(title='Loaded Die CDF', xlabel='Outcome')
ax2.grid(True)
plt.tight_layout()
plt.show()
# Print probabilities for X ≤ 4
print(f"P(X ≤ 4) Fair Die: {fair_cdf[3]:.3f}") # 0.667
print(f"P(X ≤ 4) Loaded Die: {loaded_cdf[3]:.3f}") # 0.400

Discrete CDF

A PDF describes the probability distribution of a continuous random variable.

  1. Impossible to List All Values

    • Continuous variables have infinite possible values
    • Can’t assign individual probabilities like PMF
  2. Zero Individual Probability

    • P(X = x) = 0 for any exact value
    • Example: P(height = exactly 170.000000…cm) = 0
  3. Range Probabilities

    • PDF helps calculate probability over intervals
    • P(a ≤ X ≤ b) = ∫[a to b] f(x)dx
    • Example: P(170 ≤ height ≤ 175)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Height Distribution in Adult Population
mean_height = 170 # cm
std_dev = 10 # cm
# Create height range
heights = np.linspace(140, 200, 100)
# Calculate PDF using normal distribution
pdf = norm.pdf(heights, mean_height, std_dev)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(heights, pdf)
plt.title('Height Distribution in Adult Population')
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
# Calculate probabilities
# Probability of height between 160-180cm
prob_160_180 = norm.cdf(180, mean_height, std_dev) - norm.cdf(160, mean_height, std_dev)
print(f"Probability of height between 160-180cm: {prob_160_180:.2%}") # ≈ 68%
# Probability of height above 190cm
prob_above_190 = 1 - norm.cdf(190, mean_height, std_dev)
print(f"Probability of height above 190cm: {prob_above_190:.2%}") # ≈ 2.3%

Continuous PDF

Density in statistics refers to how tightly packed data points or probability is in a given interval or region.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
mean = 170
std = 10
# Single point density doesn't tell much
point_density = norm.pdf(170, mean, std)
print(f"Density at 170cm: {point_density:.4f}") # 0.0399
# This number 0.0399 alone is meaningless without context
# What makes sense is comparing densities
heights = [150, 160, 170, 180, 190]
densities = [norm.pdf(h, mean, std) for h in heights]
plt.figure(figsize=(10, 6))
x = np.linspace(140, 200, 1000)
y = norm.pdf(x, mean, std)
plt.plot(x, y)
# Plot points for comparison
for h, d in zip(heights, densities):
    plt.plot(h, d, 'o', label=f'Height={h}cm\nDensity={d:.4f}')
plt.title('Height Distribution - Comparing Densities')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
# Now we can see:
# - 170cm has highest density (most common)
# - 150cm and 190cm have low density (less common)
# - Comparison gives meaning to the numbers

Density

CDF (Cumulative Distribution Function) For Continuous Variables


The Cumulative Distribution Function (CDF) for a continuous random variable X, denoted as F(x), represents the probability that X takes on a value less than or equal to x.

F(x) = P(X ≤ x) = ∫[from -∞ to x] f(t)dt

where f(t) is the probability density function (PDF)

  1. Bounds

    • 0 ≤ F(x) ≤ 1 for all x
    • lim[x→-∞] F(x) = 0
    • lim[x→∞] F(x) = 1
  2. Continuity

    • Right-continuous
    • Monotonically increasing (never decreases)
  3. Probability Calculations

    • P(a < X ≤ b) = F(b) - F(a)
    • P(X > a) = 1 - F(a)
  4. Relationship to PDF

    • F’(x) = f(x) (derivative of CDF is PDF)
    • F(x) is the integral of f(x)
Simple Example: Height Distribution in a Class

Consider heights of students in a class of 100:

Scenario:

  • Heights range from 150cm to 190cm
  • CDF tells us probability of height being less than or equal to a value

Simple Interpretations:

  • F(160) = 0.2 means 20% of students are 160cm or shorter
  • F(170) = 0.5 means 50% of students are 170cm or shorter
  • F(180) = 0.9 means 90% of students are 180cm or shorter

Practical Uses:

  • Finding median height: Where F(x) = 0.5
  • Ordering uniforms: What size covers 80% of students
  • Identifying unusually tall/short: Heights where F(x) < 0.1 or F(x) > 0.9
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Parameters
mean_height = 170 # mean height in cm
std_dev = 10 # standard deviation in cm
n_students = 100
# Generate student heights
np.random.seed(42) # for reproducibility
heights = np.random.normal(mean_height, std_dev, n_students)
# Calculate empirical CDF
heights_sorted = np.sort(heights)
cumulative_prob = np.arange(1, len(heights) + 1) / len(heights)
# Plot
plt.figure(figsize=(10, 6))
# Empirical CDF
plt.plot(heights_sorted, cumulative_prob, 'b-', label='Empirical CDF')
# Add reference lines
plt.axhline(y=0.2, color='r', linestyle='--', alpha=0.5)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.axhline(y=0.9, color='r', linestyle='--', alpha=0.5)
# Labels and title
plt.title('Height Distribution CDF in Class')
plt.xlabel('Height (cm)')
plt.ylabel('Cumulative Probability')
plt.grid(True)
# Find specific values
height_20 = np.percentile(heights, 20)
height_50 = np.percentile(heights, 50)
height_90 = np.percentile(heights, 90)
print(f"20th percentile (F(x) = 0.2): {height_20:.1f}cm") # 162.6cm
print(f"50th percentile (F(x) = 0.5): {height_50:.1f}cm") # 168.7cm
print(f"90th percentile (F(x) = 0.9): {height_90:.1f}cm") # 180.1cm
plt.show()

Continuous CDF

The Bernoulli distribution models binary outcomes - experiments with exactly two possible results (success/failure).

  • Parameter: p (probability of success)
  • Possible Values: x ∈ {0, 1}
    • x = 1 (success): probability = p
    • x = 0 (failure): probability = 1-p
  • PMF: P(X = x) = p^x * (1-p)^(1-x)
  • Mean: E(X) = p
  • Variance: Var(X) = p(1-p)
  • Coin flips (heads/tails)
  • Quality control (defective/non-defective)
  • Email (spam/not spam)
  • Medical tests (positive/negative)
import numpy as np
import matplotlib.pyplot as plt
class BernoulliTrial:
    def __init__(self, p):
        self.p = p

    def pmf(self, x):
        return self.p if x == 1 else (1 - self.p)

    def simulate(self, n_trials):
        return np.random.binomial(n=1, p=self.p, size=n_trials)
# Example: Biased coin (p=0.7)
b = BernoulliTrial(p=0.7)
trials = b.simulate(1000)
# Results
success_rate = np.mean(trials)
print(f"Theoretical probability: 0.7")
print(f"Observed probability: {success_rate:.3f}")
# Visualize
plt.figure(figsize=(8, 4))
plt.bar(['Failure (0)', 'Success (1)'],
[1-success_rate, success_rate])
plt.title('Bernoulli Distribution (p=0.7)')
plt.ylabel('Probability')
plt.ylim(0, 1)

Bernoulli Distribution

# Simple spam detector
class SpamDetector:
    def __init__(self, spam_probability=0.3):
        self.b = BernoulliTrial(spam_probability)

    def classify_email(self):
        return "Spam" if self.b.simulate(1)[0] else "Not Spam"
# Simulate email classification
detector = SpamDetector()
n_emails = 100
classifications = [detector.classify_email() for _ in range(n_emails)]
spam_ratio = classifications.count("Spam") / n_emails
print(f"Classified {spam_ratio:.1%} emails as spam")
// Classified 26.0% emails as spam
  1. Independence: Each trial is independent
  2. Memory-less: Previous outcomes don’t affect next trial
  3. Fixed probability: p remains constant across trials
  • Foundation for Binomial distribution (n Bernoulli trials)
  • Special case of Binomial where n=1
  • Building block for more complex probability models

The Binomial distribution models the number of successes in n independent Bernoulli trials.

  • Parameters:
    • n (number of trials)
    • p (probability of success)
  • PMF: P(X = k) = C(n,k) * p^k * (1-p)^(n-k)
  • Mean: E(X) = np
  • Variance: Var(X) = np(1-p)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
class BinomialDistribution:
    def __init__(self, n, p):
        self.n = n
        self.p = p

    def pmf(self, k):
        return binom.pmf(k, self.n, self.p)

    def simulate(self, n_trials):
        return np.random.binomial(self.n, self.p, n_trials)
# Example: Rolling a fair die 10 times, counting 6s
n, p = 10, 1/6 # 10 rolls, P(6) = 1/6
b = BinomialDistribution(n, p)
# Calculate PMF for all possible values
k = np.arange(0, n+1)
probabilities = [b.pmf(ki) for ki in k]
# Plot
plt.figure(figsize=(10, 5))
plt.bar(k, probabilities)
plt.title(f'Binomial Distribution (n={n}, p={p:.2f})')
plt.xlabel('Number of Successes (k)')
plt.ylabel('Probability')
plt.grid(True, alpha=0.3)
# Expected value and variance
mean = n * p
var = n * p * (1-p)
print(f"Expected number of 6s: {mean:.2f}") # 1.67
print(f"Variance: {var:.2f}") # 1.39

Binomial Distribution

# Manufacturing defect inspection
class QualityControl:
    def __init__(self, batch_size=20, defect_rate=0.05):
        self.binom = BinomialDistribution(batch_size, defect_rate)

    def inspect_batch(self):
        return self.binom.simulate(1)[0]

    def is_batch_acceptable(self, max_defects=2):
        defects = self.inspect_batch()
        return {
            'defects': defects,
            'acceptable': defects <= max_defects
        }
# Simulate batch inspections
qc = QualityControl()
n_batches = 1000
inspections = [qc.is_batch_acceptable() for _ in range(n_batches)]
acceptance_rate = sum(i['acceptable'] for i in inspections) / n_batches
print(f"Batch acceptance rate: {acceptance_rate:.1%}")
// Batch acceptance rate: 92.7%
  1. Quality control in manufacturing
  2. A/B testing success counts
  3. Survey response modeling
  4. Genetic inheritance patterns
  1. Sum of independent Bernoulli trials
  2. Requires fixed probability p
  3. Trials must be independent
  4. Only whole numbers (discrete)

The Poisson distribution models the number of events occurring in a fixed interval when these events happen at a constant average rate and independently of each other.

  • Parameter: λ (lambda) - average number of events per interval
  • PMF: P(X = k) = (λ^k * e^-λ) / k!
  • Mean: E(X) = λ
  • Variance: Var(X) = λ
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
class PoissonDistribution:
    def __init__(self, lambda_param):
        self.lambda_param = lambda_param

    def pmf(self, k):
        return poisson.pmf(k, self.lambda_param)

    def simulate(self, n_samples):
        return np.random.poisson(self.lambda_param, n_samples)
# Example: Website visits per hour (average 5 visits)
lambda_param = 5
p = PoissonDistribution(lambda_param)
# Calculate PMF for values 0 to 12
k = np.arange(0, 13)
probabilities = [p.pmf(ki) for ki in k]
# Visualization
plt.figure(figsize=(10, 6))
plt.bar(k, probabilities)
plt.title(f'Poisson Distribution (λ={lambda_param})')
plt.xlabel('Number of Events (k)')
plt.ylabel('Probability')
plt.grid(True, alpha=0.3)

Poisson Distribution

  1. Customer Service

    • Number of customers arriving per hour
    • Support tickets received per day
    • Phone calls to call center
  2. Web Traffic

    • Page views per minute
    • Server requests per second
    • Error occurrences per day
  3. Quality Control

    • Defects per unit area
    • Flaws per length of material
    • Errors per page
  4. Natural Phenomena

    • Radioactive decay events
    • Mutations in DNA sequence
    • Natural disasters per year
class ServerMonitor:
    def __init__(self, avg_requests_per_minute=30):
        self.poisson = PoissonDistribution(avg_requests_per_minute)

    def simulate_minute(self):
        return self.poisson.simulate(1)[0]

    def check_load(self, threshold=50):
        requests = self.simulate_minute()
        return {
            'requests': requests,
            'overloaded': requests > threshold,
            'utilization': requests / threshold
        }
# Monitor server for an hour
monitor = ServerMonitor()
hour_data = [monitor.check_load() for _ in range(60)]
# Analysis
overloaded_minutes = sum(minute['overloaded'] for minute in hour_data)
avg_utilization = np.mean([minute['utilization'] for minute in hour_data])
print(f"Minutes overloaded: {overloaded_minutes}") // Minutes overloaded: 0
print(f"Average utilization: {avg_utilization:.1%}") // Average utilization: 56.5%
  1. Independence

    • Events occur independently
    • Past events don’t influence future events
  2. Rate Consistency

    • Average rate (λ) remains constant
    • No systematic variation in event frequency
  3. Rare Events

    • Individual events are rare relative to opportunities
    • Many opportunities for events to occur
  4. No Upper Limit

    • Can theoretically take any non-negative integer value
    • Practical limits depend on λ
  1. Binomial Distribution

    • Poisson is the limit of the binomial as n → ∞, p → 0, with np = λ held constant
    • Used when events are rare but opportunities numerous
  2. Exponential Distribution

    • Time between Poisson events follows an exponential distribution
    • If event counts are Poisson(λ), waiting times are exponential with rate λ (mean 1/λ); see the sketch below
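
A small sketch (the parameter values are assumptions for illustration) showing the link between Poisson counts and exponential waiting times:

import numpy as np

rng = np.random.default_rng(3)
lam = 5                                              # average events per unit time
event_times = np.cumsum(rng.exponential(1 / lam, size=10_000))
gaps = np.diff(event_times)

print(f"Mean gap: {gaps.mean():.3f} (expected 1/λ = {1/lam:.3f})")
# Counting events per unit interval recovers a Poisson(λ) variable
counts = np.histogram(event_times, bins=np.arange(0, event_times[-1]))[0]
print(f"Mean count per interval: {counts.mean():.2f} (expected λ = {lam})")
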
  1. Rate Stability

    • Assumes constant average rate
    • May not fit if rate varies significantly
  2. Independence

    • Events must be independent
    • Not suitable for contagious or clustered events
  3. No Simultaneous Events

    • Events occur one at a time
    • May need modifications for concurrent events
  4. Memory-less Property

    • Future events independent of past
    • May not suit events with temporal dependencies
# Example: Testing Poisson assumptions
def test_rate_stability(data, window_size=10):
    """Test if event rate is stable over time"""
    windows = np.array_split(data, len(data) // window_size)
    means = [np.mean(w) for w in windows]
    return np.std(means) / np.mean(means)  # CV should be small
# Generate sample data
p = PoissonDistribution(lambda_param=5)
data = p.simulate(1000)
stability_metric = test_rate_stability(data)
print(f"Rate stability metric: {stability_metric:.3f}")
# Lower values indicate more stable rate
// Rate stability metric: 0.138

The Normal (or Gaussian) distribution is a continuous probability distribution that is symmetric around its mean, showing a characteristic “bell-shaped” curve.

  • Parameters:
    • μ (mean): Center of distribution
    • σ (standard deviation): Spread of distribution
  • PDF: f(x) = (1/(σ√(2π))) * e^(-(x-μ)²/(2σ²))
  • Mean = Median = Mode: All equal to μ
  • 68-95-99.7 Rule:
    • 68% of data within μ ± 1σ
    • 95% of data within μ ± 2σ
    • 99.7% of data within μ ± 3σ
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
class NormalDistribution:
    def __init__(self, mu=0, sigma=1):
        self.mu = mu
        self.sigma = sigma

    def pdf(self, x):
        return norm.pdf(x, self.mu, self.sigma)

    def simulate(self, n_samples):
        return np.random.normal(self.mu, self.sigma, n_samples)
# Example: Height Distribution
mu, sigma = 170, 10 # mean=170cm, std=10cm
normal = NormalDistribution(mu, sigma)
# Generate x values and corresponding probabilities
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 100)
y = normal.pdf(x)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', label='PDF')
# Add standard deviation ranges
for i, pct in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    plt.fill_between(x, y,
                     where=(x >= mu - i*sigma) & (x <= mu + i*sigma),
                     alpha=0.2,
                     label=f'{i}σ ({pct:.1%})')
plt.title('Normal Distribution of Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)

Normal Distribution

  1. Physical Measurements

    • Height, weight
    • Manufacturing dimensions
    • Measurement errors
  2. Natural Phenomena

    • IQ scores
    • Blood pressure
    • Test scores
  3. Financial Markets

    • Stock returns
    • Price fluctuations
    • Risk modeling
class ProductionLine:
    def __init__(self, target_length=100, tolerance=0.5):
        self.normal = NormalDistribution(target_length, tolerance / 3)
        self.tolerance = tolerance

    def produce_item(self):
        length = self.normal.simulate(1)[0]
        return {
            'length': length,
            'in_spec': abs(length - self.normal.mu) <= self.tolerance
        }

    def analyze_batch(self, size=1000):
        batch = [self.produce_item() for _ in range(size)]
        defect_rate = 1 - sum(item['in_spec'] for item in batch) / size
        return f"Defect rate: {defect_rate:.2%}"
# Simulate production
line = ProductionLine()
print(line.analyze_batch()) # Expected ≈ 0.27% defect rate (items beyond ±3σ)

Z-score measures how many standard deviations away from the mean a data point is:

  • Formula: z = (x - μ) / σ
  • Standardizes any normal distribution to N(0,1)
  • Useful for comparing values from different distributions
def calculate_z_score(x, mu, sigma):
    return (x - mu) / sigma
# Example
height = 185 # cm
z = calculate_z_score(height, mu=170, sigma=10)
print(f"Z-score for {height}cm: {z:.2f}") # 1.50
print(f"Percentile: {norm.cdf(z):.2%}") # 93.32%

The CLT states that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the underlying distribution:

def demonstrate_clt(distribution, sample_size, n_samples):
    means = [np.mean(distribution(sample_size))
             for _ in range(n_samples)]
    return means
# Example with uniform distribution
uniform_samples = lambda n: np.random.uniform(0, 1, n)
sample_means = demonstrate_clt(uniform_samples, 30, 1000)
plt.figure(figsize=(10, 6))
plt.hist(sample_means, bins=30, density=True)
plt.title('Sampling Distribution of Mean (n=30)')
plt.xlabel('Sample Mean')
plt.ylabel('Density')

Central Limit Theorem

  1. Symmetry

    • Perfectly symmetric around mean
    • Skewness = 0
    • Kurtosis = 3
  2. Standardization

    • Any normal distribution can be standardized
    • Z-scores allow comparison across distributions
  3. Empirical Rule

    • 68-95-99.7 rule for data distribution
    • Useful for quick probability estimates
  4. Properties

    • Sum of independent normal variables is normal (see the sketch below)
    • Linear combination of normal variables is normal
    • Independent of sample size (unlike t-distribution)
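
A quick simulation sketch (the parameters below are assumptions) checking the sum property: for independent X ~ N(10, 2²) and Y ~ N(5, 3²), X + Y ~ N(15, 13).

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(10, 2, 100_000)
b = rng.normal(5, 3, 100_000)
s = a + b

print(f"Mean of sum: {s.mean():.2f} (expected 15)")
print(f"Std of sum:  {s.std():.2f} (expected √13 ≈ {np.sqrt(13):.2f})")
print("Normality test p-value:", stats.normaltest(s).pvalue)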

The Standard Normal Distribution is a special case of the normal distribution where μ = 0 and σ = 1. It’s often denoted as N(0,1) and serves as a reference distribution.

  • Mean (μ) = 0
  • Standard Deviation (σ) = 1
  • PDF: f(z) = (1/√(2π)) * e^(-z²/2)
  • Symmetric around zero
  • Total area = 1
  • z = ±1: 68% of data
  • z = ±2: 95% of data
  • z = ±3: 99.7% of data
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
def plot_standard_normal():
    z = np.linspace(-4, 4, 100)
    pdf = norm.pdf(z, 0, 1)
    plt.figure(figsize=(10, 6))
    plt.plot(z, pdf, 'b-', label='PDF')
    # Shade regions
    colors = ['red', 'blue', 'green']
    alphas = [0.1, 0.1, 0.1]
    for i, (c, a) in enumerate(zip(colors, alphas), 1):
        mask = (z >= -i) & (z <= i)
        plt.fill_between(z[mask], pdf[mask], color=c, alpha=a,
                         label=f'{i}σ ({norm.cdf(i) - norm.cdf(-i):.1%})')
    plt.title('Standard Normal Distribution')
    plt.xlabel('Z-Score')
    plt.ylabel('Probability Density')
    plt.grid(True)
    plt.legend()
    return plt
# Example usage
plot = plot_standard_normal()
plt.show()

Standard Normal Distribution

# Key probability points
z_scores = {
0.90: 1.28, # 90% confidence
0.95: 1.96, # 95% confidence
0.99: 2.58 # 99% confidence
}
# Example: Finding probabilities
def get_z_probability(z):
    return norm.cdf(z) - norm.cdf(-z)
print(f"P(-1 < Z < 1): {get_z_probability(1):.4f}") # 0.6827
print(f"P(-2 < Z < 2): {get_z_probability(2):.4f}") # 0.9545
print(f"P(-3 < Z < 3): {get_z_probability(3):.4f}") # 0.9973
  1. Standardization

    def standardize(x, mu, sigma):
        return (x - mu) / sigma

    scores = [75, 82, 90, 68, 95]
    mu = np.mean(scores)
    sigma = np.std(scores)
    z_scores = [standardize(x, mu, sigma) for x in scores]

  2. Hypothesis Testing

    def z_test(sample_mean, pop_mean, pop_std, n):
        z = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
        p_value = 2 * (1 - norm.cdf(abs(z)))  # Two-tailed
        return z, p_value

  3. Confidence Intervals

    def confidence_interval(mean, std, n, confidence=0.95):
        z = norm.ppf((1 + confidence) / 2)
        margin = z * (std / np.sqrt(n))
        return mean - margin, mean + margin
  1. Reference for normalizing data
  2. Base for statistical inference
  3. Quality control limits
  4. Risk assessment
  5. Hypothesis testing

To convert between normal distributions:

  • From N(μ,σ) to N(0,1): Z = (X - μ) / σ
  • From N(0,1) to N(μ,σ): X = Zσ + μ
# Example: Converting between distributions
def transform_distribution(x, from_params, to_params):
    """
    Transform value between normal distributions
    from_params: tuple of (mean, std) of original distribution
    to_params: tuple of (mean, std) of target distribution
    """
    from_mean, from_std = from_params
    to_mean, to_std = to_params
    # First standardize
    z = (x - from_mean) / from_std
    # Then transform to new distribution
    return z * to_std + to_mean
# Example usage
x = 85 # Score from N(75, 10)
new_x = transform_distribution(x, (75, 10), (100, 15))
print(f"Score of {x} transforms to {new_x:.1f}")
// Score of 85 transforms to 115.0

The Uniform Distribution is a probability distribution where all outcomes in a given interval are equally likely to occur. It comes in two forms: discrete and continuous.

  1. Continuous Uniform Distribution

    • Defined over continuous interval [a,b]
    • Also called “rectangular distribution”
    • Every point in interval has equal probability density
  2. Discrete Uniform Distribution

    • Defined over finite set of equally spaced values
    • Each value has equal probability
    • Example: Fair die (values 1-6)
  1. Continuous Uniform

    • PDF: f(x) = 1/(b-a) for a ≤ x ≤ b, 0 otherwise
    • CDF: F(x) = (x-a)/(b-a) for a ≤ x ≤ b
    • Mean: μ = (a + b)/2
    • Variance: σ² = (b - a)²/12
    • Median: (a + b)/2
    • Mode: Any value in [a,b]
    • Skewness: 0 (symmetric)
    • Kurtosis: 9/5 (platykurtic)
  2. Discrete Uniform

    • PMF: P(X = x) = 1/n for each x in {x₁, ..., xₙ}
    • Mean: (x₁ + xₙ)/2
    • Variance: (n² - 1)/12 for n consecutive integer values (e.g. a fair die; see the sketch below)
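
A minimal fair-die sketch (an assumed example) for the discrete uniform formulas above:

import numpy as np

faces = np.arange(1, 7)
mean = faces.mean()                        # (1 + 6) / 2 = 3.5
variance = (len(faces)**2 - 1) / 12        # (36 - 1) / 12 ≈ 2.92
print(mean, variance)

# Empirical check by simulation
rng = np.random.default_rng(5)
rolls = rng.integers(1, 7, size=100_000)
print(rolls.mean(), rolls.var())           # ≈ 3.5 and ≈ 2.92
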
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform
class UniformDistribution:
    def __init__(self, a=0, b=1):
        self.a = a
        self.b = b
        self.mean = (a + b) / 2
        self.variance = (b - a)**2 / 12
        self.std = np.sqrt(self.variance)

    def pdf(self, x):
        """Probability Density Function"""
        return np.where((x >= self.a) & (x <= self.b),
                        1 / (self.b - self.a), 0)

    def cdf(self, x):
        """Cumulative Distribution Function"""
        return np.clip((x - self.a) / (self.b - self.a), 0, 1)

    def simulate(self, n_samples):
        """Generate random samples"""
        return np.random.uniform(self.a, self.b, n_samples)
# Visualization of PDF and CDF
def plot_uniform_distribution(a=0, b=1):
    dist = UniformDistribution(a, b)
    x = np.linspace(a - 0.5, b + 0.5, 1000)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    # PDF
    ax1.plot(x, dist.pdf(x))
    ax1.fill_between(x, dist.pdf(x), alpha=0.3)
    ax1.set_title('Probability Density Function')
    ax1.grid(True)
    # CDF
    ax2.plot(x, dist.cdf(x))
    ax2.set_title('Cumulative Distribution Function')
    ax2.grid(True)
    plt.tight_layout()
    plt.show()

plot_uniform_distribution(0, 1)

Uniform Distribution

  1. Random Number Generation
def random_number_generator(a, b, n=1):
    """Generate n random numbers between a and b"""
    uniform = UniformDistribution(a, b)
    return uniform.simulate(n)
# Generate 5 random numbers between 0 and 10
numbers = random_number_generator(0, 10, 5)
print(f"Random numbers: {numbers}")
  1. Simulation of Wait Times
class ServiceSimulator:
    def __init__(self, min_time=5, max_time=15):
        self.uniform = UniformDistribution(min_time, max_time)

    def simulate_service_times(self, n_customers):
        times = self.uniform.simulate(n_customers)
        return {
            'times': times,
            'average_wait': np.mean(times),
            'total_time': np.sum(times)
        }
# Simulate service times for 10 customers
simulator = ServiceSimulator()
results = simulator.simulate_service_times(10)
print(f"Average wait time: {results['average_wait']:.2f} minutes")
// Average wait time: 10.21 minutes
  1. Quality Control Bounds
class QualityControl:
    def __init__(self, target, tolerance):
        self.lower = target - tolerance
        self.upper = target + tolerance
        self.uniform = UniformDistribution(self.lower, self.upper)

    def check_production(self, n_items):
        measurements = self.uniform.simulate(n_items)
        in_spec = np.logical_and(
            measurements >= self.lower,
            measurements <= self.upper
        )
        return {
            'measurements': measurements,
            'pass_rate': np.mean(in_spec),
            'failures': np.sum(~in_spec)
        }
# Check 100 items with target 10 and tolerance ±0.5
qc = QualityControl(target=10, tolerance=0.5)
inspection = qc.check_production(100)
print(f"Pass rate: {inspection['pass_rate']:.1%}")
// Pass rate: 100.0%
  1. Sum of Uniform Variables

    • Sum tends toward normal distribution (CLT)
    • Special case: Irwin-Hall distribution
  2. Order Statistics

    • Expected minimum: a + (b - a)/(n + 1)
    • Expected maximum: a + (b - a)·n/(n + 1)
    • The k-th order statistic of n Uniform(0,1) samples follows a Beta(k, n - k + 1) distribution
def demonstrate_sum_convergence(n_vars=12, n_samples=1000):
    """Demonstrate convergence to normal as we sum uniforms"""
    sums = np.sum([np.random.uniform(0, 1, n_samples)
                   for _ in range(n_vars)], axis=0)
    plt.figure(figsize=(8, 4))
    plt.hist(sums, bins=30, density=True)
    plt.title(f'Sum of {n_vars} Uniform Variables')
    plt.xlabel('Sum')
    plt.ylabel('Density')
    return plt

demonstrate_sum_convergence()

Sum of Uniform Variables

Hypothesis Testing with Uniform Distribution

The uniform distribution is often used in:

  1. Testing random number generators
  2. Goodness-of-fit tests
  3. P-value calculations
def test_uniformity(data, alpha=0.05):
    """
    Kolmogorov-Smirnov test for uniformity
    """
    from scipy import stats
    ks_stat, p_value = stats.kstest(data, 'uniform')
    return {
        'statistic': ks_stat,
        'p_value': p_value,
        'uniform': p_value > alpha
    }
# Test random numbers for uniformity
data = np.random.uniform(0, 1, 1000)
results = test_uniformity(data)
print(f"Uniformity test p-value: {results['p_value']:.3f}")
// Uniformity test p-value: 0.999
  1. When to Use

    • Random sampling
    • Simple probability models
    • Null hypothesis testing
    • Initial approximations
  2. Limitations

    • Assumes equal probability
    • May oversimplify real phenomena
    • Sensitive to interval bounds
  3. Common Mistakes

    • Assuming uniformity without testing
    • Ignoring boundary effects
    • Misinterpreting discrete vs continuous

A Log Normal Distribution describes data where taking the natural log (ln) of the values gives a normal distribution. Think of it as a “skewed bell curve” that can’t go below zero.

  • Models things that grow by percentage (like money or populations)
  • Always positive (can’t have negative values)
  • Shows up naturally in many real-world situations
  • Skewed right (long tail on right side)
  • Can’t be negative
  • Most values cluster near the left
  • Has a few very large values on the right
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm
# Create simple example data
normal_data = np.random.normal(0, 0.5, 1000)
lognormal_data = np.exp(normal_data) # Transform to lognormal
# Plot both to show relationship
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Normal distribution plot
ax1.hist(normal_data, bins=30, density=True)
ax1.set_title('Normal Distribution')
ax1.set_xlabel('Value')
# Log normal distribution plot
ax2.hist(lognormal_data, bins=30, density=True)
ax2.set_title('Log Normal Distribution')
ax2.set_xlabel('Value')
plt.tight_layout()
plt.show()

Log Normal Distribution

  1. House Prices
def simulate_house_prices(median_price=300000, spread=0.5, n_houses=1000):
    """Simulate house prices in a city"""
    mu = np.log(median_price)  # Convert median to log scale
    prices = np.random.lognormal(mu, spread, n_houses)
    return {
        'median': np.median(prices),
        'mean': np.mean(prices),
        'cheapest': np.min(prices),
        'most_expensive': np.max(prices)
    }
# Example
prices = simulate_house_prices()
print(f"Median house: ${prices['median']:,.0f}")
print(f"Average house: ${prices['mean']:,.0f}")
print(f"Price range: ${prices['cheapest']:,.0f} to ${prices['most_expensive']:,.0f}")
// Median house: $293,919
// Average house: $334,266
// Price range: $54,054 to $1,700,471
  1. Investment Growth
def investment_scenarios(initial=10000, years=30, risk=0.15):
    """Simulate possible investment outcomes"""
    annual_return = 0.07  # 7% average return
    # Generate 1000 possible scenarios
    final_amounts = initial * np.random.lognormal(
        (annual_return - risk**2 / 2) * years,
        risk * np.sqrt(years),
        1000
    )
    return {
        'median': np.median(final_amounts),
        'worst_10': np.percentile(final_amounts, 10),
        'best_10': np.percentile(final_amounts, 90)
    }
# Example
results = investment_scenarios()
print(f"Typical outcome: ${results['median']:,.0f}")
print(f"Range: ${results['worst_10']:,.0f} to ${results['best_10']:,.0f}")
// Typical outcome: $60,900
// Range: $19,375 to $173,670

Use log normal when your data:

  1. Can’t be negative (like prices or sizes)
  2. Is skewed right (has a long tail to the right)
  3. Grows by percentages rather than fixed amounts
  • Most values will be below the mean
  • Median is less than mean
  • A few very large values will pull the mean up
  • Multiplying/dividing by a constant shifts the distribution
  1. Using it for negative values (impossible)
  2. Expecting symmetry (it’s always skewed)
  3. Using regular averages (use geometric mean instead)
  4. Forgetting to transform back from log scale
# Example showing common statistics
def log_normal_stats(data):
    """Calculate key statistics for log-normal data"""
    log_data = np.log(data)
    return {
        'median': np.exp(np.mean(log_data)),  # Geometric mean
        'mean': np.mean(data),                # Arithmetic mean
        'typical_range': [
            np.exp(np.mean(log_data) - np.std(log_data)),
            np.exp(np.mean(log_data) + np.std(log_data))
        ]
    }
# Example with salary data
salaries = np.random.lognormal(11, 0.5, 1000) # Generate sample salaries
stats = log_normal_stats(salaries)
print(f"Typical salary (median): ${stats['median']:,.0f}")
print(f"Average salary (mean): ${stats['mean']:,.0f}")
print(f"Typical range: ${stats['typical_range'][0]:,.0f} to ${stats['typical_range'][1]:,.0f}")
// Typical salary (median): $31,623
// Average salary (mean): $34,859
// Typical range: $25,262 to $41,771

Remember: Log normal distributions are perfect for things that grow by percentages (like money) or can’t be negative (like sizes or times).

Power Law Distribution/Pareto Distribution


The Power Law Distribution (also known as Pareto Distribution) describes situations where a small number of items dominate the majority of outcomes. It’s often called the “80/20 rule” - where 80% of effects come from 20% of causes.

  • Long tail distribution (many small values, few very large ones)
  • No typical scale (looks similar at different scales)
  • Formula: P(x) ∝ x^(-α) where α > 0
  • Common α values: 2-3 for natural phenomena
  1. Wealth distribution (few people own most wealth)
  2. City populations (few cities have most people)
  3. Website traffic (few pages get most visits)
  4. Social media followers (few accounts have most followers)
import numpy as np
import matplotlib.pyplot as plt
def plot_power_law(alpha=2, n_samples=1000):
    # Generate power law data
    x = np.random.pareto(alpha, n_samples) + 1
    # Plot on log-log scale
    plt.figure(figsize=(10, 6))
    plt.hist(x, bins=50, density=True, alpha=0.7)
    plt.yscale('log')
    plt.xscale('log')
    plt.title(f'Power Law Distribution (α={alpha})')
    plt.xlabel('Value (log scale)')
    plt.ylabel('Frequency (log scale)')
    plt.grid(True)
    return plt
# Example usage
plot_power_law()
plt.show()

Power Law Distribution

def simulate_website_traffic(n_pages=100):
    """Simulate daily views for website pages"""
    # Generate power law distributed views
    views = np.random.pareto(2, n_pages) * 100
    # Sort and analyze
    sorted_views = np.sort(views)[::-1]  # Descending order
    total_views = np.sum(views)
    # Find 80/20 point
    cumsum = np.cumsum(sorted_views)
    pages_for_80 = np.searchsorted(cumsum, 0.8 * total_views) + 1
    return {
        'top_pages': sorted_views[:5],
        'pages_for_80_percent': pages_for_80,
        'percent_pages': (pages_for_80 / n_pages) * 100
    }
# Example
traffic = simulate_website_traffic()
print(f"Top 5 page views: {traffic['top_pages'].astype(int)}")
print(f"{traffic['pages_for_80_percent']} pages ({traffic['percent_pages']:.1f}%) "
f"generate 80% of traffic")
// Top 5 page views: [449 396 388 277 272]
// 40 pages (40.0%) generate 80% of traffic
  1. Analyzing extreme inequalities
  2. Modeling natural phenomena
  3. Risk assessment
  4. Network analysis
  • No “typical” or “average” value
  • Extreme values are more common than in normal distribution
  • Often indicates self-reinforcing processes
  • Important for risk management (extreme events more likely)

Note: Power laws appear in many natural and social systems. If you see huge differences between largest and smallest values, consider using a power law distribution.

Estimates are predictions or approximations of unknown values in data science. They help us make informed decisions based on available data.

  1. Point Estimate

    • Single value prediction
    • Example: Sample mean (x̄) estimates population mean (μ)
    data = [1, 2, 3, 4, 5]
    point_estimate = np.mean(data) # = 3
  2. Interval Estimate

    • Range of likely values
    • Common form: Confidence Intervals
    def confidence_interval(data, confidence=0.95):
        mean = np.mean(data)
        std = np.std(data, ddof=1)
        margin = 1.96 * (std / np.sqrt(len(data)))  # 95% CI
        return mean - margin, mean + margin
  1. Mean (Average)

    • Estimates central tendency
    • Best for symmetric data
    mean = sum(data) / len(data)
  2. Median

    • Estimates central value
    • Better for skewed data
    median = sorted(data)[len(data)//2]
  3. Sample Variance

    • Estimates data spread
    • Uses n-1 for unbiased estimate
    variance = sum((x - mean)**2 for x in data) / (len(data) - 1)
  1. Unbiased

    • Average estimate equals true value
    • Example: Sample mean is unbiased (see the sketch after this list)
  2. Consistent

    • More data = better estimate
    • Example: Law of large numbers
  3. Efficient

    • Minimum variance among similar estimators
    • Example: Sample mean vs single observation
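
A brief simulation sketch (assumed example, not part of the original notes): the sample mean of a N(10, 3²) population stays centered on 10 (unbiased) while its spread shrinks roughly like 3/√n as n grows (consistent).

import numpy as np

rng = np.random.default_rng(6)
true_mean = 10
for n in [5, 50, 500]:
    estimates = rng.normal(true_mean, 3, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>3}: average estimate = {estimates.mean():.3f}, "
          f"spread of estimate = {estimates.std():.3f}")
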
class SalesEstimator:
    def __init__(self, sales_data):
        self.data = sales_data

    def daily_estimate(self):
        mean = np.mean(self.data)
        ci = confidence_interval(self.data)
        return {
            'point_estimate': mean,
            'confidence_interval': ci,
            'reliability': 'High' if (ci[1] - ci[0]) < mean * 0.2 else 'Low'
        }
# Usage
sales = [100, 120, 80, 95, 110, 105, 90]
estimator = SalesEstimator(sales)
forecast = estimator.daily_estimate()
print(f"Expected sales: {forecast['point_estimate']:.0f}")
print(f"Range: {forecast['confidence_interval'][0]:.0f} to {forecast['confidence_interval'][1]:.0f}")
// Expected sales: 100
// Range: 90 to 110

Key Point: Choose estimators based on your data type and what you’re trying to predict.

Hypothesis testing is a method to make decisions about a population using sample data. It helps determine if an observed effect is statistically significant.

  1. State Hypotheses

    • Null (H₀): No effect/relationship exists
    • Alternative (H₁): Effect/relationship exists
  2. Choose Significance Level (α)

    • Usually 0.05 (5%)
    • Represents acceptable false positive rate
  3. Calculate Test Statistic

    • Based on sample data
    • Common tests: z-test, t-test
  4. Compare p-value

    • If p < α: Reject H₀
    • If p ≥ α: Fail to reject H₀
from scipy import stats
# Test if mean score is different from 70
scores = [72, 75, 68, 77, 69, 71, 74, 73]
# Run t-test
t_stat, p_value = stats.ttest_1samp(scores, 70)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Reject null hypothesis")
else:
print("Fail to reject null hypothesis")
// p-value: 0.0615
// Fail to reject null hypothesis
  1. One-Sample t-test

    • Compare sample mean to known value
    def one_sample_ttest(data, expected_mean, alpha=0.05):
        t_stat, p_val = stats.ttest_1samp(data, expected_mean)
        return {
            'p_value': p_val,
            'significant': p_val < alpha,
            'test_stat': t_stat
        }
  2. Two-Sample t-test

    • Compare means of two groups
    def two_sample_ttest(group1, group2, alpha=0.05):
        t_stat, p_val = stats.ttest_ind(group1, group2)
        return {
            'p_value': p_val,
            'significant': p_val < alpha,
            'test_stat': t_stat
        }
  3. Chi-Square Test

    • Test categorical data relationships
    def chi_square_test(observed, expected, alpha=0.05):
        chi2, p_val = stats.chisquare(observed, expected)
        return {
            'p_value': p_val,
            'significant': p_val < alpha,
            'test_stat': chi2
        }
class DrugEffectTest:
    def __init__(self, treatment_group, control_group):
        self.treatment = treatment_group
        self.control = control_group

    def analyze(self):
        # Run t-test
        result = two_sample_ttest(self.treatment, self.control)
        # Calculate effect size
        effect = np.mean(self.treatment) - np.mean(self.control)
        return {
            'significant': result['significant'],
            'p_value': result['p_value'],
            'effect_size': effect,
            'recommendation': 'Use drug' if (result['significant'] and effect > 0)
                              else 'Need more research'
        }
# Example usage
treatment = [75, 82, 78, 80, 79] # Drug group
control = [70, 71, 73, 69, 72] # Placebo group
test = DrugEffectTest(treatment, control)
result = test.analyze()
print(f"Effect: {result['effect_size']:.1f} units")
print(f"P-value: {result['p_value']:.4f}")
print(f"Recommendation: {result['recommendation']}")
// Effect: 7.8 units
// P-value: 0.0004
// Recommendation: Use drug
  1. P-value Misinterpretation

    • P-value is NOT probability H₀ is true
    • Only shows how rare data is under H₀
  2. Multiple Testing

    • More tests = higher chance of false positives
    • Use Bonferroni correction: α/n for n tests (see the sketch after this list)
  3. Sample Size Issues

    • Too small: May miss real effects
    • Too large: May find tiny, meaningless effects
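
A small sketch (assumed example) of the Bonferroni correction in practice: with 10 tests at α = 0.05, each test is judged against α/10 = 0.005. All groups here are drawn under the null, so the corrected threshold flags false positives less often.

import numpy as np
from scipy import stats

alpha, n_tests = 0.05, 10
rng = np.random.default_rng(7)

p_values = [stats.ttest_1samp(rng.normal(0, 1, 30), 0).pvalue for _ in range(n_tests)]
naive_hits = sum(p < alpha for p in p_values)
corrected_hits = sum(p < alpha / n_tests for p in p_values)
print(f"Significant at α:   {naive_hits}")
print(f"Significant at α/n: {corrected_hits}")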

Key Point: Hypothesis testing helps make decisions but should be used with context and common sense.

P-value helps determine if a result is statistically significant. Think of it as “how surprising is this result if there was no real effect?”

  • Smaller p-value = stronger evidence against null hypothesis
  • Common threshold: p < 0.05 (5% significance level)
  • Range: 0 to 1 (0% to 100% probability)
from scipy import stats
import numpy as np
# Test if coin is fair
flips = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] # 1=heads, 0=tails
heads = sum(flips)
total = len(flips)
# Calculate p-value (two-tailed test)
result = stats.binomtest(heads, total, p=0.5)
p_value = result.pvalue
# Print results
print(f"Number of heads: {heads} out of {total}")
print(f"P-value: {p_value:.3f}")
print(f"Interpretation: {'Reject null hypothesis' if p_value < 0.05 else 'Fail to reject null hypothesis'}")
# In this case:
# H0: Coin is fair (p = 0.5)
# H1: Coin is biased (p ≠ 0.5)
# Since p-value (0.344) > 0.05, we fail to reject H0
// Number of heads: 7 out of 10
// P-value: 0.344
// Interpretation: Fail to reject null hypothesis
  • p < 0.01: Very strong evidence
  • p < 0.05: Strong evidence
  • p < 0.10: Weak evidence
  • p ≥ 0.10: No evidence
  1. P-value is NOT:

    • Probability null hypothesis is true
    • Probability of being wrong
    • Effect size or importance
  2. Small p-value doesn’t mean:

    • Large effect
    • Practical significance
    • Reproducible results

Remember: P-value measures evidence strength, not effect size or practical importance

A Z-test is a statistical test used to determine if there’s a significant difference between a sample mean and a population mean when:

  1. Population standard deviation is known
  2. Sample size is large (n > 30)
z = (x̄ - μ) / (σ / √n)
where:
x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size
import numpy as np
from scipy import stats
def z_test(sample, pop_mean, pop_std):
    """
    Perform one-sample z-test
    Args:
        sample: List of sample values
        pop_mean: Known population mean
        pop_std: Known population standard deviation
    """
    n = len(sample)
    sample_mean = np.mean(sample)
    z_score = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))  # Two-tailed test
    return {
        'z_score': z_score,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
# Example: Test if class scores are different from population
scores = [85, 88, 92, 78, 90, 87, 86, 84, 89, 91] # Sample scores
pop_mean = 82 # Known population mean
pop_std = 5 # Known population standard deviation
result = z_test(scores, pop_mean, pop_std)
print(f"Z-score: {result['z_score']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Significant? {result['significant']}")
// Z-score: 3.16
// P-value: 0.0016
// Significant? True
  • Large sample size (n > 30)
  • Known population standard deviation
  • Testing means (not proportions)
  • Data is normally distributed
class ProductQuality:
    def __init__(self, target_weight=100, pop_std=2):
        self.target = target_weight
        self.pop_std = pop_std

    def test_batch(self, measurements):
        result = z_test(measurements, self.target, self.pop_std)
        return {
            'pass': not result['significant'],  # Pass if no significant difference
            'z_score': result['z_score'],
            'p_value': result['p_value']
        }
# Test a batch of products
weights = [101, 99, 100, 102, 98, 101, 99, 100, 101, 102]
qc = ProductQuality()
test = qc.test_batch(weights)
print(f"Batch {'passed' if test['pass'] else 'failed'} quality check")
print(f"Z-score: {test['z_score']:.2f}")
// Batch passed quality check
// Z-score: 0.47
  1. Assumptions

    • Normal distribution
    • Independent samples
    • Known population standard deviation
  2. Interpretation

    • |Z| > 1.96: Significant at 5% level
    • |Z| > 2.58: Significant at 1% level
    • Larger |Z| = stronger evidence
  3. Limitations

    • Requires known population std
    • Not good for small samples
    • Assumes normality

Note: If population standard deviation is unknown or sample size is small, use t-test instead.

The Student’s t Distribution is similar to the normal distribution but has heavier tails. It’s used when:

  • Sample size is small (n < 30)
  • Population standard deviation is unknown

Key properties:

  • Shape depends on degrees of freedom (df = n-1)
  • Approaches the normal distribution as df increases
  • Used for t-tests and confidence intervals
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

def t_distribution(df, x):
    """Calculate t-distribution probability density"""
    return stats.t.pdf(x, df)

# Create comparison plot
x = np.linspace(-4, 4, 100)
df_values = [1, 5, 30]
plt.figure(figsize=(10, 6))
for df in df_values:
    plt.plot(x, stats.t.pdf(x, df), label=f't (df={df})')
plt.plot(x, stats.norm.pdf(x), label='Normal', linestyle='--')
plt.title('t Distribution vs Normal')
plt.legend()
plt.grid(True)
plt.show()  # display the comparison

t Distribution vs Normal

  1. Small Sample Testing
def t_test(sample, pop_mean):
    """One-sample t-test"""
    t_stat, p_value = stats.ttest_1samp(sample, pop_mean)
    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
# Example
data = [25, 28, 29, 30, 31]
result = t_test(data, 27)
print(f"P-value: {result['p_value']:.4f}")
// P-value: 0.1951
  2. Confidence Intervals
def confidence_interval(data, confidence=0.95):
    """Calculate confidence interval"""
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)  # Standard error of mean
    interval = stats.t.interval(confidence, n-1, mean, sem)
    return interval
# Example
data = [10, 12, 11, 13, 9]
ci = confidence_interval(data)
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
// 95% CI: (9.0, 13.0)
  • Small sample sizes
  • Unknown population standard deviation
  • Testing means or differences
  • Creating confidence intervals
  1. More spread out (heavier tails)
  2. Changes shape with sample size
  3. More conservative (wider intervals)
  4. Better for small samples

Note: As sample size increases (n > 30), t-distribution becomes very close to normal distribution.
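
To see this convergence concretely, compare two-tailed 5% critical values from the t and normal distributions (a quick check using scipy):

from scipy import stats

# 97.5th percentile = two-tailed 5% cutoff
for df in [5, 10, 30, 100]:
    print(f"df={df:3d}: t critical = {stats.t.ppf(0.975, df):.3f}")
print(f"Normal critical = {stats.norm.ppf(0.975):.3f}")
// Critical values shrink from about 2.571 (df=5) toward 1.960 as df grows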

T-tests help determine if there’s a significant difference between means. They’re especially useful when working with small samples (n < 30).

  1. One-Sample T-test

    • Compares sample mean to known value
    from scipy import stats
    # Example: Test if student scores differ from target (75)
    scores = [72, 78, 75, 80, 73]
    t_stat, p_value = stats.ttest_1samp(scores, 75)
  2. Independent T-test

    • Compares means of two independent groups
    # Compare treatment vs control
    treatment = [75, 82, 78, 80]
    control = [70, 71, 73, 69]
    t_stat, p_value = stats.ttest_ind(treatment, control)
  3. Paired T-test

    • Compares before/after measurements
    # Compare before/after scores
    before = [70, 72, 71, 73]
    after = [75, 78, 77, 76]
    t_stat, p_value = stats.ttest_rel(before, after)
import numpy as np
from scipy import stats

def run_ttest(sample_data, expected_mean=0, alpha=0.05):
    """
    Run one-sample t-test
    Args:
        sample_data: List of values
        expected_mean: Value to test against
        alpha: Significance level
    """
    # Calculate t-statistic and p-value
    t_stat, p_value = stats.ttest_1samp(sample_data, expected_mean)
    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'significant': p_value < alpha,
        'mean_difference': np.mean(sample_data) - expected_mean
    }
# Example usage
scores = [85, 82, 88, 84, 86]
result = run_ttest(scores, expected_mean=80)
print(f"T-statistic: {result['t_statistic']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Mean difference: {result['mean_difference']:.1f}")
print(f"Significant? {result['significant']}")
// T-statistic: 5.00
// P-value: 0.0075
// Mean difference: 5.0
// Significant? True
  1. One-Sample T-test

    • Testing against known value
    • Example: Are test scores different from 70?
  2. Independent T-test

    • Comparing two separate groups
    • Example: Does treatment group differ from control?
  3. Paired T-test

    • Before/after measurements
    • Example: Did training improve scores?
  1. T-statistic

    • Larger = stronger evidence
    • Sign shows direction (positive/negative)
  2. P-value

    • < 0.05: Statistically significant
    • ≥ 0.05: Not significant
  3. Effect Size

    • Mean difference shows practical significance
    • Consider alongside p-value
  1. Don’t ignore assumptions:

    • Normal distribution
    • Independent samples
    • Equal variances (for independent t-test); a quick check is sketched after this list
  2. Don’t rely only on p-values:

    def analyze_results(result):
        """Better interpretation of t-test"""
        return {
            'statistical_sig': result['p_value'] < 0.05,
            'practical_sig': abs(result['mean_difference']) > 5,
            'recommendation': 'Consider both statistical and practical significance'
        }
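
One way to check these assumptions before an independent t-test, sketched here with scipy’s Shapiro-Wilk and Levene tests (the two groups below are purely illustrative):

from scipy import stats

def check_ttest_assumptions(group1, group2, alpha=0.05):
    """Rough pre-checks for an independent t-test"""
    _, p_norm1 = stats.shapiro(group1)       # normality of group 1
    _, p_norm2 = stats.shapiro(group2)       # normality of group 2
    _, p_var = stats.levene(group1, group2)  # equality of variances
    return {
        'normal': p_norm1 > alpha and p_norm2 > alpha,
        'equal_variance': p_var > alpha
    }

# Illustrative data
a, b = [75, 82, 78, 80, 79], [70, 71, 73, 69, 72]
checks = check_ttest_assumptions(a, b)
print(checks)
if not checks['equal_variance']:
    # Welch's t-test is the usual fallback when variances differ
    t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)
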
class DrugEffectiveness:
    def __init__(self, treatment_data, control_data):
        self.treatment = treatment_data
        self.control = control_data

    def analyze(self):
        # Run t-test
        t_stat, p_val = stats.ttest_ind(self.treatment, self.control)
        # Calculate effect size
        effect = np.mean(self.treatment) - np.mean(self.control)
        return {
            't_statistic': t_stat,
            'p_value': p_val,
            'effect_size': effect,
            'recommendation': 'Effective' if (p_val < 0.05 and effect > 0) else 'Not effective'
        }
# Example usage
treatment = [95, 92, 98, 94, 96] # Drug group
control = [85, 87, 88, 86, 84] # Placebo group
study = DrugEffectiveness(treatment, control)
results = study.analyze()
print(f"Effect size: {results['effect_size']:.1f} units")
print(f"P-value: {results['p_value']:.4f}")
print(f"Recommendation: {results['recommendation']}")
// Effect size: 9.0 units
// P-value: 0.0001
// Recommendation: Effective

Key Points:

  • Use t-tests for small samples
  • Consider both p-value and effect size
  • Check assumptions before testing
  • Interpret results in context

T-tests and Z-tests are both used to compare means, but they have different use cases and assumptions.

| Feature | Z-test | T-test |
| --- | --- | --- |
| Sample Size | Large (n > 30) | Any size |
| Population σ | Must be known | Can be unknown |
| Distribution | Normal | Student’s t |
| Tail Weight | Light tails | Heavy tails |
import numpy as np
from scipy import stats
def choose_test(sample, pop_mean, pop_std=None):
    """
    Choose and run appropriate test
    Args:
        sample: Data to test
        pop_mean: Population mean to test against
        pop_std: Population standard deviation (if known)
    """
    n = len(sample)
    if n > 30 and pop_std is not None:
        # Use Z-test
        z_score = (np.mean(sample) - pop_mean) / (pop_std / np.sqrt(n))
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
        test_type = 'Z-test'
        stat = z_score
    else:
        # Use T-test
        stat, p_value = stats.ttest_1samp(sample, pop_mean)
        test_type = 'T-test'
    return {
        'test_type': test_type,
        'statistic': stat,
        'p_value': p_value
    }
# Example usage
small_sample = [75, 82, 78, 80, 79] # n = 5
large_sample = np.random.normal(75, 5, 50) # n = 50
pop_std = 5
# Test both samples
small_result = choose_test(small_sample, 70)
large_result = choose_test(large_sample, 70, pop_std)
print(f"Small sample: {small_result['test_type']}")
print(f"Large sample: {large_result['test_type']}")
// Small sample: T-test
// Large sample: Z-test
  1. Use Z-test when:

    • Large sample (n > 30)
    • Known population standard deviation
    • Need exact probability values
  2. Use T-test when:

    • Small sample size
    • Unknown population standard deviation
    • Working with sample statistics
  1. Tail Weight

    • Z-test: Lighter tails (normal distribution)
    • T-test: Heavier tails (more conservative)
  2. Critical Values

    • Z-test: Fixed values (e.g., ±1.96 for 95%)
    • T-test: Varies with sample size (degrees of freedom)
  3. Confidence Intervals

def compare_intervals(data, pop_std=None, confidence=0.95):
    """Compare Z and T confidence intervals"""
    n = len(data)
    mean = np.mean(data)
    z_interval = None
    # Z interval (if pop_std known)
    if pop_std:
        z = stats.norm.ppf((1 + confidence) / 2)
        z_margin = z * (pop_std / np.sqrt(n))
        z_interval = (mean - z_margin, mean + z_margin)
    # T interval
    t_interval = stats.t.interval(confidence, n-1, mean, stats.sem(data))
    return {
        'z_interval': z_interval,
        't_interval': t_interval
    }
  1. Default to T-test unless you:

    • Have large sample AND
    • Know population standard deviation
  2. Consider Sample Size

    • Small (n < 30): Always use t-test
    • Large (n > 30): Either test works, provided the population σ is known for the Z-test
  3. Check Assumptions

    • Normal distribution
    • Independent observations
    • Random sampling

Note: When in doubt, use t-test. It’s more conservative and safer for most situations.

Type I and Type II errors are fundamental concepts in hypothesis testing that help us understand the two ways we can make mistakes.

| Error Type | Definition | Common Name | Example |
| --- | --- | --- | --- |
| Type I | Rejecting true null hypothesis | False Positive | Convicting innocent person |
| Type II | Failing to reject false null hypothesis | False Negative | Missing actual disease |
import numpy as np
from scipy import stats
def test_with_errors(data, null_mean, alpha=0.05):
    """
    Run hypothesis test and explain possible errors
    Args:
        data: Sample data
        null_mean: Null hypothesis mean
        alpha: Significance level (Type I error rate)
    """
    # Run t-test
    t_stat, p_value = stats.ttest_1samp(data, null_mean)
    # Decision
    reject_null = p_value < alpha
    # Explain possible errors
    if reject_null:
        error_type = "Type I error possible (false positive)"
    else:
        error_type = "Type II error possible (false negative)"
    return {
        'p_value': p_value,
        'reject_null': reject_null,
        'possible_error': error_type
    }
# Example
scores = [75, 82, 78, 80, 79]
result = test_with_errors(scores, null_mean=70)
print(f"Decision: {'Reject' if result['reject_null'] else 'Fail to reject'} null")
print(f"Possible error: {result['possible_error']}")
// Decision: Reject null
// Possible error: Type I error possible (false positive)
  1. Type I Error (α)

    • Rejecting H₀ when it’s true
    • Probability = significance level (α)
    • Usually set to 0.05 (5%)
    • More serious in legal/medical contexts
  2. Type II Error (β)

    • Not rejecting H₀ when it’s false
    • Related to test power (1 - β)
    • Affected by:
      • Sample size
      • Effect size
      • Significance level
class MedicalTest:
    def __init__(self, sensitivity=0.95, specificity=0.98):
        self.sensitivity = sensitivity  # True Positive Rate
        self.specificity = specificity  # True Negative Rate

    def test_patient(self, has_disease, n_tests=1000):
        """Simulate medical tests"""
        if has_disease:
            # Type II error rate = 1 - sensitivity
            false_negatives = np.random.binomial(n_tests, 1 - self.sensitivity)
            return {
                'condition': 'Sick',
                'errors': false_negatives,
                'error_type': 'Type II',
                'error_rate': false_negatives / n_tests
            }
        else:
            # Type I error rate = 1 - specificity
            false_positives = np.random.binomial(n_tests, 1 - self.specificity)
            return {
                'condition': 'Healthy',
                'errors': false_positives,
                'error_type': 'Type I',
                'error_rate': false_positives / n_tests
            }
# Example usage
test = MedicalTest()
sick_results = test.test_patient(has_disease=True)
healthy_results = test.test_patient(has_disease=False)
print(f"Type II Error Rate (missed disease): {sick_results['error_rate']:.1%}")
print(f"Type I Error Rate (false alarms): {healthy_results['error_rate']:.1%}")
// Type II Error Rate (missed disease): 5.7%
// Type I Error Rate (false alarms): 2.2%
  1. Error Rate Relationship

    • Decreasing α increases β
    • Decreasing β increases α
    • Can’t minimize both simultaneously (demonstrated in the sketch after this list)
  2. Practical Considerations

    • Cost of each error type
    • Available sample size
    • Required confidence level
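
A rough sketch of this trade-off, using a normal approximation for the power of a two-sided one-sample z-test (the effect size, σ, and n below are illustrative):

import numpy as np
from scipy import stats

def approx_power(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test"""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = abs(effect) * np.sqrt(n) / sigma
    return stats.norm.cdf(shift - z_crit)  # ignores the negligible opposite tail

# Tightening alpha (fewer Type I errors) lowers power, i.e. raises beta (more Type II errors)
for alpha in [0.10, 0.05, 0.01]:
    power = approx_power(effect=2, sigma=5, n=25, alpha=alpha)
    print(f"alpha={alpha:.2f} -> power ≈ {power:.2f}, beta ≈ {1 - power:.2f}")
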
  1. Not Considering Both Errors

    def assess_test_quality(alpha, beta):
        """Evaluate overall test quality"""
        power = 1 - beta
        return {
            'false_positive_rate': alpha,
            'false_negative_rate': beta,
            'power': power,
            'quality': 'Good' if (alpha < 0.05 and power > 0.8) else 'Poor'
        }
  2. Focusing Only on Significance

    • Consider practical significance
    • Balance both error types
    • Account for sample size

Key Point: Always consider both error types and their real-world implications when making decisions based on statistical tests.

Understanding Type I and Type II errors is crucial for making informed decisions in data analysis. By balancing these errors, we can improve the reliability and practicality of our statistical tests.

Remember, the goal is not to eliminate errors entirely, but to make informed decisions with the best available information.

  • Type I Error: Rejecting true null hypothesis
  • Type II Error: Failing to reject false null hypothesis
  • Trade-offs: Can’t minimize both simultaneously
  • Practical Considerations: Cost, sample size, confidence level
  • Common Mistakes: Focusing only on significance

Bayesian statistics is an approach that updates beliefs based on new evidence. It uses probability to express uncertainty about events and parameters.

  1. Prior Probability (Prior)

    • Initial belief before new data
    • Based on previous knowledge
  2. Likelihood

    • Probability of data given parameters
    • How well parameters explain data
  3. Posterior Probability

    • Updated belief after seeing data
    • Combines prior and likelihood

The fundamental formula:

P(A|B) = P(B|A) * P(A) / P(B)
where:
P(A|B) = Posterior
P(B|A) = Likelihood
P(A) = Prior
P(B) = Evidence
def bayes_update(prior_prob, likelihood, evidence):
    """Calculate posterior probability"""
    return (likelihood * prior_prob) / evidence
# Example: Disease testing
prior = 0.01 # 1% have disease
likelihood = 0.95 # Test 95% accurate for sick people
evidence = 0.05 # 5% test positive
posterior = bayes_update(prior, likelihood, evidence)
print(f"Probability of disease given positive test: {posterior:.1%}")
// Probability of disease given positive test: 19.0%
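
The evidence term above (0.05) is taken as given; in practice it comes from the law of total probability. A small sketch assuming a hypothetical 4% false-positive rate:

def total_evidence(prior, sensitivity, false_positive_rate):
    """P(positive test) via the law of total probability"""
    return sensitivity * prior + false_positive_rate * (1 - prior)

prior = 0.01                # 1% prevalence
sensitivity = 0.95          # P(positive | disease)
false_positive_rate = 0.04  # assumed P(positive | no disease)

evidence = total_evidence(prior, sensitivity, false_positive_rate)
posterior = (sensitivity * prior) / evidence
print(f"P(positive) = {evidence:.4f}")             # ≈ 0.0491
print(f"P(disease | positive) = {posterior:.1%}")  # ≈ 19.3%, close to the figure above
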
  1. Spam Detection
class SpamFilter:
    def __init__(self):
        self.word_probs = {
            'money': 0.8,  # P(spam|word)
            'friend': 0.2,
            'buy': 0.7
        }
        self.spam_prior = 0.3  # 30% of emails are spam

    def classify(self, words):
        prob = self.spam_prior
        for word in words:
            if word in self.word_probs:
                prob *= self.word_probs[word]
        return prob > 0.5
# Example
filter = SpamFilter()
email = ['money', 'buy']
is_spam = filter.classify(email)
print(f"Email classified as: {'Spam' if is_spam else 'Not Spam'}")
// Email classified as: Not Spam
  2. A/B Testing
def bayesian_ab_test(a_success, a_total, b_success, b_total, n_samples=100_000):
    """Compare two versions using a Bayesian approach"""
    from scipy import stats
    # Observed success rates
    rate_a = a_success / a_total
    rate_b = b_success / b_total
    # Beta posteriors (uniform prior) and a Monte Carlo estimate of how sure we are
    samples_a = stats.beta(a_success + 1, a_total - a_success + 1).rvs(n_samples)
    samples_b = stats.beta(b_success + 1, b_total - b_success + 1).rvs(n_samples)
    prob_b_better = (samples_b > samples_a).mean()
    confidence = max(prob_b_better, 1 - prob_b_better)
    return {
        'better_version': 'B' if rate_b > rate_a else 'A',
        'confidence': confidence,
        'recommend_switch': confidence > 0.95
    }

# Example
result = bayesian_ab_test(120, 1000, 150, 1000)
print(f"Better version: {result['better_version']}")
print(f"Confidence: {result['confidence']:.1%}")
// Better version: B
// Confidence: ≈ 97-98% (Monte Carlo estimate; varies slightly between runs)
  1. Intuitive Updates

    • Naturally updates beliefs with new data
    • Handles uncertainty better
  2. Prior Knowledge

    • Incorporates existing knowledge
    • More realistic in real-world scenarios
  3. Interpretable Results

    • Gives probability distributions
    • Easier to understand for decisions
  1. Medical Diagnosis

    • Update disease probability with test results
    • Consider patient history (prior)
  2. Machine Learning

    • Parameter estimation
    • Model uncertainty
    • Neural network weights
  3. Risk Assessment

    • Financial decisions
    • Project planning
    • Insurance
class BayesianAnalyzer:
    def __init__(self, prior=0.5):
        self.prior = prior
        self.data = []

    def update(self, new_data, likelihood):
        """Update beliefs with new data"""
        posterior = (likelihood * self.prior) / (
            likelihood * self.prior +
            (1 - likelihood) * (1 - self.prior)
        )
        self.prior = posterior
        self.data.append(new_data)
        return posterior

    def get_confidence(self):
        return f"{self.prior:.1%}"

# Example usage
analyzer = BayesianAnalyzer()
print(f"Initial belief: {analyzer.get_confidence()}")
# Update with new evidence
evidence = [True, True, False, True]
for e in evidence:
    likelihood = 0.8 if e else 0.2
    analyzer.update(e, likelihood)
print(f"Updated belief: {analyzer.get_confidence()}")
// Initial belief: 50.0%
// Updated belief: 94.1%
  1. Start with Prior

    • Use existing knowledge
    • Be explicit about assumptions
  2. Update with Data

    • Use Bayes’ theorem
    • Consider evidence strength
  3. Make Decisions

    • Use posterior probabilities
    • Consider uncertainty

Note: Bayesian statistics helps make better decisions by combining prior knowledge with new evidence in a natural way.

Bayesian statistics is a powerful tool for making decisions based on uncertain data. It provides a natural way to update beliefs with new evidence and can be used in a wide range of applications.

By understanding the basics and practicing with real-world examples, you can start using Bayesian methods to improve your decision-making process.

  • Bayes’ Theorem: P(A|B) = P(B|A) * P(A) / P(B)
  • Prior: Initial belief
  • Likelihood: Data explanation
  • Posterior: Updated belief
  • Common Uses: Medical, ML, risk assessment

A confidence interval shows the likely range for a population value, while margin of error shows how far the estimate might be from the true value.

margin_of_error = z_score * (standard_deviation / √sample_size)
confidence_interval = sample_mean ± margin_of_error
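
A minimal numeric reading of these formulas, assuming an illustrative standard deviation of 10 and a sample of 100 with mean 75:

import numpy as np
from scipy import stats

std_dev = 10
n = 100
sample_mean = 75
confidence = 0.95

z = stats.norm.ppf((1 + confidence) / 2)  # ≈ 1.96 for 95%
margin = z * (std_dev / np.sqrt(n))       # ≈ 1.96 * (10 / 10) = 1.96
print(f"Margin of error: ±{margin:.2f}")
print(f"95% CI: ({sample_mean - margin:.2f}, {sample_mean + margin:.2f})")  # ≈ (73.04, 76.96)
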
import numpy as np
from scipy import stats
def calculate_confidence_interval(data, confidence=0.95):
    """
    Calculate confidence interval for a dataset
    Args:
        data: List of numbers
        confidence: Confidence level (default 95%)
    """
    mean = np.mean(data)
    std_error = stats.sem(data)
    interval = stats.t.interval(confidence, len(data)-1, mean, std_error)
    return {
        'mean': mean,
        'lower': interval[0],
        'upper': interval[1],
        'margin': interval[1] - mean
    }

# Example usage
scores = [72, 75, 68, 77, 69, 71, 74, 73]
ci = calculate_confidence_interval(scores)
print(f"Mean: {ci['mean']:.1f}")
print(f"95% CI: ({ci['lower']:.1f}, {ci['upper']:.1f})")
print(f"Margin of Error: ±{ci['margin']:.1f}")
  • 90% → z = 1.645
  • 95% → z = 1.96 (most common)
  • 99% → z = 2.576
class SurveyAnalyzer:
    def __init__(self, responses, confidence=0.95):
        self.data = responses
        self.confidence = confidence

    def analyze(self):
        ci = calculate_confidence_interval(self.data, self.confidence)
        return {
            'estimate': ci['mean'],
            'margin': ci['margin'],
            'range': f"({ci['lower']:.1f} - {ci['upper']:.1f})",
            'reliability': 'High' if ci['margin'] < 5 else 'Low'
        }
# Example: Customer satisfaction scores (1-10)
scores = [8, 7, 9, 8, 8, 7, 9, 8, 7, 8]
survey = SurveyAnalyzer(scores)
results = survey.analyze()
print(f"Customer satisfaction: {results['estimate']:.1f} ± {results['margin']:.1f}")
print(f"Confidence range: {results['range']}")
  1. Sample Size

    • Larger sample = smaller margin
    • Doubling the sample size reduces the margin by a factor of √2 (demonstrated after this list)
  2. Confidence Level

    • Higher confidence = larger margin
    • 99% CI wider than 95% CI
  3. Population Variability

    • More variable data = larger margin
    • Less consistent = less certain
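
A quick demonstration of the sample-size effect noted above (illustrative σ = 10):

import numpy as np
from scipy import stats

z = stats.norm.ppf(0.975)  # ≈ 1.96
std_dev = 10
# Each doubling of n shrinks the margin by about √2 ≈ 1.41
for n in [50, 100, 200]:
    margin = z * std_dev / np.sqrt(n)
    print(f"n={n:3d}: margin ≈ ±{margin:.2f}")
// n= 50: margin ≈ ±2.77
// n=100: margin ≈ ±1.96
// n=200: margin ≈ ±1.39
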
def required_sample_size(margin_of_error, confidence=0.95, std_dev=0.5):
    """
    Calculate required sample size for desired margin of error
    Args:
        margin_of_error: Desired margin (as decimal)
        confidence: Confidence level (default 95%)
        std_dev: Estimated standard deviation
    """
    z = stats.norm.ppf((1 + confidence) / 2)
    n = ((z * std_dev) / margin_of_error) ** 2
    return int(np.ceil(n))
# Example: For ±3% margin at 95% confidence
sample_size = required_sample_size(0.03)
print(f"Required sample size: {sample_size}")
  1. Interpretation

    • “We are 95% confident the true value is in this range”
    • Not “95% of data falls in this range”
  2. Trade-offs

    • At the same sample size, a narrower CI means a lower confidence level
    • A higher confidence level produces a wider CI
    • Balance precision vs certainty
  3. Common Uses

    • Polls and surveys
    • Quality control
    • Scientific research
    • Medical studies
  1. Report Both

    • Always show mean and margin
    • Example: 75.2 ± 3.4
  2. Consider Context

    • Is margin acceptable for decisions?
    • What’s the cost of being wrong?
  3. Sample Size

    • Get enough data for desired precision
    • Consider cost vs accuracy needs

Remember: Larger sample sizes give more precise estimates (smaller margins of error), but there’s always some uncertainty in real-world measurements.

The Chi-Square Test helps determine if there’s a relationship between categorical variables or if observed data matches expected patterns.

  1. Independence Test

    • Tests if two variables are related
    • Example: Is gender related to voting preference?
  2. Goodness of Fit

    • Tests if data matches expected distribution
    • Example: Are dice rolls fair?
from scipy import stats
import numpy as np
def chi_square_test(observed, expected=None):
    """
    Run chi-square test
    Args:
        observed: Observed frequencies
        expected: Expected frequencies (optional)
    """
    if expected is None:
        # Assume equal distribution
        expected = [sum(observed) / len(observed)] * len(observed)
    chi2, p_value = stats.chisquare(observed, expected)
    return {
        'chi_square': chi2,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
# Example: Test if dice is fair
rolls = [10, 8, 12, 9, 11, 10] # Frequencies of 1-6
result = chi_square_test(rolls)
print(f"P-value: {result['p_value']:.3f}")
print(f"Fair dice? {'No' if result['significant'] else 'Yes'}")
class SurveyAnalyzer:
    def __init__(self, responses):
        self.data = np.array(responses)

    def test_independence(self, var1, var2):
        """Test if two variables are independent"""
        var1, var2 = np.array(var1), np.array(var2)  # allow plain lists as input
        contingency = np.array([
            [np.sum((var1 == i) & (var2 == j)) for j in sorted(set(var2))]
            for i in sorted(set(var1))
        ])
        chi2, p_val, _, _ = stats.chi2_contingency(contingency)
        return {
            'chi_square': chi2,
            'p_value': p_val,
            'related': p_val < 0.05
        }
# Example: Test if education level relates to job satisfaction
education = [1, 2, 2, 3, 2, 1, 3, 2, 2, 1] # 1=HS, 2=College, 3=Graduate
satisfaction = [1, 2, 2, 3, 2, 1, 3, 2, 1, 1] # 1=Low, 2=Med, 3=High
analyzer = SurveyAnalyzer(list(zip(education, satisfaction)))
result = analyzer.test_independence(education, satisfaction)
print(f"Related? {'Yes' if result['related'] else 'No'}")
  1. Independence Test

    • Comparing categorical variables
    • Testing relationships
    • Survey analysis
  2. Goodness of Fit

    • Testing distributions
    • Quality control
    • Validating models
  1. Assumptions

    • Independent observations
    • Large enough sample
    • Expected frequencies > 5 (a quick check is sketched after this list)
  2. Interpretation

    • Small p-value = significant relationship
    • Larger chi-square = stronger evidence against independence (it is not an effect-size measure)
  3. Limitations

    • Only for categorical data
    • Doesn’t show strength/direction
    • Sensitive to sample size
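
One way to check the expected-frequency rule of thumb before trusting the result, sketched with the expected table that scipy’s chi2_contingency returns (the contingency table below is illustrative):

import numpy as np
from scipy import stats

def check_expected_counts(contingency_table, min_expected=5):
    """Check the 'expected frequencies > 5' rule of thumb"""
    chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)
    return {
        'min_expected': expected.min(),
        'assumption_met': bool((expected >= min_expected).all()),
        'p_value': p_val
    }

# Illustrative 2x2 table: rows = group, columns = outcome
table = np.array([[30, 10],
                  [20, 40]])
print(check_expected_counts(table))
# Expected counts here are 20, 20, 30 and 30, so the rule of thumb is satisfied
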
def common_applications():
    return {
        'market_research': [
            'Brand preference vs age group',
            'Product choice vs income level'
        ],
        'quality_control': [
            'Defect patterns',
            'Process consistency'
        ],
        'medical_research': [
            'Treatment effectiveness',
            'Risk factor analysis'
        ]
    }

Remember: Chi-square tests help find patterns in categorical data but don’t tell you about the strength or direction of relationships.