Machine Learning Algorithm

Simple linear regression is a basic predictive modeling technique that models the relationship between one input variable (X) and one output variable (Y).

  1. The Line Equation

    Y = mX + b
    • Y: Predicted value (dependent variable)
    • X: Input value (independent variable)
    • m: Slope (how much Y changes when X changes)
    • b: Y-intercept (value of Y when X = 0)
  2. Finding Best Fit

    • Uses “least squares” method
    • Minimizes the sum of squared differences between predicted and actual Y values
    • Lower error = better fit
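
A minimal sketch of what the least-squares fit computes for a single feature, using the square-footage data from the house price example below (plain NumPy, no fitting library):

import numpy as np
# Closed-form least squares for one feature: slope and intercept of the best-fit line
X = np.array([1200, 1500, 1800, 2200, 2500], dtype=float)  # square footage
y = np.array([150000, 175000, 210000, 250000, 290000], dtype=float)  # prices
x_mean, y_mean = X.mean(), y.mean()
m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)  # slope
b = y_mean - m * x_mean  # intercept
print(f"Y = {m:.2f} * X + {b:.2f}")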

Predicting house prices based on square footage:

  • X = Square footage (input)
  • Y = House price (prediction)
  • m = Price increase per square foot
  • b = Base price

Good For:

  • One input variable, one output variable
  • Data shows roughly linear pattern
  • Quick insights needed
  • Basic predictions

Limitations:

  • Only handles linear relationships
  • Sensitive to outliers
  • Too simple for complex problems
# Basic implementation using sklearn
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4]]) # Input data
y = np.array([2, 4, 6, 8]) # Output data
model = LinearRegression()
model.fit(X, y)
# Predict new value
prediction = model.predict([[5]])

Real World Example: House Price Prediction

Let’s predict house prices using square footage:

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Sample data
house_data = {
    'sqft': [1200, 1500, 1800, 2200, 2500],
    'price': [150000, 175000, 210000, 250000, 290000]
}
df = pd.DataFrame(house_data)
# Prepare data
X = df[['sqft']].values
y = df['price'].values
# Train model
model = LinearRegression()
model.fit(X, y)
# Get equation components
slope = model.coef_[0]
intercept = model.intercept_
print(f"Price = {slope:.2f} × sqft + {intercept:.2f}")
# Predict price for a 2000 sqft house
new_house = [[2000]]
predicted_price = model.predict(new_house)
print(f"Predicted price for 2000 sqft: ${predicted_price[0]:,.2f}")

What This Shows:

  • Each square foot increases price by a fixed amount (slope)
  • Base price is the intercept
  • Model learns from existing house prices
  • Can predict prices for new houses

Output Example:

Price = 110.23 × sqft + 15000.00
Predicted price for 2000 sqft: $235,460.00

The cost function helps us measure how well our linear regression line fits the data. Think of it as a “wrongness score” - the lower the score, the better the fit.

  1. Mean Squared Error (MSE)

    MSE = (1/n) * Σ(y_actual - y_predicted)²
    • n: Number of data points
    • y_actual: Real value
    • y_predicted: Model’s prediction
    • Σ: Sum everything
  2. Why Square the Errors?

    • Makes all errors positive
    • Penalizes big mistakes more
    • Easier to calculate the minimum
import numpy as np
import matplotlib.pyplot as plt
# Sample data
X = np.array([1, 2, 3, 4, 5])
y_actual = np.array([2, 4, 5, 4, 5])
# Bad fit line
m_bad = 0.5
b_bad = 1
y_bad = m_bad * X + b_bad
# Good fit line
m_good = 0.8
b_good = 1.5
y_good = m_good * X + b_good
# Calculate MSE
mse_bad = np.mean((y_actual - y_bad)**2)
mse_good = np.mean((y_actual - y_good)**2)
print(f"Bad fit MSE: {mse_bad:.2f}")
print(f"Good fit MSE: {mse_good:.2f}")
  1. Start with random slope (m) and intercept (b)
  2. Calculate MSE
  3. Adjust m and b to reduce MSE
  4. Repeat until MSE can’t get lower
from sklearn.metrics import mean_squared_error
# Sample data
X = np.array([[1], [2], [3], [4]])
y_true = np.array([2, 4, 6, 8])
# Train model
model = LinearRegression()
model.fit(X, y_true)
# Make predictions
y_pred = model.predict(X)
# Calculate cost
mse = mean_squared_error(y_true, y_pred)
print(f"Model's MSE: {mse:.2f}")
  • Lower cost = better fit
  • Perfect fit has cost of 0
  • Used to train the model
  • Comparing it on training vs. validation data helps detect overfitting

Gradient descent helps find the best line by gradually adjusting the slope and intercept. Think of it like walking downhill to find the lowest point.

  1. Basic Steps

    For each step:
    1. Calculate current error
    2. Find direction of steepest descent
    3. Take a small step in that direction
    4. Repeat until minimal improvement
  2. Learning Rate (α)

    • Controls step size
    • Too large: might overshoot
    • Too small: takes too long
    • Typical values: 0.01 to 0.1
import numpy as np
def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    m = 0  # Initial slope
    b = 0  # Initial intercept
    n = len(X)  # Number of data points
    for _ in range(epochs):
        # Current predictions
        y_pred = m * X + b
        # Calculate gradients
        dm = (-2/n) * sum(X * (y - y_pred))
        db = (-2/n) * sum(y - y_pred)
        # Update parameters
        m = m - learning_rate * dm
        b = b - learning_rate * db
    return m, b
# Example usage
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
final_m, final_b = gradient_descent(X, y)
print(f"Final equation: y = {final_m:.2f}x + {final_b:.2f}")
  1. Batch Gradient Descent

    • Uses all data points
    • More stable
    • Slower for large datasets
  2. Stochastic Gradient Descent

    • Uses one random point
    • Faster but noisier
    • Better for large datasets
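
A minimal sketch of the stochastic variant just described: parameters are updated from one randomly chosen point at a time instead of the full dataset (an illustration, not an optimized implementation):

import numpy as np
def stochastic_gradient_descent(X, y, learning_rate=0.01, epochs=100):
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        for i in np.random.permutation(n):  # one random point per update
            error = y[i] - (m * X[i] + b)
            m += learning_rate * 2 * error * X[i]
            b += learning_rate * 2 * error
    return m, b
# Example usage with the same toy data as the batch version above
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
m, b = stochastic_gradient_descent(X, y)
print(f"SGD estimate: y = {m:.2f}x + {b:.2f}")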
Gradient descent stops when any of the following holds:

  • Maximum iterations reached
  • Error change is very small
  • Gradient becomes very small
  1. Not Converging

    • Reduce learning rate
    • Normalize input data
    • Check for data issues
  2. Slow Convergence

    • Increase learning rate
    • Use momentum
    • Try different initialization
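
The "normalize input data" fix above can be as simple as standardizing X before running gradient descent (a sketch using the gradient_descent function and X, y defined earlier; the learned parameters are then converted back to the original scale):

# Sketch: standardize X so gradient descent converges faster and tolerates a larger learning rate
X_scaled = (X - X.mean()) / X.std()
m_s, b_s = gradient_descent(X_scaled, y, learning_rate=0.1)
# Convert back to the original scale: y = m*x + b
m_orig = m_s / X.std()
b_orig = b_s - m_s * X.mean() / X.std()
print(f"y = {m_orig:.2f}x + {b_orig:.2f}")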
def gradient_descent_with_stopping(X, y, learning_rate=0.01,
                                   tolerance=1e-6, max_epochs=1000):
    m = b = 0
    prev_cost = float('inf')
    for epoch in range(max_epochs):
        y_pred = m * X + b
        cost = np.mean((y - y_pred) ** 2)
        # Check for convergence
        if abs(prev_cost - cost) < tolerance:
            print(f"Converged at epoch {epoch}")
            break
        # Update parameters
        dm = (-2/len(X)) * sum(X * (y - y_pred))
        db = (-2/len(X)) * sum(y - y_pred)
        m -= learning_rate * dm
        b -= learning_rate * db
        prev_cost = cost
    return m, b
  • Automatically finds best parameters
  • Learning rate is crucial
  • May need multiple runs
  • Works for many ML algorithms

Multiple linear regression predicts an outcome using two or more input variables. Think of it as simple linear regression with more features.

  1. The Equation
    Y = b + m₁X₁ + m₂X₂ + ... + mₙXₙ
    • Y: Predicted value
    • b: Base value (intercept)
    • m₁, m₂, etc.: Coefficients for each feature
    • X₁, X₂, etc.: Input features

Real World Example: House Price Prediction

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Sample data
house_data = {
    'sqft': [1200, 1500, 1800, 2200, 2500],
    'bedrooms': [2, 3, 3, 4, 4],
    'age': [5, 10, 15, 5, 8],
    'price': [150000, 175000, 210000, 250000, 290000]
}
df = pd.DataFrame(house_data)
# Prepare features and target
X = df[['sqft', 'bedrooms', 'age']]
y = df['price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Show coefficients
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: ${coef:,.2f} impact")
print(f"Base price: ${model.intercept_:,.2f}")
# Predict new house
new_house = [[2000, 3, 10]] # 2000 sqft, 3 beds, 10 years old
prediction = model.predict(new_house)
print(f"\nPredicted price: ${prediction[0]:,.2f}")

Good features are:

  • Related to what you’re predicting
  • Independent from each other
  • Actually available in real use
  1. Handle Missing Values

    # Fill missing values
    df.fillna(df.mean(), inplace=True)
  2. Scale Features

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Make predictions
y_pred = model.predict(X_test)
# Calculate metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R² Score: {r2:.2f}")
print(f"RMSE: ${rmse:,.2f}")
  • More features = more complex model
  • Features should be meaningful
  • Watch for multicollinearity
  • Scale features if needed
  • Check model assumptions
  • Assumes linear relationships
  • Sensitive to outliers
  • Can overfit with too many features
  • Features should be independent
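
One rough way to check the "features should be independent" point is to look at pairwise correlations and variance inflation factors (VIF). A sketch, assuming statsmodels is installed (it is also used in the OLS section later) and reusing the feature DataFrame X from the example above:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Pairwise correlations between features
print(X.corr())
# VIF per feature; values above roughly 5-10 suggest problematic multicollinearity
X_const = add_constant(X)
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)}
print(vifs)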

Performance metrics help us understand how well our model is performing. Here are the key metrics for regression models.

  1. Mean Squared Error (MSE)

    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(y_true, y_pred)
    • Measures average squared difference between predictions and actual values
    • Penalizes larger errors more
    • Always positive, lower is better
  2. Root Mean Squared Error (RMSE)

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    • Square root of MSE
    • Same units as target variable
    • Easier to interpret than MSE
  3. R-squared (R²)

    from sklearn.metrics import r2_score
    r2 = r2_score(y_true, y_pred)
    • Shows percentage of variance explained
    • Usually between 0 and 1 (higher is better); it can be negative when the model is worse than predicting the mean
    • 0.7 means model explains 70% of variance
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
def evaluate_model(y_true, y_pred):
    # Calculate metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    # Print results
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"R²: {r2:.2f}")
    print(f"MAE: {mae:.2f}")
    return mse, rmse, r2, mae
# Example usage
y_true = np.array([10, 20, 30, 40, 50])
y_pred = np.array([12, 18, 31, 38, 51])
evaluate_model(y_true, y_pred)
from sklearn.model_selection import cross_val_score
def cv_evaluate(model, X, y, cv=5):
    # Get cross-validation scores
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"CV Scores: {scores}")
    print(f"Mean Score: {scores.mean():.2f}")
    print(f"Std Dev: {scores.std():.2f}")
import matplotlib.pyplot as plt
def plot_predictions(y_true, y_pred):
    plt.scatter(y_true, y_pred)
    plt.plot([y_true.min(), y_true.max()],
             [y_true.min(), y_true.max()],
             'r--', lw=2)
    plt.xlabel('Actual Values')
    plt.ylabel('Predictions')
    plt.title('Actual vs Predicted')
    plt.show()
  1. Use RMSE when:

    • You need error in same units as target
    • Large errors are particularly bad
  2. Use R² when:

    • Explaining model to non-technical people
    • Comparing different models
  3. Use Cross-validation when:

    • Dataset is small
    • Need reliable performance estimate
  • Use multiple metrics
  • Consider your audience
  • Check for overfitting
  • Validate on test data
  • Compare to baseline
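
The "compare to baseline" point is easy to make concrete: scikit-learn's DummyRegressor predicts the training mean, and a useful model should clearly beat it. A sketch, reusing the X_train/X_test split and fitted model from the multiple-regression example above:

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Baseline that always predicts the mean of y_train
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
model_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Baseline RMSE: {baseline_rmse:,.2f}  Model RMSE: {model_rmse:,.2f}")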

These are the three most important error metrics for regression models. Let’s understand each one simply.

MAE = (1/n) * Σ|y_true - y_pred|

What it means:

  • Average of absolute differences between predictions and actual values
  • Easier to understand
  • All errors weighted equally
  • Same unit as your data
from sklearn.metrics import mean_absolute_error
# Example
y_true = [10, 20, 30]
y_pred = [12, 18, 35]
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae}") # Shows average error in original units
MSE = (1/n) * Σ(y_true - y_pred)²

What it means:

  • Square the errors before averaging
  • Penalizes large errors more
  • Units are squared (if predicting dollars, MSE is in dollars²)
  • Always positive
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse}")
RMSE = √MSE

What it means:

  • Square root of MSE
  • Back to original units
  • Still penalizes large errors
  • Most commonly used metric
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse}")
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
def compare_metrics(y_true, y_pred):
    # Calculate all metrics
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print("Example predictions vs actual:")
    for t, p in zip(y_true, y_pred):
        print(f"Actual: {t}, Predicted: {p}, Difference: {abs(t-p)}")
    print(f"\nMAE: {mae:.2f}")
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
# Test with house prices (in thousands)
actual = [200, 300, 400, 500]
predicted = [180, 320, 390, 510]
compare_metrics(actual, predicted)

Use MAE when:

  • You need simple interpretation
  • All errors equally important
  • Outliers are not a big concern

Use MSE when:

  • Large errors are more important
  • You’re training models
  • You don’t need interpretable units

Use RMSE when:

  • You want interpretable units
  • Large errors matter more
  • Comparing different models
  • MAE is most interpretable
  • RMSE is most popular
  • MSE is best for training
  • Always use same metric when comparing models
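
A quick way to see why "large errors matter more" for MSE/RMSE: add one big miss and watch how the metrics react differently (a small sketch with made-up toy numbers):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
# One large error moves RMSE much more than MAE
y_true = np.array([100, 100, 100, 100])
small_misses = np.array([90, 110, 95, 105])   # every prediction is off by 5-10
one_big_miss = np.array([100, 100, 100, 60])  # one prediction is off by 40
for name, y_pred in [("small misses", small_misses), ("one big miss", one_big_miss)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name}: MAE={mae:.1f}, RMSE={rmse:.1f}")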

Understanding when your model learns too much or too little from the data.

  1. Underfitting

    • Model is too simple
    • Doesn’t capture important patterns
    • Poor performance on both training and test data
    • Like memorizing only basic rules
  2. Overfitting

    • Model is too complex
    • Learns noise in training data
    • Great on training data, poor on test data
    • Like memorizing answers instead of understanding
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Generate sample data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3*X + np.sin(X)*2 + np.random.normal(0, 1.5, (100,1))
# Three models
def plot_fits():
    # Underfit: straight line
    underfit = LinearRegression()
    underfit.fit(X, y)
    y_under = underfit.predict(X)
    # Good fit: polynomial degree 3
    good = PolynomialFeatures(degree=3)
    X_good = good.fit_transform(X)
    model_good = LinearRegression().fit(X_good, y)
    y_good = model_good.predict(X_good)
    # Overfit: polynomial degree 15
    overfit = PolynomialFeatures(degree=15)
    X_over = overfit.fit_transform(X)
    model_over = LinearRegression().fit(X_over, y)
    y_over = model_over.predict(X_over)
    # Plot
    plt.scatter(X, y, color='gray', alpha=0.5, label='Data')
    plt.plot(X, y_under, 'r-', label='Underfit')
    plt.plot(X, y_good, 'g-', label='Good fit')
    plt.plot(X, y_over, 'b-', label='Overfit')
    plt.legend()
    plt.show()
plot_fits()
  1. Underfitting Signs:

    • High training error
    • High validation error
    • Model makes very simple predictions
  2. Overfitting Signs:

    • Low training error
    • High validation error
    • Model makes complex, wiggly predictions

For Underfitting:

# Add more features
from sklearn.preprocessing import PolynomialFeatures
# Create more complex features
poly = PolynomialFeatures(degree=2)
X_more_features = poly.fit_transform(X)
# Try more complex model
from sklearn.ensemble import RandomForestRegressor
complex_model = RandomForestRegressor(n_estimators=100)

For Overfitting:

# Add regularization
from sklearn.linear_model import Ridge, Lasso
# L2 regularization
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
# L1 regularization
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
# Use cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
  1. Train/Test Split Check
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train and evaluate
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training score: {train_score:.2f}")
print(f"Testing score: {test_score:.2f}")
  2. Learning Curves
from sklearn.model_selection import learning_curve
def plot_learning_curve(model, X, y):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10))
    plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
    plt.plot(train_sizes, val_scores.mean(axis=1), label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.legend()
    plt.show()
  • Balance is crucial
  • Use validation data
  • Start simple, add complexity slowly
  • Monitor training vs validation performance
  • Use regularization when needed

Linear Regression with Ordinary Least Squares (OLS)

OLS is the most common method to find the best-fitting line in linear regression. It minimizes the sum of squared differences between predictions and actual values.

  1. The Basic Idea

    • Find line that minimizes squared errors
    • Squared errors = (actual - predicted)²
    • Has a mathematical solution (no iteration needed)
  2. The Formula

    β = (X^T X)^(-1) X^T y

    Where:

    • β: Coefficients (slope and intercept)
    • X: Input features
    • y: Target values
    • ^T: Transpose
    • ^(-1): Matrix inverse
import numpy as np
def simple_ols(X, y):
    # Add column of 1s for intercept
    X = np.column_stack([np.ones(len(X)), X])
    # Calculate coefficients
    beta = np.linalg.inv(X.T @ X) @ X.T @ y
    # Return intercept and slope
    return beta[0], beta[1]
# Example usage
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
intercept, slope = simple_ols(X, y)
print(f"y = {slope:.2f}x + {intercept:.2f}")
import statsmodels.api as sm
def detailed_ols(X, y):
    # Add constant
    X = sm.add_constant(X)
    # Fit model
    model = sm.OLS(y, X).fit()
    # Print summary
    print(model.summary().tables[1])
    return model
# Example with house prices
X = np.array([1500, 1800, 2000, 2200, 2500]) # Square footage
y = np.array([150000, 180000, 210000, 220000, 250000]) # Prices
model = detailed_ols(X, y)
from sklearn.linear_model import LinearRegression
def sklearn_ols(X, y):
    # Reshape X if needed
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Fit model
    model = LinearRegression()
    model.fit(X, y)
    print(f"Slope: {model.coef_[0]:.2f}")
    print(f"Intercept: {model.intercept_:.2f}")
    print(f"R² Score: {model.score(X, y):.2f}")
    return model
# Example usage
model = sklearn_ols(X, y)
  1. Linearity

    • Relationship is actually linear
    • Check with scatter plots
  2. Independence

    • Observations are independent
    • No time series patterns
  3. Normality

    • Residuals are normally distributed
    • Check with histogram
  4. Equal Variance

    • Spread of residuals is constant
    • Check with residual plot
def check_assumptions(model, X, y):
    # Get predictions and residuals
    y_pred = model.predict(X)
    residuals = y - y_pred
    # Plot residuals
    plt.figure(figsize=(10, 4))
    # Residual plot
    plt.subplot(121)
    plt.scatter(y_pred, residuals)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.xlabel('Predicted')
    plt.ylabel('Residuals')
    # Histogram of residuals
    plt.subplot(122)
    plt.hist(residuals, bins=20)
    plt.xlabel('Residuals')
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()
  • Simple and fast
  • Has exact solution
  • Works well for linear data
  • Check assumptions
  • Use with small/medium datasets

Regularization helps prevent overfitting by adding a penalty for large coefficients. Think of it as making the model simpler.

  1. Ridge (L2)

    Cost = MSE + α * (sum of squared coefficients)
    • Shrinks coefficients toward zero
    • Never makes them exactly zero
    • Good for handling multicollinearity
  2. Lasso (L1)

    Cost = MSE + α * (sum of absolute coefficients)
    • Can make coefficients exactly zero
    • Good for feature selection
    • Simpler models
from sklearn.linear_model import Ridge, Lasso
import numpy as np
# Sample data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.array([2, 3, 4, 5])
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_)
# Lasso regression
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_)

Real World Example: House Price Prediction

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Prepare data
house_data = {
    'sqft': [1200, 1500, 1800, 2200, 2500],
    'bedrooms': [2, 3, 3, 4, 4],
    'age': [5, 10, 15, 5, 8],
    'price': [150000, 175000, 210000, 250000, 290000]
}
df = pd.DataFrame(house_data)
# Scale features
X = df[['sqft', 'bedrooms', 'age']]
y = df['price']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# Try different alpha values
alphas = [0.1, 1.0, 10.0]
for alpha in alphas:
    # Ridge
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    # Print coefficients
    print(f"\nRidge (alpha={alpha})")
    for name, coef in zip(X.columns, ridge.coef_):
        print(f"{name}: {coef:.2f}")
from sklearn.model_selection import cross_val_score
def find_best_alpha(X, y, alphas):
    best_score = -float('inf')
    best_alpha = None
    for alpha in alphas:
        model = Ridge(alpha=alpha)
        scores = cross_val_score(model, X, y, cv=5)
        avg_score = scores.mean()
        if avg_score > best_score:
            best_score = avg_score
            best_alpha = alpha
    return best_alpha, best_score

Use Ridge when:

  • All features might be important
  • Features are correlated
  • Want to reduce coefficients

Use Lasso when:

  • Need feature selection
  • Want simpler model
  • Some features might be useless
from sklearn.linear_model import ElasticNet
# Combines Ridge and Lasso
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)
  • Prevents overfitting
  • Makes models more stable
  • Scale features first
  • Try different alpha values
  • Use cross-validation

Polynomial regression handles curved relationships by adding powers of X (like X², X³) to linear regression. Think of it as making linear regression flexible enough to fit curves.

  1. Basic Idea
    y = b + m₁x + m₂x² + m₃x³ + ...
    • b: Base value (intercept)
    • x: Input feature
    • x²,x³: Powers of x
    • m₁,m₂,m₃: Coefficients
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
def polynomial_regression(X, y, degree=2):
    # Convert X to polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X.reshape(-1, 1))
    # Fit model
    model = LinearRegression()
    model.fit(X_poly, y)
    return model, poly
# Example usage
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 8, 16, 32]) # Exponential pattern
model, poly = polynomial_regression(X, y, degree=2)
import matplotlib.pyplot as plt
def plot_polynomial_fit(X, y, degree):
    # Fit model
    model, poly = polynomial_regression(X, y, degree)
    # Generate smooth points for curve
    X_smooth = np.linspace(X.min(), X.max(), 100)
    X_smooth_poly = poly.transform(X_smooth.reshape(-1, 1))
    y_smooth = model.predict(X_smooth_poly)
    # Plot
    plt.scatter(X, y, color='blue', label='Data')
    plt.plot(X_smooth, y_smooth, color='red', label=f'Degree {degree}')
    plt.legend()
    plt.show()
    return model
# Example with different degrees
degrees = [1, 2, 3]
for degree in degrees:
    plot_polynomial_fit(X, y, degree)
# Daily temperature data
hours = np.array([0, 4, 8, 12, 16, 20, 24])
temp = np.array([15, 13, 18, 25, 23, 18, 15])
def fit_temperature_curve():
    # Fit polynomial model
    model, poly = polynomial_regression(hours, temp, degree=3)
    # Generate smooth curve
    hours_smooth = np.linspace(0, 24, 100)
    hours_poly = poly.transform(hours_smooth.reshape(-1, 1))
    temp_smooth = model.predict(hours_poly)
    # Plot
    plt.scatter(hours, temp, label='Actual')
    plt.plot(hours_smooth, temp_smooth, 'r-', label='Predicted')
    plt.xlabel('Hour of Day')
    plt.ylabel('Temperature (°C)')
    plt.legend()
    plt.show()
  1. Too Low (Underfitting)

    • Line too rigid
    • Misses important patterns
    • High error on both training and test
  2. Too High (Overfitting)

    • Line too wiggly
    • Fits noise in data
    • Perfect on training, bad on test
def find_best_degree(X, y, max_degree=10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    best_score = -float('inf')
    best_degree = 1
    for degree in range(1, max_degree + 1):
        model, poly = polynomial_regression(X_train, y_train, degree)
        # Transform test data
        X_test_poly = poly.transform(X_test.reshape(-1, 1))
        score = model.score(X_test_poly, y_test)
        if score > best_score:
            best_score = score
            best_degree = degree
    return best_degree, best_score

Good For:

  • Curved relationships
  • Temperature cycles
  • Growth patterns
  • Physical processes

Not Good For:

  • Linear relationships (use simple linear)
  • Too many features
  • Very noisy data
  • Start with low degrees (2 or 3)
  • Check for overfitting
  • Scale features if needed
  • Use cross-validation
  • Balance complexity vs accuracy

A pipeline combines multiple steps (like scaling, polynomial features, and regression) into one clean workflow. Think of it as an assembly line for your data.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
def create_poly_pipeline(degree=2):
    return Pipeline([
        ('scale', StandardScaler()),                  # Step 1: Scale features
        ('poly', PolynomialFeatures(degree=degree)),  # Step 2: Create polynomial features
        ('regression', LinearRegression())            # Step 3: Fit regression
    ])
# Simple usage (ten points so the 5-fold cross-validation below has enough data)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1, 4, 9, 16, 25, 36, 49, 64, 81, 100])  # y = x²
model = create_poly_pipeline(degree=2)
model.fit(X, y)
from sklearn.model_selection import cross_val_score
def find_best_polynomial(X, y, max_degree=5):
    best_score = float('-inf')
    best_degree = 1
    for degree in range(1, max_degree + 1):
        # Create pipeline
        pipeline = create_poly_pipeline(degree)
        # Get cross-validation scores
        scores = cross_val_score(pipeline, X, y, cv=5)
        avg_score = scores.mean()
        print(f"Degree {degree}: Score = {avg_score:.3f}")
        if avg_score > best_score:
            best_score = avg_score
            best_degree = degree
    return best_degree, best_score
# Example usage
best_degree, best_score = find_best_polynomial(X, y)
print(f"\nBest degree: {best_degree}")

Real-World Example: House Price Prediction

def house_price_pipeline():
    # Sample data
    house_data = {
        'size': [1000, 1500, 1200, 1700, 2000],
        'price': [200000, 300000, 250000, 350000, 450000]
    }
    X = np.array(house_data['size']).reshape(-1, 1)
    y = np.array(house_data['price'])
    # Create and fit pipeline
    pipeline = create_poly_pipeline(degree=2)
    pipeline.fit(X, y)
    # Make predictions
    sizes = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
    predictions = pipeline.predict(sizes)
    # Plot results
    plt.scatter(X, y, color='blue', label='Actual')
    plt.plot(sizes, predictions, color='red', label='Predicted')
    plt.xlabel('House Size (sq ft)')
    plt.ylabel('Price ($)')
    plt.legend()
    plt.show()
  1. Cleaner Code

    • All steps in one place
    • No data leakage
    • Easy to reproduce
  2. Automatic Order

    • Steps run in correct sequence
    • No manual data passing
    • Handles transformations automatically
  3. Easy Cross-Validation

    from sklearn.model_selection import GridSearchCV
    # Search for best parameters
    param_grid = {
        'poly__degree': [1, 2, 3, 4],
        'regression__fit_intercept': [True, False]
    }
    grid_search = GridSearchCV(
        create_poly_pipeline(),
        param_grid,
        cv=5
    )
    grid_search.fit(X, y)
  1. Data Scaling

    • StandardScaler
    • MinMaxScaler
    • RobustScaler
  2. Feature Creation

    • PolynomialFeatures
    • Custom transformers
  3. Model Fitting

    • LinearRegression
    • Ridge
    • Lasso
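
These steps are interchangeable inside the same Pipeline pattern. For example, a sketch that swaps in RobustScaler and Ridge (class names as listed above; the alpha value is just a placeholder):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
# Same pipeline pattern, different building blocks
robust_ridge = Pipeline([
    ('scale', RobustScaler()),               # less sensitive to outliers than StandardScaler
    ('poly', PolynomialFeatures(degree=2)),  # feature creation step
    ('ridge', Ridge(alpha=1.0))              # regularized model instead of plain LinearRegression
])
robust_ridge.fit(X, y)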
  • Always scale before polynomial features
  • Use cross-validation to avoid overfitting
  • Start with simple pipelines
  • Add steps as needed
  • Great for reproducibility

Ridge regression prevents overfitting by adding a penalty for large coefficients. Think of it as making the model prefer smaller, more reasonable numbers.

  1. Basic Formula
    Cost = MSE + α * (sum of squared coefficients)
    • MSE: Regular error term
    • α (alpha): Controls penalty strength
    • Higher α = smaller coefficients
from sklearn.linear_model import Ridge
import numpy as np
def ridge_regression(X, y, alpha=1.0):
    # Create and fit model
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    return model
# Example usage
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.array([2, 3, 4, 5])
model = ridge_regression(X, y)
print("Coefficients:", model.coef_)
def plot_ridge_coefficients(X, y):
    alphas = [0.1, 1.0, 10.0, 100.0]
    coefficients = []
    for alpha in alphas:
        model = Ridge(alpha=alpha)
        model.fit(X, y)
        coefficients.append(model.coef_)
    # Plot how coefficients change with alpha
    plt.figure(figsize=(10, 6))
    for i in range(X.shape[1]):
        plt.plot(alphas, [c[i] for c in coefficients],
                 label=f'Feature {i+1}')
    plt.xscale('log')
    plt.xlabel('Alpha')
    plt.ylabel('Coefficient Value')
    plt.legend()
    plt.title('Ridge Coefficients vs Alpha')
    plt.show()

Real-World Example: House Price Prediction

def house_price_ridge():
    # Sample data with multiple features
    data = {
        'size': [1000, 1500, 1200, 1700, 2000],
        'bedrooms': [2, 3, 2, 3, 4],
        'age': [5, 10, 15, 8, 3],
        'price': [200000, 300000, 250000, 350000, 450000]
    }
    # Prepare data
    X = np.array([[s, b, a] for s, b, a in
                  zip(data['size'], data['bedrooms'], data['age'])])
    y = np.array(data['price'])
    # Compare different alphas
    alphas = [0.1, 1.0, 10.0]
    for alpha in alphas:
        model = ridge_regression(X, y, alpha)
        print(f"\nAlpha = {alpha}")
        print("Size impact: ${:,.2f}".format(model.coef_[0]))
        print("Bedroom impact: ${:,.2f}".format(model.coef_[1]))
        print("Age impact: ${:,.2f}".format(model.coef_[2]))
from sklearn.model_selection import cross_val_score
def find_best_alpha(X, y, alphas=(0.1, 1.0, 10.0, 100.0)):
    best_score = -float('inf')
    best_alpha = None
    for alpha in alphas:
        model = Ridge(alpha=alpha)
        scores = cross_val_score(model, X, y, cv=5)
        avg_score = scores.mean()
        print(f"Alpha {alpha}: Score = {avg_score:.3f}")
        if avg_score > best_score:
            best_score = avg_score
            best_alpha = alpha
    return best_alpha, best_score

Good For:

  • Many correlated features
  • All features might be important
  • Want to reduce coefficient size
  • Prevent overfitting

Not Good For:

  • Feature selection (use Lasso instead)
  • Very sparse data
  • When you need exactly zero coefficients
  • Keeps all features
  • Reduces impact of less important features
  • Need to scale features first
  • Choose alpha using cross-validation
  • More stable than Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
def ridge_workflow(X, y):
    # Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=1.0))
    ])
    # Fit and return the pipeline
    pipeline.fit(X, y)
    return pipeline

Lasso regression helps select important features by setting some coefficients to exactly zero. Think of it as a feature selector that removes less important variables.

  1. Basic Formula
    Cost = MSE + α * (sum of absolute coefficients)
    • MSE: Regular error term
    • α (alpha): Controls feature selection
    • Higher α = more coefficients become zero
from sklearn.linear_model import Lasso
import numpy as np
def lasso_regression(X, y, alpha=1.0):
    # Create and fit model
    model = Lasso(alpha=alpha)
    model.fit(X, y)
    return model
# Example usage
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.array([2, 3, 4, 5])
model = lasso_regression(X, y)
print("Coefficients:", model.coef_)
def plot_lasso_path(X, y):
    alphas = np.logspace(-4, 1, 100)
    coefs = []
    for alpha in alphas:
        model = Lasso(alpha=alpha)
        model.fit(X, y)
        coefs.append(model.coef_)
    # Plot coefficient paths
    plt.figure(figsize=(10, 6))
    for feature_idx in range(X.shape[1]):
        plt.plot(alphas, [c[feature_idx] for c in coefs],
                 label=f'Feature {feature_idx+1}')
    plt.xscale('log')
    plt.xlabel('Alpha')
    plt.ylabel('Coefficient Value')
    plt.legend()
    plt.title('Lasso Path: Coefficients vs Alpha')
    plt.show()
def house_price_lasso():
    # Sample data with many features
    data = {
        'size': [1000, 1500, 1200, 1700, 2000],
        'bedrooms': [2, 3, 2, 3, 4],
        'age': [5, 10, 15, 8, 3],
        'bathrooms': [1, 2, 1, 2, 2],
        'garage': [1, 1, 0, 2, 2],
        'price': [200000, 300000, 250000, 350000, 450000]
    }
    # Prepare data
    features = ['size', 'bedrooms', 'age', 'bathrooms', 'garage']
    X = np.array([[data[f][i] for f in features]
                  for i in range(len(data['price']))])
    y = np.array(data['price'])
    # Try different alphas
    alphas = [0.1, 1.0, 10.0]
    for alpha in alphas:
        model = lasso_regression(X, y, alpha)
        print(f"\nAlpha = {alpha}")
        for feature, coef in zip(features, model.coef_):
            if abs(coef) > 0:  # Only show non-zero coefficients
                print(f"{feature}: ${coef:,.2f}")
def identify_important_features(X, y, feature_names, alpha=1.0):
    # Fit Lasso
    model = Lasso(alpha=alpha)
    model.fit(X, y)
    # Get non-zero coefficients
    important_features = []
    for name, coef in zip(feature_names, model.coef_):
        if abs(coef) > 0:
            important_features.append((name, coef))
    # Sort by absolute coefficient value
    important_features.sort(key=lambda x: abs(x[1]), reverse=True)
    return important_features
# Example usage
features = ['size', 'bedrooms', 'age', 'bathrooms', 'garage']
important = identify_important_features(X, y, features)
for feature, impact in important:
    print(f"{feature}: ${impact:,.2f}")

Good For:

  • Feature selection
  • Many irrelevant features
  • Want simpler models
  • Need to identify key variables

Not Good For:

  • Correlated features (use Ridge)
  • When all features matter
  • Small datasets
  • Eliminates unimportant features
  • Produces sparse models
  • Scale features before using
  • Try multiple alpha values
  • Good for feature selection
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
def lasso_workflow(X, y, alphas=(0.1, 1.0, 10.0)):
    # Create pipeline (scaling + Lasso)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('lasso', Lasso(alpha=1.0))
    ])
    # Find the best alpha by cross-validating the Lasso pipeline itself
    scores = [cross_val_score(pipeline.set_params(lasso__alpha=a), X, y, cv=5).mean()
              for a in alphas]
    best_alpha = alphas[int(np.argmax(scores))]
    # Refit on all data with the best alpha
    pipeline.set_params(lasso__alpha=best_alpha)
    pipeline.fit(X, y)
    return pipeline, best_alpha

Elastic Net combines Ridge and Lasso regression to get the best of both worlds. It can both select features and handle correlated variables.

  1. Basic Formula
    Cost = MSE + α * (r * L1 + (1-r) * L2)
    • MSE: Regular error term
    • α: Overall penalty strength
    • r: Mix ratio (1 = Lasso, 0 = Ridge)
    • L1: Sum of absolute coefficients (Lasso)
    • L2: Sum of squared coefficients (Ridge)
from sklearn.linear_model import ElasticNet
import numpy as np
def elastic_net(X, y, alpha=1.0, l1_ratio=0.5):
    # Create and fit model
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X, y)
    return model
# Example usage
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.array([2, 3, 4, 5])
model = elastic_net(X, y)
print("Coefficients:", model.coef_)
def house_price_elastic():
    # Sample data
    data = {
        'size': [1000, 1500, 1200, 1700, 2000],
        'bedrooms': [2, 3, 2, 3, 4],
        'age': [5, 10, 15, 8, 3],
        'bathrooms': [1, 2, 1, 2, 2],
        'price': [200000, 300000, 250000, 350000, 450000]
    }
    # Prepare data
    features = ['size', 'bedrooms', 'age', 'bathrooms']
    X = np.array([[data[f][i] for f in features]
                  for i in range(len(data['price']))])
    y = np.array(data['price'])
    # Try different combinations
    alphas = [0.1, 1.0]
    l1_ratios = [0.2, 0.5, 0.8]
    for alpha in alphas:
        for l1_ratio in l1_ratios:
            model = elastic_net(X, y, alpha, l1_ratio)
            print(f"\nAlpha={alpha}, L1 ratio={l1_ratio}")
            for feature, coef in zip(features, model.coef_):
                print(f"{feature}: ${coef:,.2f}")
from sklearn.model_selection import GridSearchCV
def find_best_params(X, y):
    # Parameter grid
    param_grid = {
        'alpha': [0.1, 0.5, 1.0],
        'l1_ratio': [0.1, 0.5, 0.7, 0.9]
    }
    # Create model
    model = ElasticNet()
    # Grid search
    grid = GridSearchCV(model, param_grid, cv=5)
    grid.fit(X, y)
    print("Best parameters:", grid.best_params_)
    return grid.best_estimator_
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
def elastic_net_pipeline(X, y):
    # Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('elastic', ElasticNet())
    ])
    # Parameter grid
    param_grid = {
        'elastic__alpha': [0.1, 1.0, 10.0],
        'elastic__l1_ratio': [0.1, 0.5, 0.9]
    }
    # Find best parameters
    grid = GridSearchCV(pipeline, param_grid, cv=5)
    grid.fit(X, y)
    return grid.best_estimator_

Good For:

  • Correlated features
  • Feature selection needed
  • Want balance between Ridge and Lasso
  • Medium to large datasets

Not Good For:

  • Very small datasets
  • When you need simple interpretation
  • When pure Ridge or Lasso works well
  • Combines Ridge and Lasso benefits
  • More flexible than either alone
  • Two parameters to tune (α and r)
  • Scale features before using
  • Good default choice for regression
  1. Start with l1_ratio = 0.5
  2. Try different alpha values
  3. Use cross-validation
  4. Scale your features
  5. Check feature importance
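
scikit-learn also provides ElasticNetCV, which runs the alpha/l1_ratio search with cross-validation in one step. A sketch of how it could replace the manual grid above, on a slightly larger toy dataset (extending the earlier y = X₁ + X₂ pattern so 5-fold CV has enough samples):

from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
# Toy data: y = x1 + x2, ten samples
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3], [3, 3],
              [3, 4], [4, 4], [4, 5], [5, 5], [5, 6]])
y = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
# Let ElasticNetCV pick alpha and l1_ratio via cross-validation
enet_cv = Pipeline([
    ('scaler', StandardScaler()),
    ('elastic', ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9], cv=5))
])
enet_cv.fit(X, y)
print("Chosen alpha:", enet_cv.named_steps['elastic'].alpha_)
print("Chosen l1_ratio:", enet_cv.named_steps['elastic'].l1_ratio_)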

Cross-validation helps test how well your model works on new data by splitting your data in different ways.

from sklearn.model_selection import KFold
import numpy as np
def k_fold_example(X, y, k=5):
    # Create K-Fold splitter
    kf = KFold(n_splits=k, shuffle=True)
    scores = []
    for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
        # Split data
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Train and evaluate
        model = LinearRegression()
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        scores.append(score)
        print(f"Fold {fold+1} Score: {score:.3f}")
    print(f"Average Score: {np.mean(scores):.3f}")
from sklearn.model_selection import LeaveOneOut
def leave_one_out_example(X, y):
    # Good for small datasets
    loo = LeaveOneOut()
    scores = []
    for train_idx, test_idx in loo.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model = LinearRegression()
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        scores.append(score)
    return np.mean(scores)
from sklearn.model_selection import StratifiedKFold
def stratified_kfold_example(X, y, k=5):
    # Good for imbalanced classification
    skf = StratifiedKFold(n_splits=k, shuffle=True)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Check distribution
        print(f"Fold {fold+1} distribution:")
        print(f"Train: {np.bincount(y_train)}")
        print(f"Test: {np.bincount(y_test)}\n")
from sklearn.model_selection import TimeSeriesSplit
def time_series_split_example(X, y, n_splits=5):
    # Good for time series data
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        print(f"Fold {fold+1}:")
        print(f"Train: index {min(train_idx)} to {max(train_idx)}")
        print(f"Test: index {min(test_idx)} to {max(test_idx)}\n")
def compare_cv_methods(X, y):
    # Sample data
    X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
    y = np.array([2, 4, 5, 4, 5, 6, 7, 6, 8, 9])
    # 1. K-Fold
    print("K-Fold CV:")
    k_fold_example(X, y)
    # 2. Leave-One-Out
    print("\nLeave-One-Out CV:")
    loo_score = leave_one_out_example(X, y)
    print(f"Score: {loo_score:.3f}")
    # 3. Time Series
    print("\nTime Series CV:")
    time_series_split_example(X, y)
  1. K-Fold (Default Choice)

    • General purpose
    • Medium to large datasets
    • Random data order
  2. Leave-One-Out

    • Very small datasets
    • When you need exact results
    • Computationally expensive
  3. Stratified K-Fold

    • Classification problems
    • Imbalanced classes
    • Need to maintain class ratios
  4. Time Series Split

    • Time series data
    • Sequential data
    • When order matters
from sklearn.model_selection import cross_val_score
def quick_cv(model, X, y, cv_type='kfold', n_splits=5):
    if cv_type == 'kfold':
        cv = KFold(n_splits=n_splits, shuffle=True)
    elif cv_type == 'loo':
        cv = LeaveOneOut()
    elif cv_type == 'stratified':
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True)
    elif cv_type == 'timeseries':
        cv = TimeSeriesSplit(n_splits=n_splits)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"Scores: {scores}")
    print(f"Mean: {scores.mean():.3f}")
    print(f"Std: {scores.std():.3f}")
  • Always shuffle data (except time series)
  • Use stratified for classification
  • K-Fold is good default choice
  • Consider data size and type
  • Check score distribution