Energy Consumption Forecasting with XGBoost

This document outlines the development pipeline for an energy consumption forecasting model built using XGBoost and time series analysis. The project processes hourly energy consumption data, applies feature engineering, and uses time-based cross-validation to predict future energy usage.

Table of Contents

- Data Preprocessing
- Feature Engineering
- Model Architecture
- Training
- Evaluation
- Best Practices and Future Improvements

Data Preprocessing

The dataset, sourced from Kaggle, contains hourly energy consumption data (PJME_MW) for the PJM Interconnection. The data is preprocessed to ensure quality and compatibility with time series modeling.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('dataset/PJME_hourly.csv')
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)

# Visualize raw data
df.plot(style='.', figsize=(12,6), color=sns.color_palette()[1], title='Hourly Energy Consumption')
plt.show()

# Remove anomalously low readings (below 19,000 MW), which show up as outliers
df = df.query('PJME_MW > 19000').copy()

# Train-test split at 2014-01-01 (ISO format avoids day/month ambiguity)
train = df.loc[df.index < '2014-01-01']
test = df.loc[df.index >= '2014-01-01']

fig, ax = plt.subplots(figsize=(12, 6))
train.plot(ax=ax)
test.plot(ax=ax)
ax.axvline('2014-01-01', color='black', ls='--')
ax.set_title('Train/Test Split')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Note: Visualizing the raw series, the outlier filter, and the train/test split confirms the data is correctly preprocessed before modeling.
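
The 19,000 MW cutoff can be sanity-checked by plotting the distribution of the raw series before filtering; a minimal sketch, assuming the threshold was chosen by visual inspection:

# Histogram of raw consumption values (run before the outlier filter above);
# readings far below the bulk of the distribution are treated as outliers
df['PJME_MW'].plot(kind='hist', bins=200, figsize=(10, 4), title='Distribution of PJME_MW')
plt.axvline(19000, color='red', ls='--')  # candidate cutoff
plt.show()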

Feature Engineering

Feature engineering is critical for capturing temporal patterns in energy consumption. Two types of features are created: calendar features derived from the datetime index, and lag features that look back one, two, and three years:

def create_features(df):
    df = df.copy()
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    df['weekofyear'] = df.index.isocalendar().week
    return df

def add_lags(df):
    target_map = df['PJME_MW'].to_dict()
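    # Offsets are multiples of 364 days (exactly 52 weeks), so each lag
    # lands on the same weekday and hour in a prior year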
    df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
    df['lag2'] = (df.index - pd.Timedelta('728 days')).map(target_map)
    df['lag3'] = (df.index - pd.Timedelta('1092 days')).map(target_map)
    return df

df = create_features(df)
df = add_lags(df)

Insights: Rows earlier than the lag offsets have no historical match and receive NaN lag values, which XGBoost handles natively as missing data.
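
A quick check of the 52-week alignment (illustrative only):

# Subtracting 364 days keeps the weekday and hour fixed
ts = pd.Timestamp('2014-01-01 08:00')
print(ts.dayofweek, (ts - pd.Timedelta('364 days')).dayofweek)  # both print 2 (Wednesday)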

Model Architecture

The model uses XGBoost's XGBRegressor, a gradient-boosted tree estimator for regression tasks. Key parameters include:

import xgboost as xgb

reg = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=1000,
    early_stopping_rounds=50,
    objective='reg:squarederror',  # 'reg:linear' is a deprecated alias for this
    max_depth=3,
    learning_rate=0.01
)

Architecture Insights: The configuration favors many shallow trees (max_depth=3) learned slowly (learning_rate=0.01). n_estimators=1000 is an upper bound; early stopping halts boosting once the validation metric fails to improve for 50 consecutive rounds.

Training

The model is trained using time-based cross-validation with TimeSeriesSplit (5 folds, a one-year validation window of hourly data, and a 24-hour gap between training and validation sets) so that every split respects temporal order. The final model is then trained on the entire dataset for future predictions.

from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=5, test_size=24 * 365, gap=24)  # 1-year validation windows, 24-hour gap
FEATURES = ['dayofyear', 'hour', 'dayofweek', 'quarter', 'month', 'year', 'lag1', 'lag2', 'lag3']
TARGET = 'PJME_MW'

for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    train = create_features(train)
    test = create_features(test)
    X_train = train[FEATURES]
    y_train = train[TARGET]
    X_test = test[FEATURES]
    y_test = test[TARGET]
    reg.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=100)

Training Details: Each fold refits the regressor from scratch on its training window, with early stopping monitored against the held-out year; verbose=100 prints the evaluation metric every 100 boosting rounds.
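
To verify that each fold respects temporal order, the date span of every split can be printed; a small illustrative check that is not part of the pipeline itself:

# Show the time range covered by each fold's train and validation windows
for fold, (train_idx, val_idx) in enumerate(tss.split(df)):
    train_dates = df.index[train_idx]
    val_dates = df.index[val_idx]
    print(f'Fold {fold}: train {train_dates.min().date()} -> {train_dates.max().date()}, '
          f'val {val_dates.min().date()} -> {val_dates.max().date()}')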

Evaluation

The model’s performance is evaluated using the root mean squared error (RMSE) across cross-validation folds. The average RMSE is reported, and future predictions are generated for a one-year horizon (2018-08-03 to 2019-08-01).

from sklearn.metrics import mean_squared_error
import numpy as np

preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    # ... (same per-fold feature creation and fitting as in the Training section)
    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)

print(f'Average RMSE across folds: {np.mean(scores):0.4f}')
print(f'Fold scores: {scores}')
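
# As described above, the final model is retrained on the entire dataset
# before forecasting. A minimal sketch; the exact retraining call is an
# assumption, since the source does not show this step explicitly.
X_all = df[FEATURES]
y_all = df[TARGET]
reg.fit(X_all, y_all, eval_set=[(X_all, y_all)], verbose=100)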

# Future predictions
future = pd.date_range('2018-08-03', '2019-08-01', freq='1h')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
future_w_features = df_and_future.query('isFuture').copy()
future_w_features['pred'] = reg.predict(future_w_features[FEATURES])

future_w_features['pred'].plot(figsize=(10, 5), color=sns.color_palette()[4], title='Future Predictions')
plt.show()

Evaluation Insights: Averaging RMSE over the folds gives a more reliable out-of-sample estimate than a single split, and the future frame is built by concatenating an empty hourly index onto the history so that the exact same create_features and add_lags pipeline produces the model inputs.

Best Practices and Future Improvements

Best Practices:

- Use time-ordered validation (TimeSeriesSplit with a gap) rather than random splits to avoid leaking future information.
- Apply one feature pipeline (create_features, add_lags) to both the training data and the future frame.
- Let early stopping choose the effective number of trees instead of hand-tuning n_estimators.
- Visualize the raw series, outlier filter, and split boundaries before modeling.

Future Improvements:

- Tune hyperparameters (e.g., max_depth, learning_rate) with a time-aware search.
- Add exogenous features such as weather or holiday indicators.
- Report prediction intervals alongside point forecasts.

Dependencies: Key libraries include xgboost, pandas, numpy, matplotlib, seaborn, scikit-learn, rich, and kagglehub. See pyproject.toml for details.

Dataset: Hourly energy consumption data from PJM Interconnection, available via main.py using KaggleHub.
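
For reference, a minimal download sketch using kagglehub; the dataset handle below is an assumption and may differ from what main.py uses:

import kagglehub

# Download the hourly energy consumption dataset (handle assumed, not confirmed)
path = kagglehub.dataset_download('robikscube/hourly-energy-consumption')
print('Dataset files downloaded to:', path)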

Author: @frosty-8
License: MIT

© 2025 Energy Consumption Forecasting Project