Energy Consumption Forecasting with XGBoost

This document outlines the development pipeline for an energy consumption forecasting model built using XGBoost and time series analysis. The project processes hourly energy consumption data, applies feature engineering, and uses time-based cross-validation to predict future energy usage.

Table of Contents

- Data Preprocessing
- Feature Engineering
- Model Architecture
- Training
- Evaluation
- Best Practices and Future Improvements

Data Preprocessing

The dataset, sourced from Kaggle, contains hourly energy consumption data (PJME_MW) for the PJM Interconnection. The data is preprocessed to ensure quality and compatibility with time series modeling.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('dataset/PJME_hourly.csv')
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)

# Visualize raw data
df.plot(style='.', figsize=(12,6), color=sns.color_palette()[1], title='Hourly Energy Consumption')
plt.show()

# Remove anomalously low readings (below 19,000 MW), which show up as outliers
df = df.query('PJME_MW > 19000').copy()

# Train-test split at 2014-01-01 (ISO format avoids day/month ambiguity)
train = df.loc[df.index < '2014-01-01']
test = df.loc[df.index >= '2014-01-01']

fig, ax = plt.subplots(figsize=(12, 6))
train.plot(ax=ax)
test.plot(ax=ax)
ax.axvline('2014-01-01', color='black', ls='--')
ax.set_title('Train/Test Split')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Note: Visualizing the raw series, the outlier filter, and the train/test split confirms the data is correctly preprocessed before modeling.
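
The 19,000 MW cutoff can be sanity-checked by plotting the distribution of the raw series before filtering; a minimal sketch, assuming the threshold was chosen by visual inspection:

# Histogram of raw consumption values (run before the outlier filter above);
# readings far below the bulk of the distribution are treated as outliers
df['PJME_MW'].plot(kind='hist', bins=200, figsize=(10, 4), title='Distribution of PJME_MW')
plt.axvline(19000, color='red', ls='--')  # candidate cutoff
plt.show()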

Feature Engineering

Feature engineering is critical for capturing temporal patterns in energy consumption. Two types of features are created: calendar features derived from the datetime index, and lag features that look back one, two, and three years:

def create_features(df):
    df = df.copy()
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    df['weekofyear'] = df.index.isocalendar().week
    return df

def add_lags(df):
    target_map = df['PJME_MW'].to_dict()
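    # Offsets are multiples of 364 days (exactly 52 weeks), so each lag
    # lands on the same weekday and hour in a prior year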
    df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
    df['lag2'] = (df.index - pd.Timedelta('728 days')).map(target_map)
    df['lag3'] = (df.index - pd.Timedelta('1092 days')).map(target_map)
    return df

df = create_features(df)
df = add_lags(df)

Insights: Rows earlier than the lag offsets have no historical match and receive NaN lag values, which XGBoost handles natively as missing data.
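
A quick check of the 52-week alignment (illustrative only):

# Subtracting 364 days keeps the weekday and hour fixed
ts = pd.Timestamp('2014-01-01 08:00')
print(ts.dayofweek, (ts - pd.Timedelta('364 days')).dayofweek)  # both print 2 (Wednesday)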

Model Architecture

The model uses XGBoost's XGBRegressor, a gradient-boosted tree estimator for regression tasks. Key parameters include:

import xgboost as xgb

reg = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=1000,
    early_stopping_rounds=50,
    objective='reg:squarederror',  # 'reg:linear' is a deprecated alias for this
    max_depth=3,
    learning_rate=0.01
)

Architecture Insights: The configuration favors many shallow trees (max_depth=3) learned slowly (learning_rate=0.01). n_estimators=1000 is an upper bound; early stopping halts boosting once the validation metric fails to improve for 50 consecutive rounds.

Training

The model is trained using time-based cross-validation with TimeSeriesSplit (5 folds, a one-year validation window of hourly data, and a 24-hour gap between training and validation sets) so that every split respects temporal order. The final model is then trained on the entire dataset for future predictions.

from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=5, test_size=24 * 365, gap=24)  # 1-year validation windows, 24-hour gap
FEATURES = ['dayofyear', 'hour', 'dayofweek', 'quarter', 'month', 'year', 'lag1', 'lag2', 'lag3']
TARGET = 'PJME_MW'

for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    train = create_features(train)
    test = create_features(test)
    X_train = train[FEATURES]
    y_train = train[TARGET]
    X_test = test[FEATURES]
    y_test = test[TARGET]
    reg.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=100)

Training Details: Each fold refits the regressor from scratch on its training window, with early stopping monitored against the held-out year; verbose=100 prints the evaluation metric every 100 boosting rounds.
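
To verify that each fold respects temporal order, the date span of every split can be printed; a small illustrative check that is not part of the pipeline itself:

# Show the time range covered by each fold's train and validation windows
for fold, (train_idx, val_idx) in enumerate(tss.split(df)):
    train_dates = df.index[train_idx]
    val_dates = df.index[val_idx]
    print(f'Fold {fold}: train {train_dates.min().date()} -> {train_dates.max().date()}, '
          f'val {val_dates.min().date()} -> {val_dates.max().date()}')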

Evaluation

The model’s performance is evaluated using the root mean squared error (RMSE) across cross-validation folds. The average RMSE is reported, and future predictions are generated for a one-year horizon (2018-08-03 to 2019-08-01).

from sklearn.metrics import mean_squared_error
import numpy as np

preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    # ... (same per-fold feature creation and fitting as in the Training section)
    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)

print(f'Average RMSE across folds: {np.mean(scores):0.4f}')
print(f'Fold scores: {scores}')
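
# As described above, the final model is retrained on the entire dataset
# before forecasting. A minimal sketch; the exact retraining call is an
# assumption, since the source does not show this step explicitly.
X_all = df[FEATURES]
y_all = df[TARGET]
reg.fit(X_all, y_all, eval_set=[(X_all, y_all)], verbose=100)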

# Future predictions
future = pd.date_range('2018-08-03', '2019-08-01', freq='1h')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
future_w_features = df_and_future.query('isFuture').copy()
future_w_features['pred'] = reg.predict(future_w_features[FEATURES])

future_w_features['pred'].plot(figsize=(10, 5), color=sns.color_palette()[4], title='Future Predictions')
plt.show()

Evaluation Insights: Averaging RMSE over the folds gives a more reliable out-of-sample estimate than a single split, and the future frame is built by concatenating an empty hourly index onto the history so that the exact same create_features and add_lags pipeline produces the model inputs.

Best Practices and Future Improvements

Best Practices:

- Use time-ordered validation (TimeSeriesSplit with a gap) rather than random splits to avoid leaking future information.
- Apply one feature pipeline (create_features, add_lags) to both the training data and the future frame.
- Let early stopping choose the effective number of trees instead of hand-tuning n_estimators.
- Visualize the raw series, outlier filter, and split boundaries before modeling.

Future Improvements:

- Tune hyperparameters (e.g., max_depth, learning_rate) with a time-aware search.
- Add exogenous features such as weather or holiday indicators.
- Report prediction intervals alongside point forecasts.

Dependencies: Key libraries include xgboost, pandas, numpy, matplotlib, seaborn, scikit-learn, rich, and kagglehub. See pyproject.toml for details.

Dataset: Hourly energy consumption data from PJM Interconnection, available via main.py using KaggleHub.
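
For reference, a minimal download sketch using kagglehub; the dataset handle below is an assumption and may differ from what main.py uses:

import kagglehub

# Download the hourly energy consumption dataset (handle assumed, not confirmed)
path = kagglehub.dataset_download('robikscube/hourly-energy-consumption')
print('Dataset files downloaded to:', path)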

Author: @frosty-8
License: MIT

© 2025 Energy Consumption Forecasting Project