This document outlines the development pipeline for an energy consumption forecasting model built using XGBoost and time series analysis. The project processes hourly energy consumption data, applies feature engineering, and uses time-based cross-validation to predict future energy usage.
The dataset, sourced from Kaggle, contains hourly energy consumption data (PJME_MW) for the PJM Interconnection. The data is preprocessed to ensure quality and compatibility with time series modeling.
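For reference, a minimal sketch of fetching the data with kagglehub (the dataset handle below is an assumption; the project's main.py performs the actual download):

import kagglehub

# Download the Kaggle dataset to a local cache directory.
# NOTE: the handle is assumed here and may differ from what main.py uses.
path = kagglehub.dataset_download('robikscube/hourly-energy-consumption')
print(path)  # folder containing PJME_hourly.csv

The preprocessing itself proceeds as follows: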
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('dataset/PJME_hourly.csv')
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)
# Visualize raw data
df.plot(style='.', figsize=(12,6), color=sns.color_palette()[1], title='Hourly Energy Consumption')
plt.show()
# Remove outliers
df = df.query('PJME_MW > 19000').copy()
# Train-test split (hold out everything from 2014 onward; ISO dates avoid ambiguity)
train = df.loc[df.index < '2014-01-01']
test = df.loc[df.index >= '2014-01-01']
fig, ax = plt.subplots(figsize=(12, 6))
train.plot(ax=ax, title='Train/Test Split')
test.plot(ax=ax)
ax.axvline(pd.Timestamp('2014-01-01'), color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()
Additional Considerations:
Note: Visualizing outliers and splits ensures the data is correctly preprocessed before modeling.
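As an illustrative check (not part of the original pipeline), a cutoff like 19,000 MW can be sanity-checked by inspecting the target distribution before the filter is applied:

# Inspect the raw target distribution to spot implausibly low readings.
# Run this before the outlier filter above; the 19,000 MW threshold is
# the one used by the pipeline, not derived here.
print(df['PJME_MW'].describe())
df['PJME_MW'].plot(kind='hist', bins=100, figsize=(10, 4), title='PJME_MW Distribution')
plt.show()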
Feature engineering is critical for capturing temporal patterns in energy consumption. Two main types of features are created: calendar features derived from the timestamp (hour, day of week, month, and so on) and lag features that look up consumption at the same point in prior years.
def create_features(df):
    """Add calendar features derived from the datetime index."""
    df = df.copy()
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    df['weekofyear'] = df.index.isocalendar().week
    return df

def add_lags(df):
    """Look up the target 1, 2, and 3 years back, in 364-day steps."""
    target_map = df['PJME_MW'].to_dict()
    df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
    df['lag2'] = (df.index - pd.Timedelta('728 days')).map(target_map)
    df['lag3'] = (df.index - pd.Timedelta('1092 days')).map(target_map)
    return df

df = create_features(df)
df = add_lags(df)
Insights: The lag offsets are multiples of 364 days (exactly 52 weeks), so each lagged value falls on the same weekday and hour one, two, and three years earlier, which preserves weekly seasonality in the lookups.
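A quick check of that alignment, using an arbitrary timestamp:

# 364 days is exactly 52 weeks, so subtracting it preserves weekday and hour.
ts = pd.Timestamp('2015-07-01 14:00')
for days in (364, 728, 1092):
    shifted = ts - pd.Timedelta(f'{days} days')
    print(ts.day_name(), '->', shifted.day_name())  # same weekday every time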
The model uses XGBoost's XGBRegressor, a gradient-boosted tree ensemble well suited to tabular regression tasks. Key parameters include:
import xgboost as xgb

reg = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=1000,
    early_stopping_rounds=50,
    objective='reg:squarederror',  # 'reg:linear' is deprecated; squared error is its modern equivalent
    max_depth=3,
    learning_rate=0.01,
)
Architecture Insights: Shallow trees (max_depth=3) combined with a small learning rate and up to 1000 boosting rounds favor many gentle updates over deep individual trees, while early stopping after 50 stagnant rounds guards against overfitting.
The model is trained using time-based cross-validation with TimeSeriesSplit (5 folds, a one-year validation window of 24 × 365 hourly observations, and a 24-hour gap between the training and validation sets) to respect temporal order. The final model is trained on the entire dataset for future predictions, as sketched after the loop below.
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=5, test_size=24 * 365 * 1, gap=24)
FEATURES = ['dayofyear', 'hour', 'dayofweek', 'quarter', 'month', 'year', 'lag1', 'lag2', 'lag3']
TARGET = 'PJME_MW'

for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    # Recompute calendar features per fold (lags were already added globally above)
    train = create_features(train)
    test = create_features(test)
    X_train = train[FEATURES]
    y_train = train[TARGET]
    X_test = test[FEATURES]
    y_test = test[TARGET]
    reg.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=100)
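The full-history refit mentioned above is not shown in the loop. A minimal sketch, reusing the same hyperparameters (early stopping dropped and the round count reduced, since there is no held-out fold to monitor; both are assumptions), with the forecasting step below then calling reg_final.predict:

# Sketch of the final refit on all available data; the write-up describes
# this step but the exact call is not shown, so reg_final is hypothetical.
reg_final = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=500,  # assumed: fewer rounds, since early stopping cannot run without a validation set
    objective='reg:squarederror',
    max_depth=3,
    learning_rate=0.01,
)
reg_final.fit(df[FEATURES], df[TARGET])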
Training Details: Each fold refits the regressor against an eval_set containing both the training and validation data, logging progress every 100 boosting rounds (verbose=100); early stopping halts training once the validation error fails to improve for 50 consecutive rounds.
The model’s performance is evaluated using the root mean squared error (RMSE) across cross-validation folds. The average RMSE is reported, and future predictions are generated for a one-year horizon (2018-08-03 to 2019-08-01).
from sklearn.metrics import mean_squared_error
import numpy as np

preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    # ... (training code as above)
    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)

print(f'Average RMSE across folds: {np.mean(scores):0.4f}')
print(f'Fold scores: {scores}')
# Future predictions
future = pd.date_range('2018-08-03', '2019-08-01', freq='1h')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
future_w_features = df_and_future.query('isFuture').copy()
future_w_features['pred'] = reg.predict(future_w_features[FEATURES])
future_w_features['pred'].plot(figsize=(10, 5), color=sns.color_palette()[4], title='Future Predictions')
plt.show()
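As a usage example (illustrative, not from the original write-up), the forecast can be plotted against the most recent year of observed demand for context:

# Overlay the last stretch of observed demand with the one-year forecast.
fig, ax = plt.subplots(figsize=(12, 5))
df.loc[df.index >= '2017-08-01', 'PJME_MW'].plot(ax=ax, label='Observed')
future_w_features['pred'].plot(ax=ax, label='Forecast')
ax.set_title('Observed Demand and One-Year Forecast')
ax.legend()
plt.show()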
Dependencies: Key libraries include xgboost, pandas, numpy, matplotlib, seaborn, scikit-learn, rich, and kagglehub. See pyproject.toml for details.
Dataset: Hourly energy consumption data from PJM Interconnection, downloaded via main.py using KaggleHub.
Author: @frosty-8
License: MIT
© 2025 Energy Consumption Forecasting Project