Forecasting Paediatric Emergency Department Attendances¶

HPDM097 – Making a Difference with Health Data

Executive Summary¶

This project helps an NHS trust forecast the daily number of paediatric emergency department attendances over the next 28 days.

Analysis of historical attendance data showed patterns of higher attendances on certain days of the week and at certain times of the year. There was substantial day-to-day variability in attendances, but no evidence of a sustained long-term trend.

The following forecasting approaches were evaluated.

  1. Simple naïve methods, assessed first to establish a baseline for comparison.

  2. SARIMAX, which captures autocorrelation and seasonal structure through autoregressive and moving-average components.

  3. Prophet, an additive regression model that sums trend, seasonality, and changepoint effects.

Forecast accuracy was assessed using standard error metrics across the full 28-day forecast horizon, using rolling-origin cross-validation.

Prophet consistently outperformed both the naïve benchmarks and the SARIMAX model. On this basis, Prophet was selected for final forecasting.

A final 28-day forecast with prediction intervals was produced to support staffing decisions over the next four weeks. This forecast predicts rising demand over the four-week period, and recommendations are made on this basis.

1.1 Introduction¶

Paediatric emergency attendances are influenced by factors which can make demand variable and difficult to predict: seasonal patterns over the months of the year, patterns over the week, and variation driven by the school term calendar. Demand forecasting supports service delivery by producing short-term forecasts that help hospitals plan staffing levels and manage operational pressures.

1.2 Objectives¶

The objectives of this project are to:

  • Explore patterns in paediatric emergency department attendances
  • Evaluate naïve and statistical forecasting methods for short-term demand
  • Select a suitable forecasting approach
  • Produce a reproducible 28-day forecast of paediatric ED attendances

2 Setup¶

This analysis was conducted in Python using the HDS_forecast environment provided by Exeter University for this module. All results are fully reproducible by running the notebook from top to bottom, within the HDS_forecast environment.

Alternatively, a matching environment can be created using the provided demand_forecast.yml file:

conda env create -f demand_forecast.yml

conda activate demand_forecast

The historical attendance data is included in the file paediatrics_train.csv.

In [23]:
# ------------------------------------------------------------
# 2. Import libraries and data path
# ------------------------------------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet
from pmdarima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
import logging
from IPython.display import display

# Path to data file
DATA_FILE = "paediatrics_train.csv"

# Silence warnings and logs to keep the notebook clean
warnings.filterwarnings("ignore")
logging.disable(logging.CRITICAL)

3. Data Description and Initial Analysis¶

The dataset consists of counts of paediatric emergency department attendances between April 2014 and February 2017. Observations are recorded at daily frequency.

In [24]:
# ------------------------------------------------------------
# 3.1 Load and inspect data
# ------------------------------------------------------------

df = pd.read_csv(DATA_FILE, parse_dates=["date"], index_col="date")
df = df.asfreq("D")

print(df.head(), "\n")
print(df.info())
            paed_ed_attends
date                       
2014-04-01               47
2014-04-02               46
2014-04-03               47
2014-04-04               48
2014-04-05               52 

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1056 entries, 2014-04-01 to 2017-02-19
Freq: D
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   paed_ed_attends  1056 non-null   int64
dtypes: int64(1)
memory usage: 16.5 KB
None

3.2 Data Quality Checks¶

The data was checked for missing values, duplicate or missing dates, and negative or non-integer counts. None were identified.

In [25]:
# ------------------------------------------------------------
# 3.2 Data Quality Checks
# ------------------------------------------------------------

# Missing values
missing_values = df.isna().sum()

# Duplicate dates
duplicate_dates = df.index.duplicated().sum()

# Check for continuous daily frequency
full_date_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq="D")
missing_dates = full_date_range.difference(df.index)

# Validity of attendance counts
negative_counts = (df["paed_ed_attends"] < 0).sum()
non_integer_counts = (df["paed_ed_attends"] % 1 != 0).sum()

# Display results
print("Missing values per column:")
print(missing_values, "\n")

print(f"Number of duplicate dates: {duplicate_dates}")
print(f"Number of missing dates in daily sequence: {len(missing_dates)}")
print(f"Negative attendance values: {negative_counts}")
print(f"Non-integer attendance values: {non_integer_counts}")
Missing values per column:
paed_ed_attends    0
dtype: int64 

Number of duplicate dates: 0
Number of missing dates in daily sequence: 0
Negative attendance values: 0
Non-integer attendance values: 0

3.3 Train Test Split¶

The final 28 days of the available time series were reserved as a hold-out test set. This test set represents the future period for which staffing decisions would need to be made in practice.

The test data were not used during exploratory analysis, model fitting, or model selection. All model development and comparison were carried out using the remaining historical data (the training set), ensuring that forecast performance is evaluated on genuinely unseen observations.

In [26]:
# ------------------------------------------------------------
# 3.3 Train-Test split
# ------------------------------------------------------------

HORIZON = 28

ts_full = df["paed_ed_attends"].astype(float)
ts_train = ts_full.iloc[:-HORIZON]
ts_test = ts_full.iloc[-HORIZON:]

# Print summary
print(f"Total observations: {len(ts_full)}")
print(f"Training observations: {len(ts_train)}")
print(f"Test observations: {len(ts_test)}")

print("\nTraining period:")
print(f"  {ts_train.index.min().date()} to {ts_train.index.max().date()}")

print("\nTest period (held out):")
print(f"  {ts_test.index.min().date()} to {ts_test.index.max().date()}")
Total observations: 1056
Training observations: 1028
Test observations: 28

Training period:
  2014-04-01 to 2017-01-22

Test period (held out):
  2017-01-23 to 2017-02-19

3.4 Overview of Time Series Behaviour¶

Attendances show day-to-day variability with no strong long-term trend, but with periodic peaks that may reflect winter respiratory illness.

In [27]:
# ------------------------------------------------------------
# 3.4 Visualise the data
# ------------------------------------------------------------

# Line plot of training data only
plt.figure(figsize=(12, 4))
plt.plot(ts_train.index, ts_train.values)
plt.xlabel("Date")
plt.ylabel("Daily paediatric ED attendances")
plt.title("Daily paediatric ED attendances over time (training data only)")
plt.tight_layout()
plt.show()
[Figure: Daily paediatric ED attendances over time (training data only)]

3.5 Seasonal effects¶

To explore longer-term seasonal effects, mean daily attendances per month were aggregated across years. Peaks in November and March suggest seasonal effects consistent with winter respiratory illness.

In [28]:
# ------------------------------------------------------------
# 3.5.1 Visualise seasonality by calendar month
# ------------------------------------------------------------

# Convert series to DataFrame for feature engineering
train_month = ts_train.to_frame(name="paed_ed_attends")

# Add month information
train_month["month"] = train_month.index.month
train_month["month_name"] = train_month.index.month_name()

# Calculate mean daily attendances per calendar month
monthly_mean = (
    train_month.groupby(["month", "month_name"])["paed_ed_attends"]
    .mean()
    .reset_index()
    .sort_values("month")
)

# Plot mean daily attendances by month
plt.figure(figsize=(8, 4))
plt.plot(monthly_mean["month_name"], monthly_mean["paed_ed_attends"], marker="o")
plt.xlabel("Month")
plt.ylabel("Mean daily attendances")
plt.title("Mean daily paediatric ED attendances by month (training data)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
[Figure: Mean daily paediatric ED attendances by month (training data)]

3.5.2 Short term effects¶

To examine within-week variation, attendances were grouped by day of week and summarised using a boxplot. A weekly pattern is suggested, with higher attendances on Sundays and Mondays.

In [29]:
# ------------------------------------------------------------
# 3.5.2 Visualise seasonality by day of week
# ------------------------------------------------------------

# Convert series to DataFrame
train_dayofweek = ts_train.to_frame(name="paed_ed_attends")

# Add day-of-week
train_dayofweek["day_name"] = train_dayofweek.index.day_name()

# Order for days of the week
weekday_order = [
    "Monday",
    "Tuesday",
    "Wednesday",
    "Thursday",
    "Friday",
    "Saturday",
    "Sunday",
]

train_dayofweek["day_name"] = pd.Categorical(
    train_dayofweek["day_name"], categories=weekday_order, ordered=True
)

# Boxplot by day of week (pandas boxplot with `by` creates its own figure,
# so pass figsize here rather than calling plt.figure first, which would
# leave an empty figure behind)
train_dayofweek.boxplot(
    column="paed_ed_attends", by="day_name", grid=False, showfliers=False,
    figsize=(8, 4),
)
plt.suptitle("")
plt.title("Paediatric ED attendances by day of week (training data)")
plt.xlabel("Day of week")
plt.ylabel("Daily attendances")
plt.tight_layout()
plt.show()
[Figure: Paediatric ED attendances by day of week (training data)]

3.6 Distribution and variability¶

The distribution of daily attendances was examined. Observations cluster around a central range. There are occasional higher-demand days. There are no extreme outliers. Inspection of rolling statistics suggests variability is broadly stable over time.

These characteristics support the use of absolute-error-based evaluation metrics such as mean absolute error (MAE).

In [30]:
# ------------------------------------------------------------
# 3.6.1 Visualise distribution of daily attendances
# ------------------------------------------------------------

# Histogram of daily attendances (distribution)
plt.figure(figsize=(8, 4))
plt.hist(ts_train, bins=30)
plt.xlabel("Daily attendances")
plt.ylabel("Frequency")
plt.title("Distribution of daily paediatric ED attendances (training data)")
plt.tight_layout()
plt.show()
[Figure: Distribution of daily paediatric ED attendances (training data)]
In [31]:
# ------------------------------------------------------------
# 3.6.2 Visualise rolling mean and variability
# ------------------------------------------------------------

# Rolling mean and variability of daily attendances (line plot)
rolling_mean = ts_train.rolling(30).mean()
rolling_std = ts_train.rolling(30).std()

plt.figure(figsize=(12, 4))
plt.plot(rolling_mean, label="30-day rolling mean")
plt.plot(rolling_std, label="30-day rolling std")
plt.legend()
plt.title("Rolling mean and variability (training data)")
plt.tight_layout()
plt.show()
[Figure: Rolling mean and variability (training data)]

3.7 Implications for Forecasting¶

In summary, daily paediatric emergency department attendances show clear seasonal patterns. Winter months tend to be busier than summer months, and across the week attendances are highest on Sundays and Mondays. Day-to-day attendance numbers vary considerably, but this variability appears broadly stable over time. There is no clear evidence of a sustained long-term increase or decrease in attendances over the study period. Taken together, these findings suggest a forecasting approach that makes use of recent observations and seasonally similar days.
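The idea of borrowing from seasonally similar days can be sketched as a simple baseline: forecast each future day as the average of the last few observations on the same weekday. A minimal standalone sketch with invented values (an illustration of the idea, not one of the models evaluated below):

```python
import pandas as pd
import numpy as np

# Toy history with a repeating weekly pattern; values are invented.
idx = pd.date_range("2016-01-04", periods=35, freq="D")  # starts on a Monday
values = np.tile([60, 45, 44, 46, 48, 50, 58], 5).astype(float)
history = pd.Series(values, index=idx)


def seasonal_mean_forecast(history: pd.Series, horizon: int, n_weeks: int = 4) -> pd.Series:
    """Forecast each future day as the mean of the last `n_weeks` same-weekday values."""
    future_idx = pd.date_range(
        history.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D"
    )
    preds = []
    for d in future_idx:
        same_weekday = history[history.index.dayofweek == d.dayofweek]
        preds.append(same_weekday.iloc[-n_weeks:].mean())
    return pd.Series(preds, index=future_idx)


print(seasonal_mean_forecast(history, horizon=7))
```

With a purely repeating weekly pattern this baseline reproduces the pattern exactly; on real data it smooths out day-level noise while retaining the weekly shape.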

4. Forecasting Framework¶

This section outlines the approach used to evaluate forecasting models.

4.1 Forecasting Task Definition¶

The forecasting task is to predict daily paediatric emergency department attendances over a 28-day horizon. This horizon reflects the Trust’s operational need to plan staffing levels several weeks in advance, balancing rota preparation with the inherent uncertainty of emergency care demand.

The primary objective of the forecasting exercise is therefore short-term accuracy, rather than long-term trend estimation. Forecasts are required at a daily resolution, and performance is assessed in terms of how closely predicted attendances match observed values over the forecast horizon.

4.2 Error Metrics¶

Forecast accuracy was evaluated using three error metrics. Mean Absolute Error (MAE) measures the average absolute difference between observed and forecast values, providing a direct and easily interpretable measure of typical forecast error in the original units of the data.

Root Mean Squared Error (RMSE) also measures forecast error in the original units, but squares errors before averaging, penalising larger errors more heavily. This makes RMSE sensitive to large deviations and useful when large forecast errors are undesirable, such as during periods of peak demand.

Symmetric Mean Absolute Percentage Error (sMAPE) expresses forecast error as a percentage of the average of observed and forecast values, allowing performance to be compared across different scales.

Considering these metrics together allows selection of a model that performs well across different aspects of forecast accuracy.

In [32]:
# ------------------------------------------------------------
# 4.2 Error metric functions
# ------------------------------------------------------------


def mae(y_true, y_pred):
    """
    Compute Mean Absolute Error (MAE).

    MAE measures the average absolute difference between observed
    and predicted values, expressed in the original units of the data.
    """
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """
    Compute Root Mean Squared Error (RMSE).

    RMSE penalises larger errors more heavily than MAE and is sensitive
    to occasional large forecast deviations.
    """
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def smape(y_true, y_pred):
    """
    Compute Symmetric Mean Absolute Percentage Error (sMAPE).

    sMAPE provides a scale-free measure of forecast accuracy by expressing
    errors as a percentage of the average magnitude of observed and
    predicted values.
    """
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    denom = np.where(denom == 0, 1e-8, denom)
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100)
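As a quick sanity check on these definitions, the three metrics can be computed by hand on a toy example (values invented for illustration; the arithmetic mirrors the functions defined above):

```python
import numpy as np

# Three days of observed vs forecast attendances (invented values).
y_true = np.array([50.0, 40.0, 60.0])
y_pred = np.array([45.0, 45.0, 55.0])

# MAE: every forecast is 5 patients off, so the MAE is 5.
mae_val = np.mean(np.abs(y_true - y_pred))

# RMSE: with identical absolute errors, RMSE equals MAE here.
rmse_val = np.sqrt(np.mean((y_true - y_pred) ** 2))

# sMAPE: each error is divided by the average of the observed and
# forecast values for that day, then the ratios are averaged.
denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
smape_val = np.mean(np.abs(y_true - y_pred) / denom) * 100

print(mae_val, rmse_val, round(smape_val, 2))  # 5.0 5.0 10.33
```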

4.3 Cross-Validation Strategy¶

Model performance was evaluated using rolling-origin cross-validation with an expanding training window. An initial training window of two years (730 days) was used to ensure that models were exposed to multiple seasonal cycles. At each forecast origin, models were trained on all available observations up to that date and used to generate a 28-day multi-step forecast.

The forecast origin was advanced forward by one day at each iteration (step = 1), producing a sequence of overlapping forecast evaluations. Although adjacent folds are highly correlated, this approach provides a smooth and comprehensive assessment of model performance over time and across all seasonal positions.

Cross-validation results were stored in long format, with one row per forecast origin and per forecast horizon step. This structure allows analysis of how accuracy evolves with increasing lead time.

Aggregate accuracy metrics (MAE, RMSE, and sMAPE) were subsequently derived from the long-format output by averaging errors across forecast horizons within each origin. This allowed both horizon-level and origin-level performance to be examined without repeated model refitting.

In [33]:
# ------------------------------------------------------------
# 4.3 Cross-Validation
# ------------------------------------------------------------

INITIAL_WINDOW = 365 * 2
STEP = 1


def rolling_origin_cv(
    series, horizon=28, initial_window=365 * 2, step=1, forecaster=None
):
    """
    Perform rolling-origin cross-validation using an expanding training window.

    At each forecast origin, the model is trained on all observations up to
    that origin and used to generate a multi-step forecast of length
    `horizon`. The forecast origin is then advanced forward by `step`
    observations and the process is repeated until the end of the series
    is reached.

    This function returns results in "long format", with one row per
    forecast origin and per forecast horizon step. This structure enables
    flexible analysis, including horizon-specific error evaluation
    and aggregation into summary metrics.

    Parameters
    ----------
    series : pd.Series
        Univariate time series indexed by datetime.
    horizon : int, default=28
        Number of future time steps to forecast at each origin.
    initial_window : int, default=365*2
        Number of observations used in the initial training window.
        The training window expands at each iteration.
    step : int, default=1
        Number of observations by which the forecast origin is advanced
        between successive folds.
    forecaster : callable
        Function of the form:
            forecaster(train_series, horizon) -> array-like
        Must return a forecast of length `horizon`.

    Returns
    -------
    pd.DataFrame
        Long-format cross-validation results with columns:
        - origin_date : first timestamp of the test window
        - h           : forecast lead time (1 to horizon)
        - y_true      : observed value
        - y_pred      : forecast value
        - abs_error   : absolute error at that horizon
        - sq_error    : squared error at that horizon
    """

    series = series.dropna()
    n = len(series)
    if initial_window + horizon > n:
        raise ValueError("initial_window + horizon > series length")

    results = []
    last_origin_start = n - horizon

    for origin_end in range(initial_window, last_origin_start + 1, step):
        train_series = series.iloc[:origin_end]
        test_slice = series.iloc[origin_end : origin_end + horizon]

        y_pred = np.asarray(forecaster(train_series, horizon), dtype=float)
        y_true = test_slice.values.astype(float)

        origin_date = test_slice.index[0]
        abs_err = np.abs(y_true - y_pred)
        sq_err = (y_true - y_pred) ** 2

        for h in range(1, horizon + 1):
            results.append(
                {
                    "origin_date": origin_date,
                    "h": h,
                    "y_true": y_true[h - 1],
                    "y_pred": y_pred[h - 1],
                    "abs_error": abs_err[h - 1],
                    "sq_error": sq_err[h - 1],
                }
            )

    return pd.DataFrame(results)


def cv_summarise(cv_long: pd.DataFrame) -> pd.DataFrame:
    """
    Create summary metrics from long-format cross-validation results.

    This function computes forecast accuracy metrics by
    averaging errors across all forecast horizons within each origin.
    It derives MAE, RMSE, and sMAPE from the error terms stored
    in the long-format output.

    Parameters
    ----------
    cv_long : pd.DataFrame
        Long-format cross-validation output generated by
        `rolling_origin_cv`.

    Returns
    -------
    pd.DataFrame
        Short summary with one row per forecast origin containing:
        - MAE   : mean absolute error across all horizons
        - RMSE  : root mean squared error across all horizons
        - sMAPE : symmetric mean absolute percentage error (%)
    """
    # Define sMAPE at the per-row level then average per origin.
    denom = np.abs(cv_long["y_true"]) + np.abs(cv_long["y_pred"])
    smape_row = np.where(
        denom == 0, 0.0, 200.0 * np.abs(cv_long["y_true"] - cv_long["y_pred"]) / denom
    )
    tmp = cv_long.assign(smape_row=smape_row)

    short = tmp.groupby("origin_date", as_index=False).agg(
        MAE=("abs_error", "mean"),
        RMSE=("sq_error", lambda x: float(np.sqrt(np.mean(x)))),
        sMAPE=("smape_row", "mean"),
    )
    return short
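To make the fold geometry concrete, here is a standalone toy sketch of the same expanding-window logic (deliberately small numbers, independent of the functions above):

```python
# Expanding-window rolling-origin folds for a 12-observation series:
# first train on observations 0..5, test on 6..8; then advance the
# origin by `step` and train on 0..7, test on 8..10.
n, initial_window, horizon, step = 12, 6, 3, 2

folds = []
last_origin_start = n - horizon
for origin_end in range(initial_window, last_origin_start + 1, step):
    train_idx = list(range(0, origin_end))
    test_idx = list(range(origin_end, origin_end + horizon))
    folds.append((train_idx, test_idx))

for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

In the notebook's actual configuration (`initial_window=730`, `step=1`, `horizon=28`) the same loop produces one fold per day from day 730 onwards.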

5. Benchmark Models (Naïve Methods)¶

5.1 Rationale for Benchmark Models¶

In time series forecasting, naïve methods often perform well, particularly when strong seasonal patterns are present. These methods provide a useful reference point for evaluating whether more complex models offer meaningful improvement.

5.2.1 Naïve (Last Observation) Forecast¶

The naïve forecast assumes that future attendances will be equal to the most recently observed value.

5.2.2 Seasonal Naïve (Weekly) Forecast¶

The seasonal naïve forecast accounts for the observed weekly pattern in attendances. Forecasts are generated by repeating the values observed in the most recent week, so that, for example, a future Monday is forecast using the most recent Monday’s attendance.

5.3 Benchmark Model Implementation¶

This code defines the naïve and seasonal naïve forecasting functions, which take historical attendance data as input and return a 28-day forecast.

In [34]:
# ------------------------------------------------------------
# 5.3 Benchmark forecasters
# ------------------------------------------------------------


def forecast_naive1(train_series: pd.Series, horizon: int) -> np.ndarray:
    """
    Naive benchmark forecaster (NF1).

    Forecasts all future time steps as the last observed value in the
    training series.
    """
    last_val = float(train_series.iloc[-1])
    return np.repeat(last_val, horizon)


def forecast_snaive_weekly(train_series: pd.Series, horizon: int) -> np.ndarray:
    """
    Seasonal naive benchmark forecaster (weekly).

    Forecasts future values by repeating the most recent observed
    weekly pattern (last 7 days).
    """
    season_length = 7
    last_week = train_series.iloc[-season_length:].to_numpy(dtype=float)
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_week, reps)[:horizon]
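The tiling step in forecast_snaive_weekly can be seen in isolation. A standalone sketch with invented values, showing how np.tile plus slicing also handles horizons that are not a multiple of seven:

```python
import numpy as np

# Last observed week (invented values, one per weekday).
last_week = np.array([58.0, 60.0, 45.0, 44.0, 46.0, 48.0, 50.0])

horizon = 10  # deliberately not a multiple of 7
reps = int(np.ceil(horizon / 7))       # enough full-week repeats to cover the horizon
forecast = np.tile(last_week, reps)[:horizon]  # then truncate to the horizon

print(forecast)  # the 7-day pattern, then the first 3 values repeated
```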

5.4 Benchmark Model Evaluation¶

Benchmark models were evaluated using rolling-origin cross-validation, with accuracy assessed using mean absolute error (MAE) across the 28-day forecast horizon. The naïve and weekly seasonal naïve benchmarks performed almost identically, with very similar overall MAE (9.31 vs 9.33). The seasonal naïve model produced slightly smoother forecasts and narrower prediction intervals across lead times, and was adopted as the benchmark for comparison.

In [35]:
# ------------------------------------------------------------
# 5.4.1 Benchmark Model Evaluation (Rolling origin CV on training data)
# ------------------------------------------------------------
y_train = ts_train

cv_naive1 = rolling_origin_cv(
    series=y_train,
    initial_window=INITIAL_WINDOW,
    step=STEP,
    forecaster=forecast_naive1,
)

cv_snaive = rolling_origin_cv(
    series=y_train,
    initial_window=INITIAL_WINDOW,
    step=STEP, 
    forecaster=forecast_snaive_weekly,
)


# Overall MAE across all origins and horizons
summary = pd.DataFrame(
    {
        "Model": ["Naive1 (last value)", "Seasonal Naive (weekly)"],
        "Mean MAE (h=1..28)": [
            cv_naive1["abs_error"].mean(),
            cv_snaive["abs_error"].mean(),
        ],
    }
).sort_values("Mean MAE (h=1..28)")

display(summary)
   Model                     Mean MAE (h=1..28)
0  Naive1 (last value)                 9.308777
1  Seasonal Naive (weekly)             9.329863
In [36]:
# ------------------------------------------------------------
# 5.4.2 visualise horizon-wise accuracy
# ------------------------------------------------------------

# Horizon-wise MAE
mae_by_h_naive1 = cv_naive1.groupby("h")["abs_error"].mean().reset_index(name="MAE")
mae_by_h_snaive = cv_snaive.groupby("h")["abs_error"].mean().reset_index(name="MAE")

# Plot horizon-wise MAE
plt.figure(figsize=(8, 4))
plt.plot(mae_by_h_naive1["h"], mae_by_h_naive1["MAE"], label="Naive1")
plt.plot(
    mae_by_h_snaive["h"],
    mae_by_h_snaive["MAE"],
    label="Seasonal Naive (weekly)",
)
plt.xlabel("Forecast lead time (days ahead)")
plt.ylabel("Mean Absolute Error (MAE)")
plt.xticks(range(1, 29, 3))
plt.title("Benchmark accuracy by forecast horizon")
plt.legend()
plt.tight_layout()
plt.show()
[Figure: Benchmark accuracy by forecast horizon]

6. Candidate Forecasting Model¶

6.1 Choice of model¶

Three advanced forecasting approaches were initially considered. SARIMAX models temporal dependence and seasonality within a statistical time-series framework. Prophet is an additive regression model designed for operational forecasting, combining trend, seasonality, and changepoint detection. A neural network model was also explored, using lagged observations and calendar features to capture non-linear patterns.

The neural network was excluded after preliminary evaluation, as it showed poor and unstable out-of-sample performance, probably due to error accumulation in recursive multi-step forecasting. The analysis therefore focused on SARIMAX and Prophet.
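The error-accumulation problem arises because recursive multi-step forecasts feed each prediction back in as the input for the next step. A minimal sketch with a fixed AR(1) coefficient (phi = 0.8 is illustrative, not estimated from the data):

```python
# Recursive multi-step forecasting with a hand-picked AR(1) coefficient.
# Each step uses the previous *prediction* rather than an observation,
# so any error made early on propagates through the rest of the horizon.
phi = 0.8
last_observed = 10.0  # deviation from the series mean, toy value

preds = []
prev = last_observed
for _ in range(5):
    prev = phi * prev  # step h depends on the prediction at step h-1
    preds.append(prev)

print([round(p, 2) for p in preds])  # [8.0, 6.4, 5.12, 4.1, 3.28]
```

The same mechanism explains why, for SARIMAX too, uncertainty grows with lead time: beyond the first few steps the model is extrapolating from its own output.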

6.2 SARIMAX¶

6.2.1 SARIMAX order selection¶

Model orders were selected using auto_arima, which identifies optimal non-seasonal and seasonal parameters by minimising an information criterion (AIC). This approach specifies the autoregressive (AR), differencing (I), and moving-average (MA) components while avoiding overfitting.

The selected specification was SARIMAX(1,1,2)(0,0,2)[7].

The non-seasonal component (1,1,2) indicates:

  • d = 1: First-order differencing was applied to remove non-stationarity in the level of the series.

  • p = 1: A first-order autoregressive term captures short-term persistence in daily observations.

  • q = 2: Two moving-average terms account for short-term dependence on recent forecast errors.

The seasonal component (0,0,2)[7] introduces a weekly seasonal moving-average structure with period 7 days, capturing within-week patterns in demand.
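To make the d = 1 component concrete, first-order differencing replaces each value with its change from the previous day; applied to the opening values of the series from section 3.1:

```python
import pandas as pd

# First five observations of the series (from section 3.1).
s = pd.Series([47, 46, 47, 48, 52])

# First-order differencing: each value minus the previous one.
# The first element has no predecessor, so it becomes NaN.
print(s.diff().tolist())  # [nan, -1.0, 1.0, 1.0, 4.0]
```

SARIMAX fits its AR and MA terms to this differenced series, which removes a slowly drifting level without needing to model it explicitly.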

In [37]:
# ------------------------------------------------------------
# 6.2.1 SARIMAX Order Selection Using auto-ARIMA
# ------------------------------------------------------------

y = ts_train.astype(float)

y0 = y.iloc[:INITIAL_WINDOW]

auto_model = auto_arima(
    y0,
    seasonal=True,
    m=7,
    stepwise=True,
    suppress_warnings=True,
    error_action="ignore",
)

ORDER = auto_model.order
SEASONAL_ORDER = auto_model.seasonal_order

print("Selected order:", ORDER)
print("Selected seasonal_order:", SEASONAL_ORDER)
Selected order: (1, 1, 2)
Selected seasonal_order: (0, 0, 2, 7)

6.2.2 SARIMAX forecasting function¶

The forecasting function refits the selected SARIMAX model to the available training data at each forecast origin.

Multi-step forecasts are generated recursively from the fitted autoregressive and seasonal structure. As the forecast horizon increases, predictions depend progressively on prior forecasts rather than observed data, resulting in increasing uncertainty and error.

In [38]:
# ------------------------------------------------------------
# 6.2.2 SARIMAX forecasting function
# ------------------------------------------------------------


def forecast_sarimax(
    train_series: pd.Series, horizon: int, order=ORDER, seasonal_order=SEASONAL_ORDER
) -> np.ndarray:
    """
    Fit a SARIMAX model to the training series and produce a multi-step forecast.

    The model is refitted at each call using the supplied training data and
    pre-selected ARIMA and seasonal orders. This function is intended for use
    within a rolling-origin cross-validation framework, where the training
    window expands over time.

    Parameters
    ----------
    train_series : pd.Series
        Univariate time series of daily paediatric ED attendances, indexed by date.
    horizon : int
        Number of days ahead to forecast.
    order : tuple
        Non-seasonal ARIMA order (p, d, q).
    seasonal_order : tuple
        Seasonal ARIMA order (P, D, Q, m), where m is the seasonal period.

    Returns
    -------
    np.ndarray
        Array of length `horizon` containing the point forecasts.
    """
    model = SARIMAX(
        train_series.astype(float),
        order=order,
        seasonal_order=seasonal_order,
        enforce_stationarity=False,
        enforce_invertibility=False,
    )
    fit = model.fit(disp=False)
    preds = fit.forecast(steps=horizon)
    return preds.to_numpy(dtype=float)

6.3 Prophet¶

Prophet is a decomposable time series model that represents the observed series as the sum of trend, seasonal, and irregular components. The model specifies a piecewise linear trend with automatically selected changepoints and models weekly and yearly seasonality using Fourier series terms. This structure allows the model to capture gradual structural shifts and flexible cyclical patterns without requiring strict stationarity assumptions.

For cross-validation, the training data were reformatted into Prophet’s required structure (with ds and y columns), and the model was refitted at each forecast origin using an expanding training window. Multi-step forecasts were then generated directly from the estimated trend and seasonal components for the subsequent 28 days.

This approach enables Prophet to produce stable multi-step forecasts by extrapolating the estimated structural components rather than relying solely on autoregressive memory.

In [39]:
# ------------------------------------------------------------
# 6.3 Prophet functions
# ------------------------------------------------------------


def to_prophet_df(y_series: pd.Series) -> pd.DataFrame:
    """
    Convert a time series into Prophet-compatible DataFrame format.

    Parameters
    ----------
    y_series : pd.Series
        Univariate time series indexed by date.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
        - 'ds': datestamps
        - 'y' : observed values
    """
    df = pd.DataFrame({"ds": y_series.index, "y": y_series.values})
    return df


def fit_predict_prophet(y_train: pd.Series, horizon: int):
    """
    Fit a Prophet model to the training data and generate a multi-step forecast.

    The model includes weekly and yearly seasonality and is refitted each time
    the function is called. This design supports use within a rolling-origin
    cross-validation framework.

    Parameters
    ----------
    y_train : pd.Series
        Training time series of daily paediatric ED attendances.
    horizon : int
        Number of days ahead to forecast.

    Returns
    -------
    tuple
        - np.ndarray of length `horizon` containing point forecasts
        - fitted Prophet model object
    """
    df_train = to_prophet_df(y_train)

    model = Prophet(
        weekly_seasonality=True, yearly_seasonality=True, daily_seasonality=False
    )
    model.fit(df_train)

    future = model.make_future_dataframe(periods=horizon, freq="D")
    forecast = model.predict(future)

    preds = forecast["yhat"].iloc[-horizon:].to_numpy()
    return preds, model


def forecast_prophet(train_series: pd.Series, horizon: int) -> np.ndarray:
    """
    Wrapper function to produce Prophet forecasts for rolling-origin CV.

    Parameters
    ----------
    train_series : pd.Series
        Training time series indexed by date.
    horizon : int
        Number of days ahead to forecast.

    Returns
    -------
    np.ndarray
        Array of length `horizon` containing the point forecasts.
    """
    preds, _ = fit_predict_prophet(train_series, horizon)
    return preds

6.4 Model comparison¶

Each forecasting method was evaluated using the same rolling-origin cross-validation scheme on the training data.

For every model, forecasts were generated at multiple forecast origins and summarised using MAE, RMSE, and sMAPE over the full 28-day horizon (Fig 6.5.1).
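The three metrics can be computed as below; this is a minimal sketch with illustrative function names (not the notebook's helpers), and note that sMAPE has several variants in the literature, of which one common definition is shown.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric MAPE in percent (one common definition)."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100.0)

# Example: two days, errors of ±10 around an observed value of 100
y_true = np.array([100.0, 100.0])
y_pred = np.array([90.0, 110.0])
```

Because sMAPE divides by the average of the observed and predicted values, it is scale-free, which makes it convenient for comparing models on attendance counts of varying magnitude.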

Lead-time error analysis requires horizon-wise cross-validation results and was therefore conducted using the long-format rolling-origin output (Fig 6.5.2).
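The long-format design can be sketched as follows. This is an illustration, not the notebook's `rolling_origin_cv` implementation; the column names `h` and `abs_error` are inferred from their use in the lead-time analysis below, and the toy naïve forecaster is purely for demonstration.

```python
import numpy as np
import pandas as pd

def rolling_origin_cv_sketch(y, initial_window, step, forecaster, horizon=28):
    """Walk the forecast origin forward through the series, refitting and
    scoring at each step. Returns long-format rows: one per (origin, h)."""
    rows = []
    origin = initial_window
    while origin + horizon <= len(y):
        train = y.iloc[:origin]
        preds = np.asarray(forecaster(train, horizon))
        actual = y.iloc[origin:origin + horizon].to_numpy()
        for h in range(1, horizon + 1):
            rows.append({
                "origin": origin,
                "h": h,                                    # lead time in days
                "y_true": actual[h - 1],
                "y_pred": preds[h - 1],
                "abs_error": abs(actual[h - 1] - preds[h - 1]),
            })
        origin += step
    return pd.DataFrame(rows)

# Toy example: a rising series and a "repeat the last value" forecaster
y = pd.Series(np.arange(40, dtype=float))
naive = lambda train, h: np.repeat(train.iloc[-1], h)
cv = rolling_origin_cv_sketch(y, initial_window=10, step=7,
                              forecaster=naive, horizon=7)
```

One row per origin-and-lead-time pair is what allows error to be grouped either by origin (for the summary metrics) or by `h` (for the lead-time chart).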

In [40]:
# ------------------------------------------------------------
# 6.4 Cross-Validated Model Comparison
# ------------------------------------------------------------

y = ts_train.astype(float)

# run CV once per model to get long-format results
cv_snaive = rolling_origin_cv(y, initial_window=INITIAL_WINDOW, step=STEP, forecaster=forecast_snaive_weekly)
cv_sarimax = rolling_origin_cv(y, initial_window=INITIAL_WINDOW, step=STEP, forecaster=forecast_sarimax)
cv_prophet = rolling_origin_cv(y, initial_window=INITIAL_WINDOW, step=STEP, forecaster=forecast_prophet)

# Summarise into short format (one row per origin)
cv_snaive_summary = cv_summarise(cv_snaive)
cv_sarimax_summary = cv_summarise(cv_sarimax)
cv_prophet_summary = cv_summarise(cv_prophet)

# add model labels
cv_snaive["model"] = cv_snaive_summary["model"] = "SNaive_weekly"
cv_sarimax["model"] = cv_sarimax_summary["model"] = "SARIMAX_auto"
cv_prophet["model"] = cv_prophet_summary["model"] = "Prophet"

# Combine cross-validation results from all models into single
# DataFrames to enable grouped comparison of performance metrics
cv_all = pd.concat([cv_snaive, cv_sarimax, cv_prophet], ignore_index=True)
cv_summary_all = pd.concat(
    [cv_snaive_summary, cv_sarimax_summary, cv_prophet_summary], ignore_index=True
)

# Summaries across all origins AND all horizons (h=1..28)
summary = cv_summary_all.groupby("model")[["MAE", "RMSE", "sMAPE"]].agg(["mean", "std"])
summary
Out[40]:
               MAE                  RMSE                 sMAPE
               mean      std        mean       std       mean       std
model
Prophet        6.317821  1.269248   7.853103   1.446243  12.680870  2.022300
SARIMAX_auto   7.824004  2.416947   9.628206   2.703484  15.622757  4.658216
SNaive_weekly  9.329863  2.557154  11.596437   2.884205  18.594940  4.653329
In [41]:
# ------------------------------------------------------------
# 6.5.1 Model Metrics Comparison Chart
# ------------------------------------------------------------

metrics = ["MAE", "RMSE", "sMAPE"]
order = ["SNaive_weekly", "SARIMAX_auto", "Prophet"]
colors = {
    "SNaive_weekly": "#4C72B0",   # blue
    "SARIMAX_auto": "#DD8452",    # orange
    "Prophet": "#55A868",         # green
}

# Summarise mean ± SD for each metric (named to avoid shadowing `summary` above)
metric_summary = {
    m: cv_summary_all.groupby("model")[m].agg(["mean", "std"]).reindex(order)
    for m in metrics
}

# Plot
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(8, 10))

x = np.arange(len(order))

for ax, m in zip(axes, metrics):
    s = metric_summary[m]
    bar_colors = [colors[model] for model in order]
    ax.bar(x, s["mean"].values, yerr=s["std"].values, capsize=4, color=bar_colors)
    ax.set_xticks(x)
    ax.set_xticklabels(order, rotation=0, ha="center")
    ax.set_ylabel(f"{m} (mean ± SD)")
    ax.set_title(m)

fig.suptitle("Fig 6.5.1: Rolling-origin CV forecast accuracy by model", y=0.98)
fig.tight_layout()
plt.show()
[Figure 6.5.1: Rolling-origin CV forecast accuracy (MAE, RMSE, sMAPE) by model]
In [42]:
# ------------------------------------------------------------
# 6.5.2 Model Lead Time Comparison Chart
# ------------------------------------------------------------

# Mean absolute error by lead time
mae_by_h = cv_all.groupby(["model", "h"])["abs_error"].mean().reset_index()

plt.figure(figsize=(8, 4))
for model in ["SNaive_weekly", "SARIMAX_auto", "Prophet"]:
    subset = mae_by_h[mae_by_h["model"] == model]
    plt.plot(subset["h"], subset["abs_error"], label=model, color=colors[model])

plt.xlabel("Forecast lead time (days)")
plt.ylabel("Mean absolute error")
plt.xticks(range(1, 29, 3))
plt.title("Fig 6.5.2: Forecast error by lead time")
plt.legend()
plt.tight_layout()
plt.show()
[Figure 6.5.2: Mean absolute forecast error by lead time, by model]

7. Model Selection and Forecasting Results¶

7.1 Model Selection¶

Figure 6.5.1 indicates that the Prophet model achieved the lowest mean MAE, RMSE, and sMAPE across cross-validation folds, outperforming both the Seasonal Naïve benchmark and the SARIMAX model. In addition to superior average accuracy, Prophet exhibited comparatively lower variability across folds, suggesting greater stability in performance over time.

Figure 6.5.2 further demonstrates that forecast error increases with lead time for all models, as expected in multi-step forecasting. However, the rate of degradation is substantially greater for the Seasonal Naïve and SARIMAX models, whereas Prophet maintains consistently lower error across most forecast horizons. This relative stability at longer lead times is particularly relevant for operational planning over multi-week periods.

On this basis, Prophet was selected as the preferred model for generating forecasts.

7.2 Final 28-Day Forecast¶

Following model selection, the Prophet model was refitted using the full available training dataset to maximise information used for parameter estimation. A 28-day multi-step forecast was then generated from the final fitted model.

The forecast includes point predictions for each day and associated 80% prediction intervals, derived from the model’s uncertainty estimates.

The resulting forecast is shown in Figure 7.1.

In [43]:
# ------------------------------------------------------------
# 7.2 Final 28-Day Forecast (Prophet)
# ------------------------------------------------------------

# Fit final model to full training data, not just the initial window.
y_full = ts_full.astype(float)

df_full = pd.DataFrame({"ds": pd.to_datetime(y_full.index), "y": y_full.values})


# Prophet's default interval_width of 0.80 yields the 80% prediction
# intervals reported with the forecast.
final_model = Prophet(
    weekly_seasonality=True, yearly_seasonality=True, daily_seasonality=False
)

final_model.fit(df_full)

# -------------------------------
# Forecast next 28 days
# -------------------------------
H = 28

future = final_model.make_future_dataframe(periods=H, freq="D")
forecast = final_model.predict(future)

fcst_28 = forecast.tail(H).copy()

forecast_mean = forecast["yhat"].iloc[-H:]
forecast_ci_lower = forecast["yhat_lower"].iloc[-H:]
forecast_ci_upper = forecast["yhat_upper"].iloc[-H:]

# -------------------------------
# Plot recent observations + forecast
# -------------------------------
plt.figure(figsize=(10, 4))

plt.plot(y_full.index[-90:], y_full.iloc[-90:], label="Recent observed")

plt.plot(fcst_28["ds"], fcst_28["yhat"], label=f"{H}-day forecast")

plt.fill_between(
    fcst_28["ds"],
    fcst_28["yhat_lower"],
    fcst_28["yhat_upper"],
    color="grey",
    alpha=0.3,
    label="80% Prediction interval",
)

plt.xlabel("Date")
plt.ylabel("Daily paediatric ED attendances")
plt.title(f"{H}-day forecast of paediatric ED attendances (Prophet)")
plt.legend()
plt.tight_layout()
plt.show()
[Figure 7.1: 28-day forecast of paediatric ED attendances (Prophet) with 80% prediction interval]

7.3 Interpretation of forecast¶

The forecast maintains clear weekly variation, consistent with historical seasonal patterns observed in the data. Predicted attendances fluctuate systematically across days of the week.

Demand is projected to rise over the 28-day horizon before stabilising toward the end of the period. This rise appears to be driven primarily by long-term seasonal structure rather than by an underlying trend.

The width of the 80% prediction interval remains broadly constant across the forecast horizon. While most daily attendances are expected to fall near the central estimate, occasional higher-demand days remain plausible near the upper bound of the interval. These projections may assist short-term operational planning, particularly staffing allocation on anticipated peak days.

8. Conclusions and Recommendations¶

This study evaluated short-term forecasting approaches for daily paediatric emergency department attendances to support operational planning within an NHS acute trust. Using a rolling-origin cross-validation framework, forecasting models were compared against naïve benchmarks over a 28-day horizon. The Prophet model consistently outperformed baseline methods and a SARIMAX alternative, providing lower forecast error while remaining interpretable and practical for routine use.

Based on these findings, the Trust is recommended to use the Prophet model as a short-term planning tool to inform paediatric ED staffing and escalation planning over a four-week horizon.

8.1 Strengths of this approach¶

The model’s treatment of weekly and yearly seasonality and ease of retraining make it well suited to operational NHS settings where forecasts must be updated regularly and understood by clinicians.

Prophet demonstrates superior performance across forecast horizons. This may be attributable to its flexible seasonal decomposition rather than trend modelling. By representing seasonality using Fourier terms, Prophet may produce smoother and more stable multi-step forecasts, whereas autoregressive models and seasonal naïve benchmarks accumulate error as the forecast horizon increases.
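This Fourier representation of seasonality can be illustrated directly. The sketch below is not Prophet's internal code, but shows the kind of sin/cos basis it constructs for a weekly seasonal component; the function name and order are illustrative.

```python
import numpy as np

def fourier_terms(t: np.ndarray, period: float, order: int) -> np.ndarray:
    """Sin/cos Fourier basis of the kind Prophet uses for seasonality.

    Returns an array with 2 * order columns; a fitted seasonal component
    is a learned weighted sum of these smooth periodic columns."""
    return np.column_stack([
        fn(2.0 * np.pi * k * t / period)
        for k in range(1, order + 1)
        for fn in (np.sin, np.cos)
    ])

t = np.arange(14, dtype=float)             # two weeks of daily time points
X = fourier_terms(t, period=7.0, order=3)  # weekly seasonality, order 3
```

Because every basis column repeats exactly with the period, the fitted weekly pattern is smooth and identical from one week to the next, which helps explain the stable multi-step behaviour noted above.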

8.2 Limitations¶

The main value of the forecasts is in identifying relative pressure and risk over upcoming weeks. Prediction intervals highlight substantial uncertainty in daily demand, and forecasts are best interpreted as indicative ranges of expected activity, helping services anticipate periods of elevated risk, rather than giving precise daily counts.

Rolling-origin cross-validation assumes that past performance is indicative of future behaviour under similar conditions. This assumption is reasonable for short-term operational planning but may not hold under regime change or structural disruption.

There is a risk of overfitting, as the model may capture patterns specific to the historical period that do not persist. In particular, unexpected structural shifts, policy changes, seasonal anomalies, or external shocks could reduce forecast accuracy. The absence of exogenous regressors (e.g., infectious disease outbreaks, weather variation) limits the model’s ability to respond to demand drivers not encoded in historical patterns. Consequently, forecasts should be interpreted as conditional on historical structure continuing.

8.3 Implications for staffing and operational planning¶

The 28-day forecasts are intended to support short-term staffing decisions. Forecast means can be used to anticipate relative pressure across upcoming weeks, while upper prediction bounds provide an indication of days where additional staffing flexibility or escalation capacity may be required. For example, periods where the upper prediction interval exceeds recent activity levels could trigger proactive review of rotas and contingency staffing arrangements.
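As an illustration of this trigger, the sketch below flags forecast days whose upper 80% bound exceeds a nominal capacity threshold. The `flag_escalation_days` helper, the toy forecast values, and the threshold are all hypothetical, chosen only to mirror the `ds`/`yhat`/`yhat_upper` columns of the Prophet forecast frame, not Trust policy.

```python
import pandas as pd

def flag_escalation_days(fcst: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return forecast days whose upper 80% bound exceeds a capacity threshold."""
    mask = fcst["yhat_upper"] > threshold
    return fcst.loc[mask, ["ds", "yhat", "yhat_upper"]]

# Toy example with made-up attendance numbers
fcst = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=4, freq="D"),
    "yhat": [50.0, 55.0, 60.0, 52.0],
    "yhat_upper": [60.0, 72.0, 75.0, 63.0],
})
risk_days = flag_escalation_days(fcst, threshold=70.0)  # two days flagged
```

In practice the threshold would come from staffing capacity rather than a fixed number, and flagged days would prompt review of rotas and contingency cover rather than automatic escalation.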