使用 LOESS 进行季节性趋势分解 (STL)¶
此笔记本演示了 STL
的使用,它将时间序列分解为三个部分:趋势、季节性和残差。STL 使用 LOESS(局部加权回归)来提取三个部分的平滑估计。 STL
的关键输入是
season
- 季节性平滑器的长度。必须为奇数。trend
- 趋势平滑器的长度,通常约为season
的 150%。必须为奇数且大于season
。low_pass
- 低通估计窗口的长度,通常是大于数据周期性的最小奇数。
首先,我们导入所需的包,准备图形环境并准备数据。
[1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
sns.set_style("darkgrid")
[2]:
plt.rc("figure", figsize=(16, 12))
plt.rc("font", size=13)
大气 CO2¶
Cleveland、Cleveland、McRae 和 Terpenning(1990 年)的示例使用 CO2 数据,这些数据在下面的列表中。此月度数据(1959 年 1 月至 1987 年 12 月)在整个样本中具有明显的趋势和季节性。
[3]:
co2 = [
315.58,
316.39,
316.79,
317.82,
318.39,
318.22,
316.68,
315.01,
314.02,
313.55,
315.02,
315.75,
316.52,
317.10,
317.79,
319.22,
320.08,
319.70,
318.27,
315.99,
314.24,
314.05,
315.05,
316.23,
316.92,
317.76,
318.54,
319.49,
320.64,
319.85,
318.70,
316.96,
315.17,
315.47,
316.19,
317.17,
318.12,
318.72,
319.79,
320.68,
321.28,
320.89,
319.79,
317.56,
316.46,
315.59,
316.85,
317.87,
318.87,
319.25,
320.13,
321.49,
322.34,
321.62,
319.85,
317.87,
316.36,
316.24,
317.13,
318.46,
319.57,
320.23,
320.89,
321.54,
322.20,
321.90,
320.42,
318.60,
316.73,
317.15,
317.94,
318.91,
319.73,
320.78,
321.23,
322.49,
322.59,
322.35,
321.61,
319.24,
318.23,
317.76,
319.36,
319.50,
320.35,
321.40,
322.22,
323.45,
323.80,
323.50,
322.16,
320.09,
318.26,
317.66,
319.47,
320.70,
322.06,
322.23,
322.78,
324.10,
324.63,
323.79,
322.34,
320.73,
319.00,
318.99,
320.41,
321.68,
322.30,
322.89,
323.59,
324.65,
325.30,
325.15,
323.88,
321.80,
319.99,
319.86,
320.88,
322.36,
323.59,
324.23,
325.34,
326.33,
327.03,
326.24,
325.39,
323.16,
321.87,
321.31,
322.34,
323.74,
324.61,
325.58,
326.55,
327.81,
327.82,
327.53,
326.29,
324.66,
323.12,
323.09,
324.01,
325.10,
326.12,
326.62,
327.16,
327.94,
329.15,
328.79,
327.53,
325.65,
323.60,
323.78,
325.13,
326.26,
326.93,
327.84,
327.96,
329.93,
330.25,
329.24,
328.13,
326.42,
324.97,
325.29,
326.56,
327.73,
328.73,
329.70,
330.46,
331.70,
332.66,
332.22,
331.02,
329.39,
327.58,
327.27,
328.30,
328.81,
329.44,
330.89,
331.62,
332.85,
333.29,
332.44,
331.35,
329.58,
327.58,
327.55,
328.56,
329.73,
330.45,
330.98,
331.63,
332.88,
333.63,
333.53,
331.90,
330.08,
328.59,
328.31,
329.44,
330.64,
331.62,
332.45,
333.36,
334.46,
334.84,
334.29,
333.04,
330.88,
329.23,
328.83,
330.18,
331.50,
332.80,
333.22,
334.54,
335.82,
336.45,
335.97,
334.65,
332.40,
331.28,
330.73,
332.05,
333.54,
334.65,
335.06,
336.32,
337.39,
337.66,
337.56,
336.24,
334.39,
332.43,
332.22,
333.61,
334.78,
335.88,
336.43,
337.61,
338.53,
339.06,
338.92,
337.39,
335.72,
333.64,
333.65,
335.07,
336.53,
337.82,
338.19,
339.89,
340.56,
341.22,
340.92,
339.26,
337.27,
335.66,
335.54,
336.71,
337.79,
338.79,
340.06,
340.93,
342.02,
342.65,
341.80,
340.01,
337.94,
336.17,
336.28,
337.76,
339.05,
340.18,
341.04,
342.16,
343.01,
343.64,
342.91,
341.72,
339.52,
337.75,
337.68,
339.14,
340.37,
341.32,
342.45,
343.05,
344.91,
345.77,
345.30,
343.98,
342.41,
339.89,
340.03,
341.19,
342.87,
343.74,
344.55,
345.28,
347.00,
347.37,
346.74,
345.36,
343.19,
340.97,
341.20,
342.76,
343.96,
344.82,
345.82,
347.24,
348.09,
348.66,
347.90,
346.27,
344.21,
342.88,
342.58,
343.99,
345.31,
345.98,
346.72,
347.63,
349.24,
349.83,
349.10,
347.52,
345.43,
344.48,
343.89,
345.29,
346.54,
347.66,
348.07,
349.12,
350.55,
351.34,
350.80,
349.10,
347.54,
346.20,
346.20,
347.44,
348.67,
]
co2 = pd.Series(
co2, index=pd.date_range("1-1-1959", periods=len(co2), freq="M"), name="CO2"
)
co2.describe()
/tmp/ipykernel_3881/1071343913.py:352: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
co2, index=pd.date_range("1-1-1959", periods=len(co2), freq="M"), name="CO2"
[3]:
count 348.000000
mean 330.123879
std 10.059747
min 313.550000
25% 321.302500
50% 328.820000
75% 338.002500
max 351.340000
Name: CO2, dtype: float64
分解需要 1 个输入,即数据序列。如果数据序列没有频率,那么你还必须指定 period
。 seasonal
的默认值为 7,因此在大多数应用程序中也应该更改。
[4]:
from statsmodels.tsa.seasonal import STL
stl = STL(co2, seasonal=13)
res = stl.fit()
fig = res.plot()
稳健拟合¶
设置 robust
将使用数据依赖的加权函数,该函数在估计 LOESS 时重新加权数据(因此使用 LOWESS)。使用稳健估计允许模型容忍在底部图中可见的较大误差。
这里我们使用一个测量欧盟电气设备生产量的序列。
[5]:
from statsmodels.datasets import elec_equip as ds
elec_equip = ds.load().data.iloc[:, 0]
接下来,我们使用和不使用稳健加权估计模型。差异很小,在 2008 年金融危机期间最为明显。非稳健估计对所有观察结果赋予相同的权重,因此平均产生较小的误差。权重在 0 到 1 之间变化。
[6]:
def add_stl_plot(fig, res, legend):
"""Add 3 plots from a second STL fit"""
axs = fig.get_axes()
comps = ["trend", "seasonal", "resid"]
for ax, comp in zip(axs[1:], comps):
series = getattr(res, comp)
if comp == "resid":
ax.plot(series, marker="o", linestyle="none")
else:
ax.plot(series)
if comp == "trend":
ax.legend(legend, frameon=False)
stl = STL(elec_equip, period=12, robust=True)
res_robust = stl.fit()
fig = res_robust.plot()
res_non_robust = STL(elec_equip, period=12, robust=False).fit()
add_stl_plot(fig, res_non_robust, ["Robust", "Non-robust"])
[7]:
fig = plt.figure(figsize=(16, 5))
lines = plt.plot(res_robust.weights, marker="o", linestyle="none")
ax = plt.gca()
xlim = ax.set_xlim(elec_equip.index[0], elec_equip.index[-1])
LOESS 度数¶
默认配置使用常数和趋势来估计 LOESS 模型。可以通过将 COMPONENT_deg
设置为 0 来更改为仅包含常数。这里,度数几乎没有区别,除了在 2008 年金融危机期间的趋势之外。
[8]:
stl = STL(
elec_equip, period=12, seasonal_deg=0, trend_deg=0, low_pass_deg=0, robust=True
)
res_deg_0 = stl.fit()
fig = res_robust.plot()
add_stl_plot(fig, res_deg_0, ["Degree 1", "Degree 0"])
性能¶
可以使用三种选项来降低 STL 分解的计算成本
seasonal_jump
trend_jump
low_pass_jump
当这些值不为零时,组件 COMPONENT
的 LOESS 仅在每 COMPONENT_jump
个观察值进行估计,并且在点之间使用线性插值。这些值通常不应超过 seasonal
、trend
或 low_pass
大小的 10-20%。
下面的示例展示了如何使用模拟数据(具有低频余弦趋势和正弦季节模式)将计算成本降低 15 倍。
[9]:
import numpy as np
rs = np.random.RandomState(0xA4FD94BC)
tau = 2000
t = np.arange(tau)
period = int(0.05 * tau)
seasonal = period + ((period % 2) == 0) # Ensure odd
e = 0.25 * rs.standard_normal(tau)
y = np.cos(t / tau * 2 * np.pi) + 0.25 * np.sin(t / period * 2 * np.pi) + e
plt.plot(y)
plt.title("Simulated Data")
xlim = plt.gca().set_xlim(0, tau)
首先,使用所有跳跃都等于 1 来估计基线模型。
[10]:
mod = STL(y, period=period, seasonal=seasonal)
%timeit mod.fit()
res = mod.fit()
fig = res.plot(observed=False, resid=False)
284 ms ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
所有跳跃都设置为其窗口长度的 15%。有限的线性插值对模型的拟合几乎没有影响。
[11]:
low_pass_jump = seasonal_jump = int(0.15 * (period + 1))
trend_jump = int(0.15 * 1.5 * (period + 1))
mod = STL(
y,
period=period,
seasonal=seasonal,
seasonal_jump=seasonal_jump,
trend_jump=trend_jump,
low_pass_jump=low_pass_jump,
)
%timeit mod.fit()
res = mod.fit()
fig = res.plot(observed=False, resid=False)
22.9 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
使用 STL 进行预测¶
STLForecast
简化了使用 STL 来消除季节性,然后使用标准时间序列模型来预测趋势和循环部分的过程。
这里,我们使用 STL 来处理季节性,然后使用 ARIMA(1,1,0) 来建模去季节化数据。季节性成分是根据找到的完整周期进行预测的,其中
其中 \(k= m - h + m \lfloor \frac{h-1}{m} \rfloor\)。预测会自动将季节性成分预测添加到 ARIMA 预测中。
[12]:
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.forecasting.stl import STLForecast
elec_equip.index.freq = elec_equip.index.inferred_freq
stlf = STLForecast(elec_equip, ARIMA, model_kwargs=dict(order=(1, 1, 0), trend="t"))
stlf_res = stlf.fit()
forecast = stlf_res.forecast(24)
plt.plot(elec_equip)
plt.plot(forecast)
plt.show()
summary
包含有关时间序列模型和 STL 分解的信息。
[13]:
print(stlf_res.summary())
STL Decomposition and SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 257
Model: ARIMA(1, 1, 0) Log Likelihood -522.434
Date: Thu, 03 Oct 2024 AIC 1050.868
Time: 15:45:57 BIC 1061.504
Sample: 01-01-1995 HQIC 1055.146
- 05-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.1171 0.118 0.995 0.320 -0.113 0.348
ar.L1 -0.0435 0.049 -0.880 0.379 -0.140 0.053
sigma2 3.4682 0.188 18.406 0.000 3.099 3.837
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 223.01
Prob(Q): 0.92 Prob(JB): 0.00
Heteroskedasticity (H): 0.33 Skew: -0.26
Prob(H) (two-sided): 0.00 Kurtosis: 7.54
STL Configuration
=================================================================================
Period: 12 Trend Length: 23
Seasonal: 7 Trend deg: 1
Seasonal deg: 1 Trend jump: 1
Seasonal jump: 1 Low pass: 13
Robust: False Low pass deg: 1
---------------------------------------------------------------------------------
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).