广义可加模型 (GAM)

广义可加模型允许对广义线性模型中的平滑项进行惩罚估计。

参见 模块参考 以了解命令和参数。

示例

以下示例展示了高斯和泊松回归,其中分类变量被视为线性项,而两个解释变量的影响由惩罚 B 样条曲线捕获。数据来自汽车数据集 https://archive.ics.uci.edu/ml/datasets/automobile 我们可以在单元测试模块中加载一个包含选定列的 DataFrame。

In [1]: import statsmodels.api as sm

In [2]: from statsmodels.gam.api import GLMGam, BSplines

# import data
In [3]: from statsmodels.gam.tests.test_penalized import df_autos

# create spline basis for weight and hp
In [4]: x_spline = df_autos[['weight', 'hp']]

In [5]: bs = BSplines(x_spline, df=[12, 10], degree=[3, 3])

# penalization weight
In [6]: alpha = np.array([21833888.8, 6460.38479])

In [7]: gam_bs = GLMGam.from_formula('city_mpg ~ fuel + drive', data=df_autos,
   ...:                              smoother=bs, alpha=alpha)
   ...: 

In [8]: res_bs = gam_bs.fit()

In [9]: print(res_bs.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:               city_mpg   No. Observations:                  203
Model:                         GLMGam   Df Residuals:                   189.13
Model Family:                Gaussian   Df Model:                        12.87
Link Function:               Identity   Scale:                          4.8825
Method:                         PIRLS   Log-Likelihood:                -441.81
Date:                Thu, 03 Oct 2024   Deviance:                       923.45
Time:                        16:09:46   Pearson chi2:                     923.
No. Iterations:                     3   Pseudo R-squ. (CS):             0.9996
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       51.9923      1.997     26.034      0.000      48.078      55.906
fuel[T.gas]     -5.8099      0.727     -7.989      0.000      -7.235      -4.385
drive[T.fwd]     1.3910      0.819      1.699      0.089      -0.213       2.995
drive[T.rwd]     1.0638      0.842      1.263      0.207      -0.587       2.715
weight_s0       -3.5556      0.959     -3.707      0.000      -5.436      -1.676
weight_s1       -9.0876      1.750     -5.193      0.000     -12.518      -5.658
weight_s2      -13.0303      1.827     -7.132      0.000     -16.611      -9.450
weight_s3      -14.2641      1.854     -7.695      0.000     -17.897     -10.631
weight_s4      -15.1805      1.892     -8.024      0.000     -18.889     -11.472
weight_s5      -15.9557      1.963     -8.128      0.000     -19.803     -12.108
weight_s6      -16.6297      2.038     -8.161      0.000     -20.624     -12.636
weight_s7      -16.9928      2.045     -8.308      0.000     -21.002     -12.984
weight_s8      -19.3480      2.367     -8.174      0.000     -23.987     -14.709
weight_s9      -20.7978      2.455     -8.472      0.000     -25.609     -15.986
weight_s10     -20.8062      2.443     -8.517      0.000     -25.594     -16.018
hp_s0           -1.4473      0.558     -2.592      0.010      -2.542      -0.353
hp_s1           -3.4228      1.012     -3.381      0.001      -5.407      -1.438
hp_s2           -5.9026      1.251     -4.717      0.000      -8.355      -3.450
hp_s3           -7.2389      1.352     -5.354      0.000      -9.889      -4.589
hp_s4           -9.1052      1.384     -6.581      0.000     -11.817      -6.393
hp_s5           -9.9865      1.525     -6.547      0.000     -12.976      -6.997
hp_s6          -13.3639      2.228     -5.998      0.000     -17.731      -8.997
hp_s7          -13.8902      3.194     -4.349      0.000     -20.150      -7.630
hp_s8          -11.9752      2.556     -4.685      0.000     -16.985      -6.965
================================================================================

# plot smooth components
In [10]: res_bs.plot_partial(0, cpr=True)
Out[10]: <Figure size 640x480 with 1 Axes>

In [11]: res_bs.plot_partial(1, cpr=True)
Out[11]: <Figure size 640x480 with 1 Axes>

In [12]: alpha = np.array([8283989284.5829611, 14628207.58927821])

In [13]: gam_bs = GLMGam.from_formula('city_mpg ~ fuel + drive', data=df_autos,
   ....:                              smoother=bs, alpha=alpha,
   ....:                              family=sm.families.Poisson())
   ....: 

In [14]: res_bs = gam_bs.fit()

In [15]: print(res_bs.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:               city_mpg   No. Observations:                  203
Model:                         GLMGam   Df Residuals:                   194.75
Model Family:                 Poisson   Df Model:                         7.25
Link Function:                    Log   Scale:                          1.0000
Method:                         PIRLS   Log-Likelihood:                -530.38
Date:                Thu, 03 Oct 2024   Deviance:                       37.569
Time:                        16:09:46   Pearson chi2:                     37.4
No. Iterations:                     6   Pseudo R-squ. (CS):             0.7715
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        3.9960      0.130     30.844      0.000       3.742       4.250
fuel[T.gas]     -0.2398      0.057     -4.222      0.000      -0.351      -0.128
drive[T.fwd]     0.0386      0.075      0.513      0.608      -0.109       0.186
drive[T.rwd]     0.0309      0.078      0.395      0.693      -0.122       0.184
weight_s0       -0.0811      0.030     -2.689      0.007      -0.140      -0.022
weight_s1       -0.1938      0.063     -3.067      0.002      -0.318      -0.070
weight_s2       -0.3160      0.082     -3.864      0.000      -0.476      -0.156
weight_s3       -0.3735      0.090     -4.160      0.000      -0.549      -0.198
weight_s4       -0.4187      0.096     -4.360      0.000      -0.607      -0.230
weight_s5       -0.4645      0.103     -4.495      0.000      -0.667      -0.262
weight_s6       -0.5092      0.112     -4.555      0.000      -0.728      -0.290
weight_s7       -0.5469      0.119     -4.598      0.000      -0.780      -0.314
weight_s8       -0.6211      0.137     -4.528      0.000      -0.890      -0.352
weight_s9       -0.6866      0.153     -4.486      0.000      -0.987      -0.387
weight_s10      -0.7370      0.174     -4.228      0.000      -1.079      -0.395
hp_s0           -0.0247      0.010     -2.378      0.017      -0.045      -0.004
hp_s1           -0.0557      0.022     -2.479      0.013      -0.100      -0.012
hp_s2           -0.1046      0.038     -2.719      0.007      -0.180      -0.029
hp_s3           -0.1438      0.050     -2.857      0.004      -0.242      -0.045
hp_s4           -0.1919      0.063     -3.047      0.002      -0.315      -0.068
hp_s5           -0.2567      0.079     -3.231      0.001      -0.412      -0.101
hp_s6           -0.4152      0.120     -3.455      0.001      -0.651      -0.180
hp_s7           -0.4889      0.152     -3.214      0.001      -0.787      -0.191
hp_s8           -0.5470      0.195     -2.810      0.005      -0.928      -0.166
================================================================================

# Optimal penalization weights alpha can be obtained through generalized
# cross-validation or k-fold cross-validation.
# The alpha above are from the unit tests against the R mgcv package.
In [16]: gam_bs.select_penweight()[0]
Out[16]: array([8.2839e+09, 1.4628e+07])

In [17]: gam_bs.select_penweight_kfold()[0]
Out[17]: (np.float64(10000000.0), np.float64(15848.931924611108))

参考文献

  • Hastie, Trevor, and Robert Tibshirani. 1986. Generalized Additive Models. Statistical Science 1 (3): 297-310.

  • Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Texts in Statistical Science. Boca Raton, FL: Chapman & Hall/CRC.

  • Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. Second edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press/Taylor & Francis Group.

模块参考

模型类

GLMGam(endog[, exog, smoother, alpha, ...])

广义可加模型 (GAM)

LogitGam(endog, smoother, alpha, *args, **kwargs)

离散 Logit 的广义可加模型

结果类

GLMGamResults(model, params, ...)

广义可加模型 (GAM) 的结果类。

平滑基函数

目前已验证对两种样条基的支持

BSplines(x, df, degree[, include_intercept, ...])

使用 B 样条的加性平滑分量

CyclicCubicSplines(x, df[, constraints, ...])

使用循环三次回归样条的加性平滑分量

statsmodels.gam.smooth_basis 包含其他样条和 (全局) 多项式平滑基,但尚未进行验证。

GLMGam 中的分布族与 GLM 相同,相应的连接函数也相同。当前的单元测试只涵盖了高斯和泊松,GLMGam 可能会无法在 GLM 中支持所有可用的选项。


最后更新时间:2024 年 10 月 3 日