加权广义线性模型

[1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

加权 GLM: 泊松响应数据

加载数据

在本示例中,我们将使用 affair 数据集,使用一些外生变量来预测婚外情率。

将生成权重以显示 freq_weights 等同于重复数据记录。另一方面,var_weights 等同于聚合数据。

[2]:
print(sm.datasets.fair.NOTE)
::

    Number of observations: 6366
    Number of variables: 9
    Variable name definitions:

        rate_marriage   : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
                        4 = good, 5 = very good
        age             : Age
        yrs_married     : No. years married. Interval approximations. See
                        original paper for detailed explanation.
        children        : No. children
        religious       : How relgious, 1 = not, 2 = mildly, 3 = fairly,
                        4 = strongly
        educ            : Level of education, 9 = grade school, 12 = high
                        school, 14 = some college, 16 = college graduate,
                        17 = some graduate school, 20 = advanced degree
        occupation      : 1 = student, 2 = farming, agriculture; semi-skilled,
                        or unskilled worker; 3 = white-colloar; 4 = teacher
                        counselor social worker, nurse; artist, writers;
                        technician, skilled worker, 5 = managerial,
                        administrative, business, 6 = professional with
                        advanced degree
        occupation_husb : Husband's occupation. Same as occupation.
        affairs         : measure of time spent in extramarital affairs

    See the original paper for more details.

将数据加载到 pandas 数据帧中。

[3]:
data = sm.datasets.fair.load_pandas().data

因变量 (内生变量) 是 affairs

[4]:
data.describe()
[4]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
count 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000
mean 4.109645 29.082862 9.009425 1.396874 2.426170 14.209865 3.424128 3.850141 0.705374
std 0.961430 6.847882 7.280120 1.433471 0.878369 2.178003 0.942399 1.346435 2.203374
min 1.000000 17.500000 0.500000 0.000000 1.000000 9.000000 1.000000 1.000000 0.000000
25% 4.000000 22.000000 2.500000 0.000000 2.000000 12.000000 3.000000 3.000000 0.000000
50% 4.000000 27.000000 6.000000 1.000000 2.000000 14.000000 3.000000 4.000000 0.000000
75% 5.000000 32.000000 16.500000 2.000000 3.000000 16.000000 4.000000 5.000000 0.484848
max 5.000000 42.000000 23.000000 5.500000 4.000000 20.000000 6.000000 6.000000 57.599991
[5]:
data[:3]
[5]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
0 3.0 32.0 9.0 3.0 3.0 17.0 2.0 5.0 0.111111
1 3.0 27.0 13.0 3.0 1.0 14.0 3.0 4.0 3.230769
2 4.0 22.0 2.5 0.0 1.0 16.0 3.0 5.0 1.400000

在下文中,我们将主要使用 Poisson。虽然使用十进制婚外情有效,但我们将它们转换为整数以获得计数分布。

[6]:
data["affairs"] = np.ceil(data["affairs"])
data[:3]
[6]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
0 3.0 32.0 9.0 3.0 3.0 17.0 2.0 5.0 1.0
1 3.0 27.0 13.0 3.0 1.0 14.0 3.0 4.0 4.0
2 4.0 22.0 2.5 0.0 1.0 16.0 3.0 5.0 2.0
[7]:
(data["affairs"] == 0).mean()
[7]:
np.float64(0.6775054979579014)
[8]:
np.bincount(data["affairs"].astype(int))
[8]:
array([4313,  934,  488,  180,  130,  172,    7,   21,   67,    2,    0,
          0,   17,    0,    0,    0,    3,   12,    8,    0,    0,    0,
          0,    0,    2,    2,    2,    3,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    1,    1,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    1])

压缩和聚合观察结果

我们原始数据集中有 6366 个观察结果。当我们只考虑一些选定的变量时,我们只有更少的唯一观察结果。在下文中,我们将观察结果结合起来,首先我们将所有变量的值相同的观察结果结合起来,其次我们将具有相同解释变量的观察结果结合起来。

具有唯一观察结果的数据集

我们使用 pandas 的 groupby 来结合相同的观察结果,并创建一个新的变量 freq 来计算有多少个观察结果具有对应行中的值。

[9]:
data2 = data.copy()
data2["const"] = 1
dc = (
    data2["affairs rate_marriage age yrs_married const".split()]
    .groupby("affairs rate_marriage age yrs_married".split())
    .count()
)
dc.reset_index(inplace=True)
dc.rename(columns={"const": "freq"}, inplace=True)
print(dc.shape)
dc.head()
(476, 5)
[9]:
affairs rate_marriage age yrs_married freq
0 0.0 1.0 17.5 0.5 1
1 0.0 1.0 22.0 2.5 3
2 0.0 1.0 27.0 2.5 1
3 0.0 1.0 27.0 6.0 5
4 0.0 1.0 27.0 9.0 1

具有唯一解释变量 (exog) 的数据集

对于下一个数据集,我们将具有相同解释变量值的观察结果结合起来。但是,由于响应变量在组合的观察结果之间可能不同,因此我们将计算所有组合观察结果的响应变量的平均值和总和。

我们再次使用 pandas 的 groupby 来结合观察结果并创建新的变量。我们还将 MultiIndex 展平为简单索引。

[10]:
gr = data["affairs rate_marriage age yrs_married".split()].groupby(
    "rate_marriage age yrs_married".split()
)
df_a = gr.agg(["mean", "sum", "count"])


def merge_tuple(tpl):
    if isinstance(tpl, tuple) and len(tpl) > 1:
        return "_".join(map(str, tpl))
    else:
        return tpl


df_a.columns = df_a.columns.map(merge_tuple)
df_a.reset_index(inplace=True)
print(df_a.shape)
df_a.head()
(130, 6)
[10]:
rate_marriage age yrs_married affairs_mean affairs_sum affairs_count
0 1.0 17.5 0.5 0.000000 0.0 1
1 1.0 22.0 2.5 3.900000 39.0 10
2 1.0 27.0 2.5 3.400000 17.0 5
3 1.0 27.0 6.0 0.900000 9.0 10
4 1.0 27.0 9.0 1.333333 4.0 3

结合观察结果后,我们得到一个数据帧 dc,其中包含 467 个唯一观察结果,以及一个数据帧 df_a,其中包含 130 个具有解释变量唯一值的观察结果。

[11]:
print("number of rows: \noriginal, with unique observations, with unique exog")
data.shape[0], dc.shape[0], df_a.shape[0]
number of rows:
original, with unique observations, with unique exog
[11]:
(6366, 476, 130)

分析

在下文中,我们将比较原始数据的 GLM-Poisson 结果与组合观察结果的模型,其中多重性或聚合由权重或曝光给出。

原始数据

[12]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=data,
    family=sm.families.Poisson(),
)
res_o = glm.fit()
print(res_o.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                 6366
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:07   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.2420
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[13]:
res_o.pearson_chi2 / res_o.df_resid
[13]:
np.float64(5.078702313363214)

压缩数据 (具有频率的唯一观察结果)

结合相同的观察结果并使用频率权重来考虑观察结果的多重性会产生完全相同的结果。当我们想要了解观察结果而不是所有相同观察结果的聚合时,一些结果属性会不同。例如,残差不考虑 freq_weights

[14]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    freq_weights=np.asarray(dc["freq"]),
)
res_f = glm.fit()
print(res_f.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:07   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[15]:
res_f.pearson_chi2 / res_f.df_resid
[15]:
np.float64(5.078702313363196)

使用 var_weights 而不是 freq_weights 进行压缩

接下来,我们将比较 var_weightsfreq_weights。当内生变量反映平均值而不是相同的观察结果时,使用 var_weights 是常见的做法。我认为没有理论上的理由可以解释为什么它会产生相同的结果(一般来说)。

这会产生相同的结果,但 df_residfreq_weights 示例不同,因为 var_weights 不会改变有效观察结果的数量。

[16]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    var_weights=np.asarray(dc["freq"]),
)
res_fv = glm.fit()
print(res_fv.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                      472
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:07   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

从结果计算出的离散度是不正确的,因为 df_resid 错误。如果我们使用原始 df_resid,它是正确的。

[17]:
res_fv.pearson_chi2 / res_fv.df_resid, res_f.pearson_chi2 / res_f.df_resid
[17]:
(np.float64(68.45488160512002), np.float64(5.078702313363196))

聚合或平均数据 (解释变量的唯一值)

对于这些情况,我们将具有相同解释变量值的观察结果结合起来。相应的响应变量是所有组合观察结果的总和或平均值。

使用 exposure

如果我们的因变量是所有组合观察结果的响应之和,那么在 Poisson 假设下,分布保持不变,但我们有不同的 exposure,它由一个聚合观察结果所代表的个体数量给出。

参数估计和参数协方差与原始数据相同,但对数似然、偏差和皮尔逊卡方不同

[18]:
glm = smf.glm(
    "affairs_sum ~ rate_marriage + age + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    exposure=np.asarray(df_a["affairs_count"]),
)
res_e = glm.fit()
print(res_e.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -740.75
Date:                Thu, 03 Oct 2024   Deviance:                       967.46
Time:                        15:45:07   Pearson chi2:                     926.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[19]:
res_e.pearson_chi2 / res_e.df_resid
[19]:
np.float64(7.35078910917951)

使用 var_weights

我们还可以使用所有组合观察结果的因变量的平均值。在这种情况下,方差将与一个组合观察结果所反映的总曝光的倒数相关。

[20]:
glm = smf.glm(
    "affairs_mean ~ rate_marriage + age + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    var_weights=np.asarray(df_a["affairs_count"]),
)
res_a = glm.fit()
print(res_a.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:           affairs_mean   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -5954.2
Date:                Thu, 03 Oct 2024   Deviance:                       967.46
Time:                        15:45:07   Pearson chi2:                     926.
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

比较

我们在上面的摘要打印中看到,paramscov_params 与相关的 Wald 推断在各个版本中一致。我们在下面总结了跨版本的各个结果属性的比较。

参数估计 params、参数的标准误差 bsepvalues,用于检验参数为零的测试都一致。但是,似然和拟合优度统计量 llfdeviancepearson_chi2 仅部分一致。具体来说,聚合版本与使用原始数据的結果不一致。

警告: llfdeviancepearson_chi2 的行为可能在将来的版本中仍然会发生变化。

解释变量唯一值的响应变量的总和和平均值都具有适当的似然解释。但是,这种解释没有反映在这三个统计数据中。从计算上来说,这可能是由于使用聚合数据时缺少调整造成的。然而,从理论上讲,我们可以考虑这些情况,特别是对于 var_weights 的误判情况,此时似然分析不合适,结果应被解释为准似然估计。var_weights 的定义存在歧义,因为它们可以用于具有正确指定似然性的平均值,以及用于准似然情况下的方差调整。我们目前没有试图匹配似然规范。但是,在下一节中,我们将证明,当我们假设基础模型正确指定时,似然比类型检验仍然会为所有聚合版本产生相同的结果。

[21]:
results_all = [res_o, res_f, res_e, res_a]
names = "res_o res_f res_e res_a".split()
[22]:
pd.concat([r.params for r in results_all], axis=1, keys=names)
[22]:
res_o res_f res_e res_a
截距 2.715533 2.715533 2.715533 2.715533
rate_marriage -0.495180 -0.495180 -0.495180 -0.495180
age -0.029914 -0.029914 -0.029914 -0.029914
yrs_married -0.010763 -0.010763 -0.010763 -0.010763
[23]:
pd.concat([r.bse for r in results_all], axis=1, keys=names)
[23]:
res_o res_f res_e res_a
截距 0.107360 0.107360 0.107360 0.107360
rate_marriage 0.011874 0.011874 0.011874 0.011874
age 0.004471 0.004471 0.004471 0.004471
yrs_married 0.004294 0.004294 0.004294 0.004294
[24]:
pd.concat([r.pvalues for r in results_all], axis=1, keys=names)
[24]:
res_o res_f res_e res_a
截距 3.756282e-141 3.756280e-141 3.756282e-141 3.756282e-141
rate_marriage 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
age 2.221918e-11 2.221918e-11 2.221918e-11 2.221918e-11
yrs_married 1.219200e-02 1.219200e-02 1.219200e-02 1.219200e-02
[25]:
pd.DataFrame(
    np.column_stack([[r.llf, r.deviance, r.pearson_chi2] for r in results_all]),
    columns=names,
    index=["llf", "deviance", "pearson chi2"],
)
[25]:
res_o res_f res_e res_a
llf -10350.913296 -10350.913296 -740.748534 -5954.219866
deviance 15374.679054 15374.679054 967.455734 967.455734
皮尔逊卡方 32310.704118 32310.704118 926.199428 926.199428

似然比类型检验

我们在上面看到,似然和相关的统计数据在聚合数据和原始数据之间不一致。我们将在下面说明,似然比检验和偏差的差异在各个版本中是一致的,但是皮尔逊卡方则不一致。

和以前一样:这还不够清楚,可能会发生变化。

作为一个测试用例,我们删除 age 变量并计算似然比类型统计量,作为简化或约束模型与完整或无约束模型之间的差异。

原始观察结果和频率权重

[26]:
glm = smf.glm(
    "affairs ~ rate_marriage + yrs_married", data=data, family=sm.families.Poisson()
)
res_o2 = glm.fit()
# print(res_f2.summary())
res_o2.pearson_chi2 - res_o.pearson_chi2, res_o2.deviance - res_o.deviance, res_o2.llf - res_o.llf
[26]:
(np.float64(52.91343161907935),
 np.float64(45.726693322505525),
 np.float64(-22.863346661251853))
[27]:
glm = smf.glm(
    "affairs ~ rate_marriage + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    freq_weights=np.asarray(dc["freq"]),
)
res_f2 = glm.fit()
# print(res_f2.summary())
res_f2.pearson_chi2 - res_f.pearson_chi2, res_f2.deviance - res_f.deviance, res_f2.llf - res_f.llf
[27]:
(np.float64(52.913431618650066),
 np.float64(45.726693322507344),
 np.float64(-22.863346661253672))

聚合数据: exposurevar_weights

注意: LR 检验与原始观察结果一致,pearson_chi2 不同,符号错误。

[28]:
glm = smf.glm(
    "affairs_sum ~ rate_marriage + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    exposure=np.asarray(df_a["affairs_count"]),
)
res_e2 = glm.fit()
res_e2.pearson_chi2 - res_e.pearson_chi2, res_e2.deviance - res_e.deviance, res_e2.llf - res_e.llf
[28]:
(np.float64(-31.618527525103445),
 np.float64(45.72669332250598),
 np.float64(-22.863346661252763))
[29]:
glm = smf.glm(
    "affairs_mean ~ rate_marriage + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    var_weights=np.asarray(df_a["affairs_count"]),
)
res_a2 = glm.fit()
res_a2.pearson_chi2 - res_a.pearson_chi2, res_a2.deviance - res_a.deviance, res_a2.llf - res_a.llf
[29]:
(np.float64(-31.61852752510788),
 np.float64(45.72669332250621),
 np.float64(-22.863346661252763))

调查皮尔逊卡方统计量

首先,我们进行一些健全性检查,以确保 pearson_chi2resid_pearson 的计算中没有基本的错误。

[30]:
res_e2.pearson_chi2, res_e.pearson_chi2, (res_e2.resid_pearson ** 2).sum(), (
    res_e.resid_pearson ** 2
).sum()
[30]:
(np.float64(894.5809002315148),
 np.float64(926.1994277566182),
 np.float64(894.5809002315148),
 np.float64(926.199427756618))
[31]:
res_e._results.resid_response.mean(), res_e.model.family.variance(res_e.mu)[
    :5
], res_e.mu[:5]
[31]:
(np.float64(-2.815935518950797e-13),
 array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]),
 array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]))
[32]:
(res_e._results.resid_response ** 2 / res_e.model.family.variance(res_e.mu)).sum()
[32]:
np.float64(926.1994277566182)
[33]:
res_e2._results.resid_response.mean(), res_e2.model.family.variance(res_e2.mu)[
    :5
], res_e2.mu[:5]
[33]:
(np.float64(-2.361188168064333e-14),
 array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]),
 array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]))
[34]:
(res_e2._results.resid_response ** 2 / res_e2.model.family.variance(res_e2.mu)).sum()
[34]:
np.float64(894.5809002315148)
[35]:
(res_e2._results.resid_response ** 2).sum(), (res_e._results.resid_response ** 2).sum()
[35]:
(np.float64(51204.85737832321), np.float64(47104.64779595939))

符号错误的一个可能原因是我们在减去不同分母下的二次项。在一些相关情况下,文献中的建议是使用公分母。我们可以比较完整模型和简化模型中使用相同方差假设的皮尔逊卡方统计量。

在这种情况下,我们在所有版本中获得了简化模型和完整模型之间相同的皮尔逊卡方缩放差异。(问题 #3616 旨在进一步跟踪此问题。)

[36]:
(
    (res_e2._results.resid_response ** 2 - res_e._results.resid_response ** 2)
    / res_e2.model.family.variance(res_e2.mu)
).sum()
[36]:
np.float64(44.43314175121899)
[37]:
(
    (res_a2._results.resid_response ** 2 - res_a._results.resid_response ** 2)
    / res_a2.model.family.variance(res_a2.mu)
    * res_a2.model.var_weights
).sum()
[37]:
np.float64(44.43314175121923)
[38]:
(
    (res_f2._results.resid_response ** 2 - res_f._results.resid_response ** 2)
    / res_f2.model.family.variance(res_f2.mu)
    * res_f2.model.freq_weights
).sum()
[38]:
np.float64(44.43314175122017)
[39]:
(
    (res_o2._results.resid_response ** 2 - res_o._results.resid_response ** 2)
    / res_o2.model.family.variance(res_o2.mu)
).sum()
[39]:
np.float64(44.43314175121962)

剩余部分

笔记本的其余部分只包含一些额外的检查,可以忽略。

[40]:
np.exp(res_e2.model.exposure)[:5], np.asarray(df_a["affairs_count"])[:5]
[40]:
(array([ 1., 10.,  5., 10.,  3.]), array([ 1, 10,  5, 10,  3]))
[41]:
res_e2.resid_pearson.sum() - res_e.resid_pearson.sum()
[41]:
np.float64(-9.664817945858474)
[42]:
res_e2.mu[:5]
[42]:
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538])
[43]:
res_a2.pearson_chi2, res_a.pearson_chi2, res_a2.resid_pearson.sum(), res_a.resid_pearson.sum()
[43]:
(np.float64(894.5809002315161),
 np.float64(926.199427756624),
 np.float64(-42.34720713518717),
 np.float64(-32.68238918932538))
[44]:
(
    (res_a2._results.resid_response ** 2)
    / res_a2.model.family.variance(res_a2.mu)
    * res_a2.model.var_weights
).sum()
[44]:
np.float64(894.5809002315161)
[45]:
(
    (res_a._results.resid_response ** 2)
    / res_a.model.family.variance(res_a.mu)
    * res_a.model.var_weights
).sum()
[45]:
np.float64(926.199427756624)
[46]:
(
    (res_a._results.resid_response ** 2)
    / res_a.model.family.variance(res_a2.mu)
    * res_a.model.var_weights
).sum()
[46]:
np.float64(850.1477584802967)
[47]:
res_e.model.endog[:5], res_e2.model.endog[:5]
[47]:
(array([ 0., 39., 17.,  9.,  4.]), array([ 0., 39., 17.,  9.,  4.]))
[48]:
res_a.model.endog[:5], res_a2.model.endog[:5]
[48]:
(array([0.        , 3.9       , 3.4       , 0.9       , 1.33333333]),
 array([0.        , 3.9       , 3.4       , 0.9       , 1.33333333]))
[49]:
res_a2.model.endog[:5] * np.exp(res_e2.model.exposure)[:5]
[49]:
array([ 0., 39., 17.,  9.,  4.])
[50]:
res_a2.model.endog[:5] * res_a2.model.var_weights[:5]
[50]:
array([ 0., 39., 17.,  9.,  4.])
[51]:
from scipy import stats

stats.chi2.sf(27.19530754604785, 1), stats.chi2.sf(29.083798806764687, 1)
[51]:
(np.float64(1.8390448369994542e-07), np.float64(6.931421143170174e-08))
[52]:
res_o.pvalues
[52]:
Intercept        3.756282e-141
rate_marriage     0.000000e+00
age               2.221918e-11
yrs_married       1.219200e-02
dtype: float64
[53]:
print(res_e2.summary())
print(res_e.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      127
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -763.61
Date:                Thu, 03 Oct 2024   Deviance:                       1013.2
Time:                        15:45:08   Pearson chi2:                     895.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0754      0.050     41.512      0.000       1.977       2.173
rate_marriage    -0.4947      0.012    -41.743      0.000      -0.518      -0.471
yrs_married      -0.0360      0.002    -17.542      0.000      -0.040      -0.032
=================================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -740.75
Date:                Thu, 03 Oct 2024   Deviance:                       967.46
Time:                        15:45:08   Pearson chi2:                     926.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[54]:
print(res_f2.summary())
print(res_f.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6363
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10374.
Date:                Thu, 03 Oct 2024   Deviance:                       15420.
Time:                        15:45:08   Pearson chi2:                 3.24e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9729
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0754      0.050     41.512      0.000       1.977       2.173
rate_marriage    -0.4947      0.012    -41.743      0.000      -0.518      -0.471
yrs_married      -0.0360      0.002    -17.542      0.000      -0.040      -0.032
=================================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Thu, 03 Oct 2024   Deviance:                       15375.
Time:                        15:45:08   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

最后更新:2024年10月3日