加权广义线性模型¶
[1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
加权 GLM: 泊松响应数据¶
加载数据¶
在本示例中,我们将使用 affair 数据集,使用一些外生变量来预测婚外情率。
将生成权重以显示 freq_weights
等同于重复数据记录。另一方面,var_weights
等同于聚合数据。
[2]:
print(sm.datasets.fair.NOTE)
::
Number of observations: 6366
Number of variables: 9
Variable name definitions:
rate_marriage : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
4 = good, 5 = very good
age : Age
yrs_married : No. years married. Interval approximations. See
original paper for detailed explanation.
children : No. children
religious : How relgious, 1 = not, 2 = mildly, 3 = fairly,
4 = strongly
educ : Level of education, 9 = grade school, 12 = high
school, 14 = some college, 16 = college graduate,
17 = some graduate school, 20 = advanced degree
occupation : 1 = student, 2 = farming, agriculture; semi-skilled,
or unskilled worker; 3 = white-colloar; 4 = teacher
counselor social worker, nurse; artist, writers;
technician, skilled worker, 5 = managerial,
administrative, business, 6 = professional with
advanced degree
occupation_husb : Husband's occupation. Same as occupation.
affairs : measure of time spent in extramarital affairs
See the original paper for more details.
将数据加载到 pandas 数据帧中。
[3]:
data = sm.datasets.fair.load_pandas().data
因变量 (内生变量) 是 affairs
[4]:
data.describe()
[4]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
count | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 |
mean | 4.109645 | 29.082862 | 9.009425 | 1.396874 | 2.426170 | 14.209865 | 3.424128 | 3.850141 | 0.705374 |
std | 0.961430 | 6.847882 | 7.280120 | 1.433471 | 0.878369 | 2.178003 | 0.942399 | 1.346435 | 2.203374 |
min | 1.000000 | 17.500000 | 0.500000 | 0.000000 | 1.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 4.000000 | 22.000000 | 2.500000 | 0.000000 | 2.000000 | 12.000000 | 3.000000 | 3.000000 | 0.000000 |
50% | 4.000000 | 27.000000 | 6.000000 | 1.000000 | 2.000000 | 14.000000 | 3.000000 | 4.000000 | 0.000000 |
75% | 5.000000 | 32.000000 | 16.500000 | 2.000000 | 3.000000 | 16.000000 | 4.000000 | 5.000000 | 0.484848 |
max | 5.000000 | 42.000000 | 23.000000 | 5.500000 | 4.000000 | 20.000000 | 6.000000 | 6.000000 | 57.599991 |
[5]:
data[:3]
[5]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.400000 |
在下文中,我们将主要使用 Poisson。虽然使用十进制婚外情有效,但我们将它们转换为整数以获得计数分布。
[6]:
data["affairs"] = np.ceil(data["affairs"])
data[:3]
[6]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 1.0 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.0 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 2.0 |
[7]:
(data["affairs"] == 0).mean()
[7]:
np.float64(0.6775054979579014)
[8]:
np.bincount(data["affairs"].astype(int))
[8]:
array([4313, 934, 488, 180, 130, 172, 7, 21, 67, 2, 0,
0, 17, 0, 0, 0, 3, 12, 8, 0, 0, 0,
0, 0, 2, 2, 2, 3, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1])
压缩和聚合观察结果¶
我们原始数据集中有 6366 个观察结果。当我们只考虑一些选定的变量时,我们只有更少的唯一观察结果。在下文中,我们将观察结果结合起来,首先我们将所有变量的值相同的观察结果结合起来,其次我们将具有相同解释变量的观察结果结合起来。
具有唯一观察结果的数据集¶
我们使用 pandas 的 groupby 来结合相同的观察结果,并创建一个新的变量 freq
来计算有多少个观察结果具有对应行中的值。
[9]:
data2 = data.copy()
data2["const"] = 1
dc = (
data2["affairs rate_marriage age yrs_married const".split()]
.groupby("affairs rate_marriage age yrs_married".split())
.count()
)
dc.reset_index(inplace=True)
dc.rename(columns={"const": "freq"}, inplace=True)
print(dc.shape)
dc.head()
(476, 5)
[9]:
affairs | rate_marriage | age | yrs_married | freq | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 17.5 | 0.5 | 1 |
1 | 0.0 | 1.0 | 22.0 | 2.5 | 3 |
2 | 0.0 | 1.0 | 27.0 | 2.5 | 1 |
3 | 0.0 | 1.0 | 27.0 | 6.0 | 5 |
4 | 0.0 | 1.0 | 27.0 | 9.0 | 1 |
具有唯一解释变量 (exog) 的数据集¶
对于下一个数据集,我们将具有相同解释变量值的观察结果结合起来。但是,由于响应变量在组合的观察结果之间可能不同,因此我们将计算所有组合观察结果的响应变量的平均值和总和。
我们再次使用 pandas 的 groupby
来结合观察结果并创建新的变量。我们还将 MultiIndex
展平为简单索引。
[10]:
gr = data["affairs rate_marriage age yrs_married".split()].groupby(
"rate_marriage age yrs_married".split()
)
df_a = gr.agg(["mean", "sum", "count"])
def merge_tuple(tpl):
if isinstance(tpl, tuple) and len(tpl) > 1:
return "_".join(map(str, tpl))
else:
return tpl
df_a.columns = df_a.columns.map(merge_tuple)
df_a.reset_index(inplace=True)
print(df_a.shape)
df_a.head()
(130, 6)
[10]:
rate_marriage | age | yrs_married | affairs_mean | affairs_sum | affairs_count | |
---|---|---|---|---|---|---|
0 | 1.0 | 17.5 | 0.5 | 0.000000 | 0.0 | 1 |
1 | 1.0 | 22.0 | 2.5 | 3.900000 | 39.0 | 10 |
2 | 1.0 | 27.0 | 2.5 | 3.400000 | 17.0 | 5 |
3 | 1.0 | 27.0 | 6.0 | 0.900000 | 9.0 | 10 |
4 | 1.0 | 27.0 | 9.0 | 1.333333 | 4.0 | 3 |
结合观察结果后,我们得到一个数据帧 dc
,其中包含 467 个唯一观察结果,以及一个数据帧 df_a
,其中包含 130 个具有解释变量唯一值的观察结果。
[11]:
print("number of rows: \noriginal, with unique observations, with unique exog")
data.shape[0], dc.shape[0], df_a.shape[0]
number of rows:
original, with unique observations, with unique exog
[11]:
(6366, 476, 130)
分析¶
在下文中,我们将比较原始数据的 GLM-Poisson 结果与组合观察结果的模型,其中多重性或聚合由权重或曝光给出。
原始数据¶
[12]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=data,
family=sm.families.Poisson(),
)
res_o = glm.fit()
print(res_o.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 6366
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:07 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.2420
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[13]:
res_o.pearson_chi2 / res_o.df_resid
[13]:
np.float64(5.078702313363214)
压缩数据 (具有频率的唯一观察结果)¶
结合相同的观察结果并使用频率权重来考虑观察结果的多重性会产生完全相同的结果。当我们想要了解观察结果而不是所有相同观察结果的聚合时,一些结果属性会不同。例如,残差不考虑 freq_weights
。
[14]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=dc,
family=sm.families.Poisson(),
freq_weights=np.asarray(dc["freq"]),
)
res_f = glm.fit()
print(res_f.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:07 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[15]:
res_f.pearson_chi2 / res_f.df_resid
[15]:
np.float64(5.078702313363196)
使用 var_weights
而不是 freq_weights
进行压缩¶
接下来,我们将比较 var_weights
和 freq_weights
。当内生变量反映平均值而不是相同的观察结果时,使用 var_weights
是常见的做法。我认为没有理论上的理由可以解释为什么它会产生相同的结果(一般来说)。
这会产生相同的结果,但 df_resid
与 freq_weights
示例不同,因为 var_weights
不会改变有效观察结果的数量。
[16]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=dc,
family=sm.families.Poisson(),
var_weights=np.asarray(dc["freq"]),
)
res_fv = glm.fit()
print(res_fv.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 472
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:07 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
从结果计算出的离散度是不正确的,因为 df_resid
错误。如果我们使用原始 df_resid
,它是正确的。
[17]:
res_fv.pearson_chi2 / res_fv.df_resid, res_f.pearson_chi2 / res_f.df_resid
[17]:
(np.float64(68.45488160512002), np.float64(5.078702313363196))
聚合或平均数据 (解释变量的唯一值)¶
对于这些情况,我们将具有相同解释变量值的观察结果结合起来。相应的响应变量是所有组合观察结果的总和或平均值。
使用 exposure
¶
如果我们的因变量是所有组合观察结果的响应之和,那么在 Poisson 假设下,分布保持不变,但我们有不同的 exposure
,它由一个聚合观察结果所代表的个体数量给出。
参数估计和参数协方差与原始数据相同,但对数似然、偏差和皮尔逊卡方不同
[18]:
glm = smf.glm(
"affairs_sum ~ rate_marriage + age + yrs_married",
data=df_a,
family=sm.families.Poisson(),
exposure=np.asarray(df_a["affairs_count"]),
)
res_e = glm.fit()
print(res_e.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -740.75
Date: Thu, 03 Oct 2024 Deviance: 967.46
Time: 15:45:07 Pearson chi2: 926.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[19]:
res_e.pearson_chi2 / res_e.df_resid
[19]:
np.float64(7.35078910917951)
使用 var_weights¶
我们还可以使用所有组合观察结果的因变量的平均值。在这种情况下,方差将与一个组合观察结果所反映的总曝光的倒数相关。
[20]:
glm = smf.glm(
"affairs_mean ~ rate_marriage + age + yrs_married",
data=df_a,
family=sm.families.Poisson(),
var_weights=np.asarray(df_a["affairs_count"]),
)
res_a = glm.fit()
print(res_a.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_mean No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -5954.2
Date: Thu, 03 Oct 2024 Deviance: 967.46
Time: 15:45:07 Pearson chi2: 926.
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
比较¶
我们在上面的摘要打印中看到,params
和 cov_params
与相关的 Wald 推断在各个版本中一致。我们在下面总结了跨版本的各个结果属性的比较。
参数估计 params
、参数的标准误差 bse
和 pvalues
,用于检验参数为零的测试都一致。但是,似然和拟合优度统计量 llf
、deviance
和 pearson_chi2
仅部分一致。具体来说,聚合版本与使用原始数据的結果不一致。
警告: llf
、deviance
和 pearson_chi2
的行为可能在将来的版本中仍然会发生变化。
解释变量唯一值的响应变量的总和和平均值都具有适当的似然解释。但是,这种解释没有反映在这三个统计数据中。从计算上来说,这可能是由于使用聚合数据时缺少调整造成的。然而,从理论上讲,我们可以考虑这些情况,特别是对于 var_weights
的误判情况,此时似然分析不合适,结果应被解释为准似然估计。var_weights
的定义存在歧义,因为它们可以用于具有正确指定似然性的平均值,以及用于准似然情况下的方差调整。我们目前没有试图匹配似然规范。但是,在下一节中,我们将证明,当我们假设基础模型正确指定时,似然比类型检验仍然会为所有聚合版本产生相同的结果。
[21]:
results_all = [res_o, res_f, res_e, res_a]
names = "res_o res_f res_e res_a".split()
[22]:
pd.concat([r.params for r in results_all], axis=1, keys=names)
[22]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
截距 | 2.715533 | 2.715533 | 2.715533 | 2.715533 |
rate_marriage | -0.495180 | -0.495180 | -0.495180 | -0.495180 |
age | -0.029914 | -0.029914 | -0.029914 | -0.029914 |
yrs_married | -0.010763 | -0.010763 | -0.010763 | -0.010763 |
[23]:
pd.concat([r.bse for r in results_all], axis=1, keys=names)
[23]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
截距 | 0.107360 | 0.107360 | 0.107360 | 0.107360 |
rate_marriage | 0.011874 | 0.011874 | 0.011874 | 0.011874 |
age | 0.004471 | 0.004471 | 0.004471 | 0.004471 |
yrs_married | 0.004294 | 0.004294 | 0.004294 | 0.004294 |
[24]:
pd.concat([r.pvalues for r in results_all], axis=1, keys=names)
[24]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
截距 | 3.756282e-141 | 3.756280e-141 | 3.756282e-141 | 3.756282e-141 |
rate_marriage | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
age | 2.221918e-11 | 2.221918e-11 | 2.221918e-11 | 2.221918e-11 |
yrs_married | 1.219200e-02 | 1.219200e-02 | 1.219200e-02 | 1.219200e-02 |
[25]:
pd.DataFrame(
np.column_stack([[r.llf, r.deviance, r.pearson_chi2] for r in results_all]),
columns=names,
index=["llf", "deviance", "pearson chi2"],
)
[25]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
llf | -10350.913296 | -10350.913296 | -740.748534 | -5954.219866 |
deviance | 15374.679054 | 15374.679054 | 967.455734 | 967.455734 |
皮尔逊卡方 | 32310.704118 | 32310.704118 | 926.199428 | 926.199428 |
似然比类型检验¶
我们在上面看到,似然和相关的统计数据在聚合数据和原始数据之间不一致。我们将在下面说明,似然比检验和偏差的差异在各个版本中是一致的,但是皮尔逊卡方则不一致。
和以前一样:这还不够清楚,可能会发生变化。
作为一个测试用例,我们删除 age
变量并计算似然比类型统计量,作为简化或约束模型与完整或无约束模型之间的差异。
原始观察结果和频率权重¶
[26]:
glm = smf.glm(
"affairs ~ rate_marriage + yrs_married", data=data, family=sm.families.Poisson()
)
res_o2 = glm.fit()
# print(res_f2.summary())
res_o2.pearson_chi2 - res_o.pearson_chi2, res_o2.deviance - res_o.deviance, res_o2.llf - res_o.llf
[26]:
(np.float64(52.91343161907935),
np.float64(45.726693322505525),
np.float64(-22.863346661251853))
[27]:
glm = smf.glm(
"affairs ~ rate_marriage + yrs_married",
data=dc,
family=sm.families.Poisson(),
freq_weights=np.asarray(dc["freq"]),
)
res_f2 = glm.fit()
# print(res_f2.summary())
res_f2.pearson_chi2 - res_f.pearson_chi2, res_f2.deviance - res_f.deviance, res_f2.llf - res_f.llf
[27]:
(np.float64(52.913431618650066),
np.float64(45.726693322507344),
np.float64(-22.863346661253672))
聚合数据: exposure
和 var_weights
¶
注意: LR 检验与原始观察结果一致,pearson_chi2
不同,符号错误。
[28]:
glm = smf.glm(
"affairs_sum ~ rate_marriage + yrs_married",
data=df_a,
family=sm.families.Poisson(),
exposure=np.asarray(df_a["affairs_count"]),
)
res_e2 = glm.fit()
res_e2.pearson_chi2 - res_e.pearson_chi2, res_e2.deviance - res_e.deviance, res_e2.llf - res_e.llf
[28]:
(np.float64(-31.618527525103445),
np.float64(45.72669332250598),
np.float64(-22.863346661252763))
[29]:
glm = smf.glm(
"affairs_mean ~ rate_marriage + yrs_married",
data=df_a,
family=sm.families.Poisson(),
var_weights=np.asarray(df_a["affairs_count"]),
)
res_a2 = glm.fit()
res_a2.pearson_chi2 - res_a.pearson_chi2, res_a2.deviance - res_a.deviance, res_a2.llf - res_a.llf
[29]:
(np.float64(-31.61852752510788),
np.float64(45.72669332250621),
np.float64(-22.863346661252763))
调查皮尔逊卡方统计量¶
首先,我们进行一些健全性检查,以确保 pearson_chi2
和 resid_pearson
的计算中没有基本的错误。
[30]:
res_e2.pearson_chi2, res_e.pearson_chi2, (res_e2.resid_pearson ** 2).sum(), (
res_e.resid_pearson ** 2
).sum()
[30]:
(np.float64(894.5809002315148),
np.float64(926.1994277566182),
np.float64(894.5809002315148),
np.float64(926.199427756618))
[31]:
res_e._results.resid_response.mean(), res_e.model.family.variance(res_e.mu)[
:5
], res_e.mu[:5]
[31]:
(np.float64(-2.815935518950797e-13),
array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]),
array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]))
[32]:
(res_e._results.resid_response ** 2 / res_e.model.family.variance(res_e.mu)).sum()
[32]:
np.float64(926.1994277566182)
[33]:
res_e2._results.resid_response.mean(), res_e2.model.family.variance(res_e2.mu)[
:5
], res_e2.mu[:5]
[33]:
(np.float64(-2.361188168064333e-14),
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]),
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]))
[34]:
(res_e2._results.resid_response ** 2 / res_e2.model.family.variance(res_e2.mu)).sum()
[34]:
np.float64(894.5809002315148)
[35]:
(res_e2._results.resid_response ** 2).sum(), (res_e._results.resid_response ** 2).sum()
[35]:
(np.float64(51204.85737832321), np.float64(47104.64779595939))
符号错误的一个可能原因是我们在减去不同分母下的二次项。在一些相关情况下,文献中的建议是使用公分母。我们可以比较完整模型和简化模型中使用相同方差假设的皮尔逊卡方统计量。
在这种情况下,我们在所有版本中获得了简化模型和完整模型之间相同的皮尔逊卡方缩放差异。(问题 #3616 旨在进一步跟踪此问题。)
[36]:
(
(res_e2._results.resid_response ** 2 - res_e._results.resid_response ** 2)
/ res_e2.model.family.variance(res_e2.mu)
).sum()
[36]:
np.float64(44.43314175121899)
[37]:
(
(res_a2._results.resid_response ** 2 - res_a._results.resid_response ** 2)
/ res_a2.model.family.variance(res_a2.mu)
* res_a2.model.var_weights
).sum()
[37]:
np.float64(44.43314175121923)
[38]:
(
(res_f2._results.resid_response ** 2 - res_f._results.resid_response ** 2)
/ res_f2.model.family.variance(res_f2.mu)
* res_f2.model.freq_weights
).sum()
[38]:
np.float64(44.43314175122017)
[39]:
(
(res_o2._results.resid_response ** 2 - res_o._results.resid_response ** 2)
/ res_o2.model.family.variance(res_o2.mu)
).sum()
[39]:
np.float64(44.43314175121962)
剩余部分¶
笔记本的其余部分只包含一些额外的检查,可以忽略。
[40]:
np.exp(res_e2.model.exposure)[:5], np.asarray(df_a["affairs_count"])[:5]
[40]:
(array([ 1., 10., 5., 10., 3.]), array([ 1, 10, 5, 10, 3]))
[41]:
res_e2.resid_pearson.sum() - res_e.resid_pearson.sum()
[41]:
np.float64(-9.664817945858474)
[42]:
res_e2.mu[:5]
[42]:
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538])
[43]:
res_a2.pearson_chi2, res_a.pearson_chi2, res_a2.resid_pearson.sum(), res_a.resid_pearson.sum()
[43]:
(np.float64(894.5809002315161),
np.float64(926.199427756624),
np.float64(-42.34720713518717),
np.float64(-32.68238918932538))
[44]:
(
(res_a2._results.resid_response ** 2)
/ res_a2.model.family.variance(res_a2.mu)
* res_a2.model.var_weights
).sum()
[44]:
np.float64(894.5809002315161)
[45]:
(
(res_a._results.resid_response ** 2)
/ res_a.model.family.variance(res_a.mu)
* res_a.model.var_weights
).sum()
[45]:
np.float64(926.199427756624)
[46]:
(
(res_a._results.resid_response ** 2)
/ res_a.model.family.variance(res_a2.mu)
* res_a.model.var_weights
).sum()
[46]:
np.float64(850.1477584802967)
[47]:
res_e.model.endog[:5], res_e2.model.endog[:5]
[47]:
(array([ 0., 39., 17., 9., 4.]), array([ 0., 39., 17., 9., 4.]))
[48]:
res_a.model.endog[:5], res_a2.model.endog[:5]
[48]:
(array([0. , 3.9 , 3.4 , 0.9 , 1.33333333]),
array([0. , 3.9 , 3.4 , 0.9 , 1.33333333]))
[49]:
res_a2.model.endog[:5] * np.exp(res_e2.model.exposure)[:5]
[49]:
array([ 0., 39., 17., 9., 4.])
[50]:
res_a2.model.endog[:5] * res_a2.model.var_weights[:5]
[50]:
array([ 0., 39., 17., 9., 4.])
[51]:
from scipy import stats
stats.chi2.sf(27.19530754604785, 1), stats.chi2.sf(29.083798806764687, 1)
[51]:
(np.float64(1.8390448369994542e-07), np.float64(6.931421143170174e-08))
[52]:
res_o.pvalues
[52]:
Intercept 3.756282e-141
rate_marriage 0.000000e+00
age 2.221918e-11
yrs_married 1.219200e-02
dtype: float64
[53]:
print(res_e2.summary())
print(res_e.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 127
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -763.61
Date: Thu, 03 Oct 2024 Deviance: 1013.2
Time: 15:45:08 Pearson chi2: 895.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.0754 0.050 41.512 0.000 1.977 2.173
rate_marriage -0.4947 0.012 -41.743 0.000 -0.518 -0.471
yrs_married -0.0360 0.002 -17.542 0.000 -0.040 -0.032
=================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -740.75
Date: Thu, 03 Oct 2024 Deviance: 967.46
Time: 15:45:08 Pearson chi2: 926.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[54]:
print(res_f2.summary())
print(res_f.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6363
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10374.
Date: Thu, 03 Oct 2024 Deviance: 15420.
Time: 15:45:08 Pearson chi2: 3.24e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9729
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.0754 0.050 41.512 0.000 1.977 2.173
rate_marriage -0.4947 0.012 -41.743 0.000 -0.518 -0.471
yrs_married -0.0360 0.002 -17.542 0.000 -0.040 -0.032
=================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Thu, 03 Oct 2024 Deviance: 15375.
Time: 15:45:08 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================