方差分量分析

此笔记本演示了针对两级嵌套和交叉设计的方差分量分析。

[1]:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.mixed_linear_model import VCSpec
import pandas as pd

使笔记本可重复

[2]:
np.random.seed(3123)

嵌套分析

在下面的讨论中,“组 2”嵌套在“组 1”中。作为一个具体的例子,“组 1”可能是学区,“组 2”可能是单个学校。下面的函数从这样的总体生成数据。在嵌套分析中,嵌套在不同组 1 标签中的组 2 标签被视为独立组,即使它们具有相同的标签。例如,两个被标记为“学校 1”的学校位于两个不同的学区,即使它们具有相同的标签,也都被视为独立的学校。

[3]:
def generate_nested(
    n_group1=200, n_group2=20, n_rep=10, group1_sd=2, group2_sd=3, unexplained_sd=4
):

    # Group 1 indicators
    group1 = np.kron(np.arange(n_group1), np.ones(n_group2 * n_rep))

    # Group 1 effects
    u = group1_sd * np.random.normal(size=n_group1)
    effects1 = np.kron(u, np.ones(n_group2 * n_rep))

    # Group 2 indicators
    group2 = np.kron(np.ones(n_group1), np.kron(np.arange(n_group2), np.ones(n_rep)))

    # Group 2 effects
    u = group2_sd * np.random.normal(size=n_group1 * n_group2)
    effects2 = np.kron(u, np.ones(n_rep))

    e = unexplained_sd * np.random.normal(size=n_group1 * n_group2 * n_rep)
    y = effects1 + effects2 + e

    df = pd.DataFrame({"y": y, "group1": group1, "group2": group2})

    return df

生成要分析的数据集。

[4]:
df = generate_nested()

使用 generate_nested 的所有默认参数,“组 1 方差”和“组 2 方差”的总体值分别为 2^2=4 和 3^2=9。未解释的方差,在摘要表顶部列为“scale”,其总体值为 4^2=16。

[5]:
model1 = sm.MixedLM.from_formula(
    "y ~ 1",
    re_formula="1",
    vc_formula={"group2": "0 + C(group2)"},
    groups="group1",
    data=df,
)
result1 = model1.fit()
print(result1.summary())
          Mixed Linear Model Regression Results
==========================================================
Model:            MixedLM Dependent Variable: y
No. Observations: 40000   Method:             REML
No. Groups:       200     Scale:              15.8825
Min. group size:  200     Log-Likelihood:     -116022.3805
Max. group size:  200     Converged:          Yes
Mean group size:  200.0
-----------------------------------------------------------
            Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
-----------------------------------------------------------
Intercept   -0.035     0.149  -0.232  0.817  -0.326   0.257
group1 Var   3.917     0.112
group2 Var   8.742     0.063
==========================================================

如果我们希望避免使用公式接口,我们可以通过手动构建设计矩阵来拟合相同的模型。

[6]:
def f(x):
    n = x.shape[0]
    g2 = x.group2
    u = g2.unique()
    u.sort()
    uv = {v: k for k, v in enumerate(u)}
    mat = np.zeros((n, len(u)))
    for i in range(n):
        mat[i, uv[g2.iloc[i]]] = 1
    colnames = ["%d" % z for z in u]
    return mat, colnames

然后我们使用 VCSpec 类设置方差分量。

[7]:
vcm = df.groupby("group1").apply(f).to_list()
mats = [x[0] for x in vcm]
colnames = [x[1] for x in vcm]
names = ["group2"]
vcs = VCSpec(names, [colnames], [mats])
/tmp/ipykernel_4815/1119967950.py:1: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  vcm = df.groupby("group1").apply(f).to_list()

最后,我们拟合模型。可以看出,两个拟合的结果是相同的。

[8]:
oo = np.ones(df.shape[0])
model2 = sm.MixedLM(df.y, oo, exog_re=oo, groups=df.group1, exog_vc=vcs)
result2 = model2.fit()
print(result2.summary())
          Mixed Linear Model Regression Results
==========================================================
Model:            MixedLM Dependent Variable: y
No. Observations: 40000   Method:             REML
No. Groups:       200     Scale:              15.8825
Min. group size:  200     Log-Likelihood:     -116022.3805
Max. group size:  200     Converged:          Yes
Mean group size:  200.0
-----------------------------------------------------------
            Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
-----------------------------------------------------------
const       -0.035     0.149  -0.232  0.817  -0.326   0.257
x_re1 Var    3.917     0.112
group2 Var   8.742     0.063
==========================================================

交叉分析

在交叉分析中,一组的水平可以与另一组的水平以任何组合出现。Statsmodels MixedLM 中的组始终是嵌套的,但可以通过仅使用一组并指定所有随机效应作为方差分量来拟合交叉模型。许多(但不是全部)交叉模型可以通过这种方式拟合。下面的函数生成具有两级随机结构的交叉数据集。

[9]:
def generate_crossed(
    n_group1=100, n_group2=100, n_rep=4, group1_sd=2, group2_sd=3, unexplained_sd=4
):

    # Group 1 indicators
    group1 = np.kron(
        np.arange(n_group1, dtype=int), np.ones(n_group2 * n_rep, dtype=int)
    )
    group1 = group1[np.random.permutation(len(group1))]

    # Group 1 effects
    u = group1_sd * np.random.normal(size=n_group1)
    effects1 = u[group1]

    # Group 2 indicators
    group2 = np.kron(
        np.arange(n_group2, dtype=int), np.ones(n_group2 * n_rep, dtype=int)
    )
    group2 = group2[np.random.permutation(len(group2))]

    # Group 2 effects
    u = group2_sd * np.random.normal(size=n_group2)
    effects2 = u[group2]

    e = unexplained_sd * np.random.normal(size=n_group1 * n_group2 * n_rep)
    y = effects1 + effects2 + e

    df = pd.DataFrame({"y": y, "group1": group1, "group2": group2})

    return df

生成要分析的数据集。

[10]:
df = generate_crossed()

接下来,我们拟合模型,注意 groups 向量是常数。使用 generate_crossed 的默认参数,级别 1 方差应为 2^2=4,级别 2 方差应为 3^2=9,未解释的方差应为 4^2=16。

[11]:
vc = {"g1": "0 + C(group1)", "g2": "0 + C(group2)"}
oo = np.ones(df.shape[0])
model3 = sm.MixedLM.from_formula("y ~ 1", groups=oo, vc_formula=vc, data=df)
result3 = model3.fit()
print(result3.summary())
          Mixed Linear Model Regression Results
==========================================================
Model:            MixedLM Dependent Variable: y
No. Observations: 40000   Method:             REML
No. Groups:       1       Scale:              15.9824
Min. group size:  40000   Log-Likelihood:     -112684.9688
Max. group size:  40000   Converged:          Yes
Mean group size:  40000.0
-----------------------------------------------------------
            Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
-----------------------------------------------------------
Intercept   -0.251     0.353  -0.710  0.478  -0.943   0.442
g1 Var       4.282     0.154
g2 Var       8.150     0.291
==========================================================

如果我们希望避免使用公式接口,我们可以通过手动构建设计矩阵来拟合相同的模型。

[12]:
def f(g):
    n = len(g)
    u = g.unique()
    u.sort()
    uv = {v: k for k, v in enumerate(u)}
    mat = np.zeros((n, len(u)))
    for i in range(n):
        mat[i, uv[g[i]]] = 1
    colnames = ["%d" % z for z in u]
    return [mat], [colnames]


vcm = [f(df.group1), f(df.group2)]
mats = [x[0] for x in vcm]
colnames = [x[1] for x in vcm]
names = ["group1", "group2"]
vcs = VCSpec(names, colnames, mats)

这里我们拟合模型,不使用公式,很容易检查模型 3 和 4 的结果是否相同。

[13]:
oo = np.ones(df.shape[0])
model4 = sm.MixedLM(df.y, oo[:, None], exog_re=None, groups=oo, exog_vc=vcs)
result4 = model4.fit()
print(result4.summary())
          Mixed Linear Model Regression Results
==========================================================
Model:            MixedLM Dependent Variable: y
No. Observations: 40000   Method:             REML
No. Groups:       1       Scale:              15.9824
Min. group size:  40000   Log-Likelihood:     -112684.9688
Max. group size:  40000   Converged:          Yes
Mean group size:  40000.0
-----------------------------------------------------------
            Coef.   Std.Err.    z     P>|z|  [0.025  0.975]
-----------------------------------------------------------
const       -0.251     0.353  -0.710  0.478  -0.943   0.442
group1 Var   4.282     0.154
group2 Var   8.150     0.291
==========================================================


上次更新:2024 年 10 月 3 日