Patsy: Contrast Coding Systems for categorical variables

Note

This document is based heavily on this excellent resource from UCLA.

A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, other coding methods that amount to different sets of linear hypotheses.

In fact, dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of contrasts for a categorical variable with k levels is a set of k-1 functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. Dummy coding is not wrong per se. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients, as in ANOVA. Linear regression models do not assume independence of the coefficients, and thus dummy coding is often the only coding taught in that context.
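
To make the collinearity concrete, here is a minimal numpy sketch (using a toy factor rather than the data below) showing that the full set of K dummy columns sums to the intercept's column of ones, which is why one level must be dropped:

```python
import numpy as np

levels = np.array([1, 2, 3, 4])
x = np.tile(levels, 5)                                   # toy factor with 4 levels
dummies = (x[:, None] == levels[None, :]).astype(float)  # all K dummy columns

# each row contains exactly one 1, so the dummy columns sum to the intercept column
print(np.allclose(dummies.sum(axis=1), 1.0))             # True

# adding an intercept therefore does not raise the rank beyond K
X = np.column_stack([np.ones(len(x)), dummies])
print(np.linalg.matrix_rank(X))                          # 4
```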

To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data.

Example Data

In [1]: import pandas
   ....: import numpy as np

In [2]: url = 'https://stats.idre.ucla.edu/stat/data/hsb2.csv'

In [3]: hsb2 = pandas.read_csv(url)

It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American and 4 = Caucasian).

In [4]: hsb2.groupby('race')['write'].mean()
Out[4]: 
race
1    46.458333
2    58.000000
3    48.200000
4    54.055172
Name: write, dtype: float64

Treatment (Dummy) Coding

Dummy coding is likely the most well known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be

In [5]: from patsy.contrasts import Treatment

In [6]: levels = [1,2,3,4]

In [7]: contrast = Treatment(reference=0).code_without_intercept(levels)

In [8]: print(contrast.matrix)
[[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Here we used reference=0, which implies that the first level, Hispanic, is the reference category against which the effects of the other levels are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept. To be explicit, let's look at how this would encode the race variable.

In [9]: contrast.matrix[hsb2.race-1, :][:20]
Out[9]: 
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

This is a bit of a trick, as the race category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this will not work in general but is nonetheless a useful exercise to fix ideas. The following illustrates the output using the three contrast columns above

In [10]: from statsmodels.formula.api import ols

In [11]: mod = ols("write ~ C(race, Treatment)", data=hsb2)

In [12]: res = mod.fit()

In [13]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        16:08:41   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  46.4583      1.842     25.218      0.000      42.825      50.091
C(race, Treatment)[T.2]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Treatment)[T.3]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Treatment)[T.4]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.25
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted it.
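
The interpretation of the treatment coefficients can also be checked directly. Below is a small sketch with synthetic, balanced data (so it does not require downloading hsb2): fitting a treatment-coded design by least squares recovers the reference-level mean as the intercept and the differences from that mean as the slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.repeat([0, 1, 2, 3], 25)                 # synthetic 4-level factor
y = rng.normal(50, 10, size=g.size)             # synthetic scores

# treatment coding: intercept plus one dummy per non-reference level
X = np.column_stack([np.ones(g.size)] + [(g == k).astype(float) for k in (1, 2, 3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([y[g == k].mean() for k in range(4)])
print(np.allclose(beta[0], means[0]))               # intercept = reference-level mean
print(np.allclose(beta[1:], means[1:] - means[0]))  # slopes = differences from it
```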

Simple Coding

Like Treatment Coding, Simple Coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factor. See User-Defined Coding below for how to implement the Simple contrast.

In [14]: contrast = Simple().code_without_intercept(levels)

In [15]: print(contrast.matrix)
[[-0.25 -0.25 -0.25]
 [ 0.75 -0.25 -0.25]
 [-0.25  0.75 -0.25]
 [-0.25 -0.25  0.75]]

In [16]: mod = ols("write ~ C(race, Simple)", data=hsb2)

In [17]: res = mod.fit()

In [18]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        16:08:41   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Simple)[Simp.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Simple)[Simp.2]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Simple)[Simp.3]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
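
The claim that the intercept becomes the grand mean under simple coding can be verified numerically. A minimal sketch with synthetic data, hard-coding the simple contrast matrix printed above:

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.repeat([0, 1, 2, 3], 25)                 # synthetic 4-level factor
y = rng.normal(50, 10, size=g.size)

# the simple contrast matrix shown above for K=4
S = np.full((4, 3), -0.25)
S[np.arange(1, 4), np.arange(3)] = 0.75

X = np.column_stack([np.ones(g.size), S[g]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([y[g == k].mean() for k in range(4)])
print(np.allclose(beta[0], means.mean()))           # intercept = grand mean of level means
print(np.allclose(beta[1:], means[1:] - means[0]))  # slopes still compare to level 1
```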

Sum (Deviation) Coding

Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.

In [19]: from patsy.contrasts import Sum

In [20]: contrast = Sum().code_without_intercept(levels)

In [21]: print(contrast.matrix)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [-1. -1. -1.]]

In [22]: mod = ols("write ~ C(race, Sum)", data=hsb2)

In [23]: res = mod.fit()

In [24]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        16:08:41   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Sum)[S.1]    -5.2200      1.631     -3.200      0.002      -8.437      -2.003
C(race, Sum)[S.2]     6.3216      2.160      2.926      0.004       2.061      10.582
C(race, Sum)[S.3]    -3.4784      1.732     -2.008      0.046      -6.895      -0.062
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         6.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, where the grand mean is the mean of the means of the dependent variable at each level.

In [25]: hsb2.groupby('race')['write'].mean().mean()
Out[25]: np.float64(51.67837643678162)
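
A sketch with synthetic data confirms both properties: the intercept is the grand mean of the level means, and each coefficient is the deviation of its level mean from that grand mean, with the implied level-4 coefficient completing the sum-to-zero constraint.

```python
import numpy as np

rng = np.random.default_rng(2)
g = np.repeat([0, 1, 2, 3], 25)                 # synthetic 4-level factor
y = rng.normal(50, 10, size=g.size)

# the Sum contrast matrix printed above
C = np.vstack([np.eye(3), -np.ones(3)])

X = np.column_stack([np.ones(g.size), C[g]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([y[g == k].mean() for k in range(4)])
print(np.allclose(beta[0], means.mean()))               # intercept = grand mean
print(np.allclose(beta[1:], means[:3] - means.mean()))  # deviations from grand mean
# the implied level-4 coefficient is -(beta[1] + beta[2] + beta[3])
```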

Backward Difference Coding

In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.

In [26]: from patsy.contrasts import Diff

In [27]: contrast = Diff().code_without_intercept(levels)

In [28]: print(contrast.matrix)
[[-0.75 -0.5  -0.25]
 [ 0.25 -0.5  -0.25]
 [ 0.25  0.5  -0.25]
 [ 0.25  0.5   0.75]]

In [29]: mod = ols("write ~ C(race, Diff)", data=hsb2)

In [30]: res = mod.fit()

In [31]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        16:08:41   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Diff)[D.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Diff)[D.2]    -9.8000      3.388     -2.893      0.004     -16.481      -3.119
C(race, Diff)[D.3]     5.8552      2.153      2.720      0.007       1.610      10.101
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.30
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For example, here the coefficient on level 1 is the mean of write at level 2 compared with the mean at level 1. That is,

In [32]: res.params["C(race, Diff)[D.1]"]
Out[32]: np.float64(11.541666666666575)

In [33]: hsb2.groupby('race').mean()["write"].loc[2] - \
   ....:     hsb2.groupby('race').mean()["write"].loc[1]
   ....: 
Out[33]: np.float64(11.541666666666664)
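
The same successive-difference interpretation can be checked end to end on synthetic data. The sketch below rebuilds the backward-difference contrast matrix shown above for k levels and verifies that each fitted coefficient equals the mean of its level minus the mean of the prior level.

```python
import numpy as np

rng = np.random.default_rng(4)
g = np.repeat([0, 1, 2, 3], 25)                 # synthetic balanced factor
y = rng.normal(50, 10, size=g.size)

# rebuild the backward-difference contrast matrix shown above for k levels
k = 4
D = np.zeros((k, k - 1))
for j in range(k - 1):
    D[:j + 1, j] = -(k - 1 - j) / k             # rows up to and including level j+1
    D[j + 1:, j] = (j + 1) / k                  # remaining rows
print(D)                                        # matches the matrix printed by patsy

X = np.column_stack([np.ones(g.size), D[g]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([y[g == lev].mean() for lev in range(4)])
print(np.allclose(beta[1:], np.diff(means)))    # each slope = mean minus prior mean
```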

Helmert Coding

Our version of Helmert coding is sometimes referred to as Reverse Helmert Coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence the name "reverse", which is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we will use the Helmert contrast like so

In [34]: from patsy.contrasts import Helmert

In [35]: contrast = Helmert().code_without_intercept(levels)

In [36]: print(contrast.matrix)
[[-1. -1. -1.]
 [ 1. -1. -1.]
 [ 0.  2. -1.]
 [ 0.  0.  3.]]

In [37]: mod = ols("write ~ C(race, Helmert)", data=hsb2)

In [38]: res = mod.fit()

In [39]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        16:08:41   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Helmert)[H.2]     5.7708      1.643      3.512      0.001       2.530       9.011
C(race, Helmert)[H.3]    -1.3431      0.867     -1.548      0.123      -3.054       0.368
C(race, Helmert)[H.4]     0.7923      0.372      2.130      0.034       0.059       1.526
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.26
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To illustrate, the comparison on level 4 is the mean of the dependent variable for the previous three levels taken from the mean at level 4

In [40]: grouped = hsb2.groupby('race')

In [41]: grouped.mean()["write"].loc[4] - grouped.mean()["write"].loc[:3].mean()
Out[41]: np.float64(3.169061302681982)

As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.

In [42]: k = 4

In [43]: 1./k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[:k-1].mean())
Out[43]: np.float64(0.7922653256704955)

In [44]: k = 3

In [45]: 1./k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[:k-1].mean())
Out[45]: np.float64(-1.3430555555555561)
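
The 1/k scaling generalizes. A sketch with synthetic balanced data, hard-coding the Helmert matrix printed above, shows that the coefficient for level k is 1/k times the level-k mean minus the mean of all previous level means:

```python
import numpy as np

rng = np.random.default_rng(3)
g = np.repeat([0, 1, 2, 3], 25)                 # synthetic balanced factor
y = rng.normal(50, 10, size=g.size)

# the (reverse) Helmert contrast matrix printed above
H = np.array([[-1., -1., -1.],
              [ 1., -1., -1.],
              [ 0.,  2., -1.],
              [ 0.,  0.,  3.]])

X = np.column_stack([np.ones(g.size), H[g]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([y[g == lev].mean() for lev in range(4)])
for k in (2, 3, 4):
    expected = (means[k - 1] - means[:k - 1].mean()) / k
    print(np.allclose(beta[k - 1], expected))   # True for each level
```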

Orthogonal Polynomial Coding

The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of coding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order k-1. Since race is not an ordered factor variable, let's use read as an example. First we need to create an ordered categorical variable from read.

In [46]: _, bins = np.histogram(hsb2.read, 3)

In [47]: try: # requires numpy main
   ....:     readcat = np.digitize(hsb2.read, bins, True)
   ....: except:
   ....:     readcat = np.digitize(hsb2.read, bins)
   ....: 

In [48]: hsb2['readcat'] = readcat

In [49]: hsb2.groupby('readcat').mean()['write']
Out[49]: 
readcat
0    46.000000
1    44.980392
2    53.356436
3    60.127660
Name: write, dtype: float64

In [50]: from patsy.contrasts import Poly

In [51]: levels = hsb2.readcat.unique().tolist()

In [52]: contrast = Poly().code_without_intercept(levels)

In [53]: print(contrast.matrix)
[[-0.6708  0.5    -0.2236]
 [-0.2236 -0.5     0.6708]
 [ 0.2236 -0.5    -0.6708]
 [ 0.6708  0.5     0.2236]]

In [54]: mod = ols("write ~ C(readcat, Poly)", data=hsb2)

In [55]: res = mod.fit()

In [56]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.320
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     30.73
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           2.51e-16
Time:                        16:08:41   Log-Likelihood:                -694.54
No. Observations:                 200   AIC:                             1397.
Df Residuals:                     196   BIC:                             1410.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                     51.1161      2.018     25.324      0.000      47.135      55.097
C(readcat, Poly).Linear       11.3501      5.348      2.122      0.035       0.803      21.897
C(readcat, Poly).Quadratic     3.8954      4.037      0.965      0.336      -4.066      11.857
C(readcat, Poly).Cubic        -2.4598      1.998     -1.231      0.220      -6.400       1.480
==============================================================================
Omnibus:                        9.741   Durbin-Watson:                   1.699
Prob(Omnibus):                  0.008   Jarque-Bera (JB):               10.263
Skew:                          -0.535   Prob(JB):                      0.00591
Kurtosis:                       2.703   Cond. No.                         13.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As you can see, readcat has a significant linear effect on the dependent variable write, but the quadratic and cubic effects are not significant.
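
Where do those contrast columns come from? One standard construction (a sketch, not necessarily how patsy implements it internally) is to orthonormalize the columns 1, x, x², x³ of a Vandermonde matrix over the equally spaced level codes; the result reproduces the Poly matrix above up to column signs.

```python
import numpy as np

k = 4
levels = np.arange(k, dtype=float)            # equally spaced level codes
V = np.vander(levels, k, increasing=True)     # columns: 1, x, x^2, x^3
Q, _ = np.linalg.qr(V)                        # orthonormalize (Gram-Schmidt)
poly = Q[:, 1:]                               # drop the constant column

print(np.allclose(poly.T @ poly, np.eye(k - 1)))  # columns are orthonormal
print(np.allclose(poly.sum(axis=0), 0.0))         # and orthogonal to the intercept
print(np.round(np.abs(poly), 4))                  # matches the matrix above up to sign
```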

User-Defined Coding

If you want to use your own coding, you must do so by writing a coding class that contains a code_with_intercept and a code_without_intercept method that each return a patsy.contrast.ContrastMatrix instance.

In [57]: from patsy.contrasts import ContrastMatrix
   ....: 
   ....: def _name_levels(prefix, levels):
   ....:     return ["[%s%s]" % (prefix, level) for level in levels]
   ....: 

In [58]: class Simple(object):
   ....:     def _simple_contrast(self, levels):
   ....:         nlevels = len(levels)
   ....:         contr = -1./nlevels * np.ones((nlevels, nlevels-1))
   ....:         contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels
   ....:         return contr
   ....: 
   ....:     def code_with_intercept(self, levels):
   ....:         contrast = np.column_stack((np.ones(len(levels)),
   ....:                                    self._simple_contrast(levels)))
   ....:         return ContrastMatrix(contrast, _name_levels("Simp.", levels))
   ....: 
   ....:     def code_without_intercept(self, levels):
   ....:         contrast = self._simple_contrast(levels)
   ....:         return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
   ....: 


In [60]: mod = ols("write ~ C(race, Simple)", data=hsb2)
   ....: res = mod.fit()
   ....: print(res.summary())
   ....: 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        16:08:41   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Simple)[Simp.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Simple)[Simp.2]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Simple)[Simp.3]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Last updated: Oct 03, 2024