Regression diagnostics¶

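This example file shows how to use a few of the statsmodels regression diagnostic tests in a real-life context. The Regression Diagnostics page describes further tests and gives more detail on the ones used here.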

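Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of the outputs is always included in the docstring and in the online statsmodels documentation. For presentation purposes, the examples below use the lzip(name, test) construct to pretty-print short descriptions.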
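As a minimal sketch of that labelling pattern: `lzip` from `statsmodels.compat` behaves like `list(zip(...))`, so the same pairing can be reproduced with the built-in `zip`. The tuple below is a hypothetical test result, hard-coded only for illustration.

```python
# Hypothetical output of a diagnostic test: a bare tuple of numbers.
test = (3.39, 0.18, -0.49, 3.00)

# Labels, in the order documented in the test's docstring.
name = ["Jarque-Bera", "Chi^2 two-tail prob.", "Skew", "Kurtosis"]

# Pair each label with its value; list(...) makes the result printable.
labeled = list(zip(name, test))
for label, value in labeled:
    print(f"{label}: {value}")
```

The labels carry no meaning of their own: if they are listed in a different order than the values returned by the test, the pairing is silently wrong, so the docstring remains the reference.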

Estimate a regression model¶

[1]:

%matplotlib inline

[2]:

from statsmodels.compat import lzip

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt

# Load data
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/HistData/Guerry.csv"
dat = pd.read_csv(url)

# Fit regression model (using the natural log of one of the regressors)
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=dat).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           1.90e-08
Time:                        16:05:44   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233      6.995      0.000     176.358     316.510
Literacy           -0.4889      0.128     -3.832      0.000      -0.743      -0.235
np.log(Pop1831)   -31.3114      5.977     -5.239      0.000     -43.199     -19.424
==============================================================================
Omnibus:                        3.713   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
Skew:                          -0.487   Prob(JB):                        0.183
Kurtosis:                       3.003   Cond. No.                         702.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Normality of the residuals¶

Jarque-Bera test:

[3]:

name = ["Jarque-Bera", "Chi^2 two-tail prob.", "Skew", "Kurtosis"]
test = sms.jarque_bera(results.resid)
lzip(name, test)

[3]:

[('Jarque-Bera', np.float64(3.39360802484318)),
 ('Chi^2 two-tail prob.', np.float64(0.18326831231663254)),
 ('Skew', np.float64(-0.4865803431122347)),
 ('Kurtosis', np.float64(3.003417757881634))]
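To make the relationship between these four numbers explicit, here is a minimal sketch of the textbook Jarque-Bera formula, JB = n/6 * (S^2 + (K - 3)^2 / 4), written in plain Python. The helper names are hypothetical, and the sample size of 86 is taken from the regression summary above. The p-value uses the fact that a chi-square variable with 2 degrees of freedom has survival function exp(-x/2).

```python
import math


def sample_moments(resid):
    """Return the biased (population-style) skewness and kurtosis of a sequence."""
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((r - mean) ** 2 for r in resid) / n
    m3 = sum((r - mean) ** 3 for r in resid) / n
    m4 = sum((r - mean) ** 4 for r in resid) / n
    return m3 / m2**1.5, m4 / m2**2


def jarque_bera_from_moments(n, skew, kurtosis):
    """Return (JB statistic, p-value) from sample size, skewness and kurtosis."""
    jb = n / 6.0 * (skew**2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Chi-square with 2 degrees of freedom: P(X > x) = exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Plug in the skewness and kurtosis reported above (86 observations).
jb, p_value = jarque_bera_from_moments(86, -0.4865803431122347, 3.003417757881634)
print(jb, p_value)
```

With these inputs the sketch should reproduce the reported statistic and probability up to rounding. Here almost all of the statistic comes from the skewness term, since the kurtosis is very close to 3.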

Omni test:

[4]:

name = ["Chi^2", "Two-tail probability"]
test = sms.omni_normtest(results.resid)
lzip(name, test)

[4]:

[('Chi^2', np.float64(3.7134378115971933)),
 ('Two-tail probability', np.float64(0.15618424580304735))]

Influence tests¶

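Once created, an object of class OLSInfluence holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by: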

[5]:

from statsmodels.stats.outliers_influence import OLSInfluence

test_class = OLSInfluence(results)
test_class.dfbetas[:5, :]

[5]:

array([[-0.00301154,  0.00290872,  0.00118179],
       [-0.06425662,  0.04043093,  0.06281609],
       [ 0.01554894, -0.03556038, -0.00905336],
       [ 0.17899858,  0.04098207, -0.18062352],
       [ 0.29679073,  0.21249207, -0.3213655 ]])

Explore other options by typing dir(test_class)
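To illustrate what a DFBETAS value measures, the following sketch recomputes it by brute force: drop one observation, refit, and scale the change in each coefficient by its leave-one-out standard error. This follows the standard textbook definition; the function name and the synthetic data are assumptions for illustration, and the library computes the same quantity more efficiently.

```python
import numpy as np


def dfbetas_leave_one_out(X, y):
    """DFBETAS by brute force: refit the model once per deleted observation."""
    n, k = X.shape
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    # Diagonal of (X'X)^{-1}; its square root scales each coefficient.
    xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
    out = np.empty((n, k))
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        resid_i = yi - Xi @ beta_i
        # Error variance estimated without observation i.
        sigma2_i = resid_i @ resid_i / (n - 1 - k)
        out[i] = (beta_full - beta_i) / np.sqrt(sigma2_i * xtx_inv_diag)
    return out


# Synthetic data (an assumption for illustration, not the Guerry data).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=30)
y[0] += 10.0  # shift one response to create an outlier

dfb = dfbetas_leave_one_out(X, y)
print(np.abs(dfb).max(axis=0))
```

Each row corresponds to one observation and each column to one coefficient, matching the layout of the array shown above.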

Useful information on leverage can also be plotted:

[6]:

from statsmodels.graphics.regressionplots import plot_leverage_resid2

fig, ax = plt.subplots(figsize=(8, 6))
fig = plot_leverage_resid2(results, ax=ax)

[Figure: leverage plotted against normalized residuals squared]

Other plotting options can be found on the Graphics page.
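The two quantities behind such a plot can be sketched directly with NumPy. Leverage is the diagonal of the hat matrix, h_ii = x_i' (X'X)^{-1} x_i. For the residuals, the sketch uses squared z-scores, which is one common normalization; the library's exact scaling may differ. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np


def leverage_and_normalized_resid2(X, y):
    """Return (hat-matrix diagonal, squared z-scored residuals) for an OLS fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    # h_ii = x_i' (X'X)^{-1} x_i, the i-th diagonal entry of the hat matrix.
    leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    # Standardize residuals to zero mean and unit variance, then square.
    z = (resid - resid.mean()) / resid.std()
    return leverage, z**2


# Synthetic data (an assumption for illustration, not the Guerry data).
rng = np.random.default_rng(0)
x = rng.normal(size=40)
x[0] = 6.0  # far from the other x values, so it has high leverage
X = np.column_stack([np.ones(40), x])
y = 1.0 + 2.0 * x + rng.normal(size=40)

leverage, resid2 = leverage_and_normalized_resid2(X, y)
print(leverage.argmax(), leverage.sum())
```

The leverages always sum to the number of columns in the design matrix, so a useful rule of thumb is to compare each value with the average, k/n.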

Multicollinearity¶

Condition number:

[7]:

np.linalg.cond(results.model.exog)

[7]:

np.float64(702.1792145490066)
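By default, `np.linalg.cond` returns the 2-norm condition number, that is, the ratio of the largest to the smallest singular value of the matrix. The sketch below checks this on a synthetic design matrix (an assumption for illustration) and shows that the number also reacts to the scale of the columns, not only to collinearity.

```python
import numpy as np

# Synthetic design matrix (an assumption for illustration): a constant,
# a regressor on a small scale and one on a much larger scale.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.ones(50),
    rng.normal(size=50),
    1000.0 + 100.0 * rng.normal(size=50),
])

# The 2-norm condition number is the ratio of the extreme singular values.
singular_values = np.linalg.svd(X, compute_uv=False)
cond_manual = singular_values.max() / singular_values.min()

# Rescaling each column to unit length often lowers the number a lot,
# which helps separate scaling effects from genuine collinearity.
X_scaled = X / np.linalg.norm(X, axis=0)
cond_scaled = np.linalg.cond(X_scaled)

print(cond_manual, cond_scaled)
```

A large value on the raw design matrix is therefore worth re-checking after rescaling before concluding that the regressors are nearly collinear.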

Heteroskedasticity tests¶

Breusch-Pagan test:

[8]:

name = ["Lagrange multiplier statistic", "p-value", "f-value", "f p-value"]
test = sms.het_breuschpagan(results.resid, results.model.exog)
lzip(name, test)

[8]:

[('Lagrange multiplier statistic', np.float64(4.893213374094005)),
 ('p-value', np.float64(0.08658690502352002)),
 ('f-value', np.float64(2.5037159462564618)),
 ('f p-value', np.float64(0.08794028782672814))]
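As a rough sketch of what the Lagrange multiplier statistic measures, the studentized (Koenker) form of the test regresses the squared residuals on the explanatory variables and uses n * R^2 from that auxiliary regression. The function name and the synthetic data below are assumptions for illustration. The p-value shortcut exp(-x/2) is valid only because there are two non-constant regressors, giving a chi-square reference distribution with 2 degrees of freedom.

```python
import math

import numpy as np


def breusch_pagan_lm(resid, exog):
    """Studentized LM statistic: n * R^2 of resid**2 regressed on exog."""
    u2 = resid**2
    coef = np.linalg.lstsq(exog, u2, rcond=None)[0]
    fitted = exog @ coef
    r_squared = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * r_squared


# Synthetic data (an assumption): the error spread grows with x1.
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(1.0, 5.0, size=n)
x2 = rng.normal(size=n)
exog = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(size=n) * x1

# Residuals of the main regression, then the auxiliary regression.
beta = np.linalg.lstsq(exog, y, rcond=None)[0]
resid = y - exog @ beta
lm = breusch_pagan_lm(resid, exog)

# Chi-square with 2 degrees of freedom: P(X > x) = exp(-x / 2).
p_value = math.exp(-lm / 2.0)
print(lm, p_value)
```

The same shortcut applied to the statistic reported above, exp(-4.8932 / 2), gives the reported p-value of about 0.0866.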

Goldfeld-Quandt test:

[9]:

name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(results.resid, results.model.exog)
lzip(name, test)

[9]:

[('F statistic', np.float64(1.1002422436378143)),
 ('p-value', np.float64(0.38202950686925324))]
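The idea behind the F statistic can be sketched as follows: split the ordered sample in two, fit a separate regression on each part, and take the ratio of the two residual variance estimates. The helper names, the split at the midpoint and the synthetic data are assumptions for illustration. The p-value, which requires the F distribution, is left out.

```python
import numpy as np


def ols_ssr(X, y):
    """Sum of squared residuals from an ordinary least squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid


def goldfeld_quandt_f(y, X, split=None):
    """F ratio of residual variances: second subsample over first subsample."""
    n, k = X.shape
    split = n // 2 if split is None else split
    var_first = ols_ssr(X[:split], y[:split]) / (split - k)
    var_second = ols_ssr(X[split:], y[split:]) / (n - split - k)
    return var_second / var_first


# Synthetic data (an assumption), sorted so that the noise grows with x.
rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(1.0, 5.0, size=n))
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.5 * x + rng.normal(size=n) * x

f_stat = goldfeld_quandt_f(y, X)
print(f_stat)
```

Because the result depends on how the observations are ordered, the test is most informative when the data are sorted by the variable suspected of driving the variance.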

Linearity¶

Harvey-Collier multiplier test for the null hypothesis that the linear specification is correct:

[10]:

name = ["t value", "p value"]
test = sms.linear_harvey_collier(results)
lzip(name, test)

[10]:

[('t value', np.float64(-1.0796490077759802)),
 ('p value', np.float64(0.2834639247569222))]

Last update: Oct 03, 2024