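The Datasets Package¶

statsmodels provides datasets (i.e. data *and* metadata) for use in examples, tutorials, model testing, etc.

Using Datasets from Stata¶

`webuse`(data[, baseurl, as_df])	Download and return an example dataset from Stata.

Using Datasets from R¶

The Rdatasets project gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible through the data attribute. For example: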

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")

In [3]: print(duncan_prestige.__doc__)
.. container::

   .. container::

      ====== ===============
      Duncan R Documentation
      ====== ===============

      .. rubric:: Duncan's Occupational Prestige Data
         :name: duncans-occupational-prestige-data

      .. rubric:: Description
         :name: description

      The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
      prestige and other characteristics of 45 U. S. occupations in
      1950.

      .. rubric:: Usage
         :name: usage

      .. code:: R

         Duncan

      .. rubric:: Format
         :name: format

      This data frame contains the following columns:

      type
         Type of occupation. A factor with the following levels:
         ``prof``, professional and managerial; ``wc``, white-collar;
         ``bc``, blue-collar.

      income
         Percentage of occupational incumbents in the 1950 US Census who
         earned $3,500 or more per year (about $36,000 in 2017 US
         dollars).

      education
         Percentage of occupational incumbents in 1950 who were high
         school graduates (which, were we cynical, we would say is
         roughly equivalent to a PhD in 2017)

      prestige
         Percentage of respondents in a social survey who rated the
         occupation as “good” or better in prestige

      .. rubric:: Source
         :name: source

      Duncan, O. D. (1961) A socioeconomic index for all occupations. In
      Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
      Press [Table VI-1].

      .. rubric:: References
         :name: references

      Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
      Models*, Third Edition. Sage.

      Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
      Regression*, Third Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
rownames                                     
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

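R Datasets Function Reference¶

`get_rdataset`(dataname[, package, cache])	Download and return an R dataset.
`get_data_home`([data_home])	Return the path of the statsmodels data directory.
`clear_data_home`([data_home])	Delete all the content of the data home cache.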

Available Datasets¶

Usage¶

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load_pandas()

The Dataset object follows the bunch pattern. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

Most datasets hold convenient representations of the data in the attributes endog and exog.

In [8]: data.endog.iloc[:5]
Out[8]: 
0    60323.0
1    61122.0
2    60171.0
3    61187.0
4    63221.0
Name: TOTEMP, dtype: float64

In [9]: data.exog.iloc[:5,:]
Out[9]: 
   GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4     96.2  328975.0  2099.0  3099.0  112075.0  1951.0

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

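If the dataset does not have a clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. For a dataset loaded with load_pandas, as in this session, both data and raw_data are pandas DataFrames, and the column names are given by the names attribute.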

In [12]: type(data.data)
Out[12]: pandas.core.frame.DataFrame

In [13]: type(data.raw_data)
Out[13]: pandas.core.frame.DataFrame

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

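Loading data as pandas objects¶

For many users it may be preferable to get the datasets as pandas DataFrame or Series objects. Each dataset module provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects.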

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object.

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata will be attached to model results.

In [19]: y, x = data.endog, data.exog

In [20]: res = sm.OLS(y, x).fit()

In [21]: res.params
Out[21]: 
GNPDEFL   -52.993570
GNP         0.071073
UNEMP      -0.423466
ARMED      -0.572569
POP        -0.414204
YEAR       48.417866
dtype: float64

In [22]: res.summary()
Out[22]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 TOTEMP   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          5.052e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):                    8.20e-22
Time:                        16:08:41   Log-Likelihood:                         -117.56
No. Observations:                  16   AIC:                                      247.1
Df Residuals:                      10   BIC:                                      251.8
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GNPDEFL      -52.9936    129.545     -0.409      0.691    -341.638     235.650
GNP            0.0711      0.030      2.356      0.040       0.004       0.138
UNEMP         -0.4235      0.418     -1.014      0.335      -1.354       0.507
ARMED         -0.5726      0.279     -2.052      0.067      -1.194       0.049
POP           -0.4142      0.321     -1.289      0.226      -1.130       0.302
YEAR          48.4179     17.689      2.737      0.021       9.003      87.832
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   1.277
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                0.605
Skew:                           0.476   Prob(JB):                        0.739
Kurtosis:                       3.031   Cond. No.                     4.56e+05
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Extra Information¶

If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional information¶

The idea for a datasets package was originally proposed by David Cournapeau.
To add datasets, see the notes on adding a dataset.

Last update: Oct 03, 2024