The Datasets Package

statsmodels provides datasets (i.e., data *and* metadata) for use in examples, tutorials, model testing, and so on.

Using Datasets from Stata

webuse(data[, baseurl, as_df])

Download and return an example dataset from Stata.

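Using Datasets from R

The Rdatasets project gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible through the data attribute. For example: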

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")

In [3]: print(duncan_prestige.__doc__)
.. container::

   .. container::

      ====== ===============
      Duncan R Documentation
      ====== ===============

      .. rubric:: Duncan's Occupational Prestige Data
         :name: duncans-occupational-prestige-data

      .. rubric:: Description
         :name: description

      The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
      prestige and other characteristics of 45 U. S. occupations in
      1950.

      .. rubric:: Usage
         :name: usage

      .. code:: R

         Duncan

      .. rubric:: Format
         :name: format

      This data frame contains the following columns:

      type
         Type of occupation. A factor with the following levels:
         ``prof``, professional and managerial; ``wc``, white-collar;
         ``bc``, blue-collar.

      income
         Percentage of occupational incumbents in the 1950 US Census who
         earned $3,500 or more per year (about $36,000 in 2017 US
         dollars).

      education
         Percentage of occupational incumbents in 1950 who were high
         school graduates (which, were we cynical, we would say is
         roughly equivalent to a PhD in 2017)

      prestige
         Percentage of respondents in a social survey who rated the
         occupation as “good” or better in prestige

      .. rubric:: Source
         :name: source

      Duncan, O. D. (1961) A socioeconomic index for all occupations. In
      Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
      Press [Table VI-1].

      .. rubric:: References
         :name: references

      Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
      Models*, Third Edition. Sage.

      Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
      Regression*, Third Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
rownames                                     
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

R Datasets Function Reference

get_rdataset(dataname[, package, cache])

Download and return an R dataset.

get_data_home([data_home])

Return the path of the statsmodels data directory.

clear_data_home([data_home])

Delete all the content of the data home cache.

Available Datasets

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load_pandas()

The Dataset object follows the bunch pattern. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog.iloc[:5]
Out[8]: 
0    60323.0
1    61122.0
2    60171.0
3    61187.0
4    63221.0
Name: TOTEMP, dtype: float64

In [9]: data.exog.iloc[:5,:]
Out[9]: 
   GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4     96.2  328975.0  2099.0  3099.0  112075.0  1951.0

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

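If the dataset does not have a clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains the full dataset, and raw_data contains the same data, with the column names given by the names attribute. When a dataset is loaded with load_pandas, as the Longley data is here, both attributes are pandas DataFrames: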

In [12]: type(data.data)
Out[12]: pandas.core.frame.DataFrame

In [13]: type(data.raw_data)
Out[13]: pandas.core.frame.DataFrame

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

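Loading data as pandas objects

For many users it may be preferable to get the datasets as pandas DataFrame or Series objects. Each dataset module provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects: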

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

Because pandas is integrated into the estimation classes, the metadata is attached to the model results:

In [19]: y, x = data.endog, data.exog

In [20]: res = sm.OLS(y, x).fit()

In [21]: res.params
Out[21]: 
GNPDEFL   -52.993570
GNP         0.071073
UNEMP      -0.423466
ARMED      -0.572569
POP        -0.414204
YEAR       48.417866
dtype: float64

In [22]: res.summary()
Out[22]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 TOTEMP   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          5.052e+04
Date:                Thu, 03 Oct 2024   Prob (F-statistic):                    8.20e-22
Time:                        16:08:41   Log-Likelihood:                         -117.56
No. Observations:                  16   AIC:                                      247.1
Df Residuals:                      10   BIC:                                      251.8
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GNPDEFL      -52.9936    129.545     -0.409      0.691    -341.638     235.650
GNP            0.0711      0.030      2.356      0.040       0.004       0.138
UNEMP         -0.4235      0.418     -1.014      0.335      -1.354       0.507
ARMED         -0.5726      0.279     -2.052      0.067      -1.194       0.049
POP           -0.4142      0.321     -1.289      0.226      -1.130       0.302
YEAR          48.4179     17.689      2.737      0.021       9.003      87.832
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   1.277
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                0.605
Skew:                           0.476   Prob(JB):                        0.739
Kurtosis:                       3.031   Cond. No.                     4.56e+05
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

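Extra Information

If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example: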

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional Information


Last updated: Oct 03, 2024