LondonRob

Reputation: 78903

Retrieve statistics from R lm regression into pandas with rpy2

Inspired by the linear models example from the docs, I'd like to print a nice summary after running an lm command.

When I run (see the final line in the example)

print(base.summary(stats.lm('foo ~ bar')))

I get a whole function listing which starts as follows:

Call:
(function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
{
    ret.x <- x
    ret.y <- y
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)

With the desired R output at the bottom:

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
foo        5.0320     0.2202   22.85 9.55e-15 ***
bar        4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.9818,    Adjusted R-squared:  0.9798 
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16

This is moderately problematic, but becomes unworkable when the data being fed to lm is a pandas.DataFrame, because base.summary seems to want to print all the data.

Is there a way to just get the nice formatted R output in a pd.DataFrame without all the extra gubbins?

Upvotes: 3

Views: 1136

Answers (1)

LondonRob

Reputation: 78903

For posterity, here's a really nice way to get the numbers from an lm back into a pd.DataFrame (thanks to @Metrics for the tip-off about broom):

import numpy as np
import pandas as pd

def _run_regression(data, y_name):
    """
    Run a linear regression, in R, using `data` with dependent variable
    `y_name` and independent variables all other columns of `data`.
    """
    from rpy2.robjects.packages import importr
    stats = importr('stats')
    broom = importr('broom')  # the R package `broom` must be installed
    # `y ~ .` regresses y on every other column of `data`; tidy() turns
    # the fitted model into a coefficient table with one row per term.
    lm = broom.tidy(stats.lm('%s ~ .' % y_name, data=data))
    return _extract_R_df(lm).set_index('term')

def _extract_R_df(df):
    """
    Extract the R DataFrame `df` as a pd.DataFrame. This slightly
    longer method is necessary because `np.asarray(df)` drops the
    exponent on very small numbers!
    """
    # Convert one named column at a time rather than the whole frame.
    return pd.DataFrame({name: np.asarray(df.rx(name))[0] for name in df.names})

Which results in a DataFrame similar to this:

                 estimate   p.value     statistic     std.error
term                                                           
(Intercept) -3.709995e-16  0.000056 -4.712554e+00  7.872579e-17
x_is         8.000000e-01  0.000000  1.067919e+16  7.491204e-17
v_is         2.000000e-01  0.000000  2.107838e+15  9.488394e-17
d_ij        -2.000000e-01  0.000000 -2.970482e+14  6.732913e-16
d1           1.000000e-01  0.000000  4.045155e+14  2.472093e-16
d2           3.000000e-01  0.000000  5.320521e+14  5.638545e-16
d3           7.000000e-01  0.000000  1.779338e+15  3.934048e-16
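
Once the numbers are in pandas, the columns can be reordered and renamed to look more like the coefficient table that R prints. The following is only a sketch and not part of the original approach: `to_r_style` and `R_STYLE_COLUMNS` are hypothetical names, and the small frame is a stand-in for the output of `_run_regression`, filled with the coefficients quoted in the question.

```python
import pandas as pd

# broom's tidy column names mapped to the headers printed by R's summary().
# The dict order also defines the column order of the result.
R_STYLE_COLUMNS = {
    "estimate": "Estimate",
    "std.error": "Std. Error",
    "statistic": "t value",
    "p.value": "Pr(>|t|)",
}

def to_r_style(tidy_df):
    """Hypothetical helper: reorder and rename tidy columns to mimic summary(lm)."""
    return tidy_df[list(R_STYLE_COLUMNS)].rename(columns=R_STYLE_COLUMNS)

# Stand-in for the result of _run_regression (values taken from the question).
tidy = pd.DataFrame(
    {
        "estimate": [5.0320, 4.6610],
        "p.value": [9.55e-15, 3.62e-14],
        "statistic": [22.85, 21.16],
        "std.error": [0.2202, 0.2202],
    },
    index=pd.Index(["foo", "bar"], name="term"),
)

r_style = to_r_style(tidy)
print(r_style)
```

Selecting with an explicit column list keeps the order independent of however the DataFrame happened to be built.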

Upvotes: 2
