Using R lm function in Python on pandas DataFrame

Question

I would like to use the R function lm to calculate a linear regression in Python. My data is in a pandas DataFrame, as in this small example:

import numpy as np
import pandas as pd
d2 = {'V1' : pd.Series([1,2,3,1,2,3,1,2,3,3]),
     'V2' : pd.Series([2,2,3,1,1,3,3,3,3,2]),
     'V3' : pd.Series([1.,2., 3., 1., 2., 3., 1., 1., 2., 2.]),
     'V4' : pd.Series([1,2,1,2,1,1,2,2,1,2])}

df2 = pd.DataFrame(d2)

I would like to run the R function lm in Python:

model = lm(V1~.,data=df2)

Using the ~. shorthand is essential for me, because my real data set is huge and I'd like to use all the other columns as X variables.

After that, I would like to extract a vector with column names for which the coefficients are not NA.

I've read about the rpy2 package, but I am a Python beginner and some help would be great. All the examples I have found so far use just one X variable and no pandas DataFrame, which does not help me.

Thank you!

akrun · Accepted Answer

Here is one option with pyper. After creating the connection, assign the DataFrame into the R environment. Then run the R functions on that dataset and fetch the result back into Python with r.get.

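from pyper import R

# Start an R session that can exchange pandas objects
r = R(use_pandas=True)
# Copy the DataFrame into the R session under the name rdf2
r.assign("rdf2", df2)
# Fit V1 on all other columns
r('model <- lm(V1 ~ ., data = rdf2)')
# Names of the non-NA coefficients; [-1] drops the intercept
r('nm1 <- names(which(!is.na(coef(model))))[-1]')
out = r.get('nm1')
list(out)
# ['V2', 'V3', 'V4']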

Checking the output on the R side, with the same data read from a CSV file:

tmp <- read.csv('tmptest.csv')
model <- lm(V1~.,data= tmp)
nm1 <-  names(which(!is.na(coef(model))))[-1]
nm1
#[1] "V2" "V3" "V4"
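
If starting an R session is not an option, the same question (which predictors would get a non-NA coefficient) can be approximated in plain NumPy. The sketch below is not lm itself. It assumes that lm reports NA for a predictor whose column is a linear combination of the intercept and the predictors before it, and it checks that with a rank test. The helper name non_aliased_columns is hypothetical, and all predictors are assumed to be numeric.

```python
import numpy as np
import pandas as pd


def non_aliased_columns(df, response):
    """Return predictor names that add new information to the design matrix.

    Hypothetical helper: walks the predictors in column order and keeps a
    column only if it raises the rank of [intercept, kept columns so far].
    """
    predictors = [c for c in df.columns if c != response]
    # Start from the intercept column, which lm includes by default.
    basis = np.ones((len(df), 1))
    rank = 1
    kept = []
    for name in predictors:
        candidate = np.column_stack([basis, df[name].to_numpy(dtype=float)])
        new_rank = np.linalg.matrix_rank(candidate)
        if new_rank > rank:
            # The column is not a linear combination of the earlier ones.
            basis, rank = candidate, new_rank
            kept.append(name)
    return kept


df2 = pd.DataFrame({
    'V1': [1, 2, 3, 1, 2, 3, 1, 2, 3, 3],
    'V2': [2, 2, 3, 1, 1, 3, 3, 3, 3, 2],
    'V3': [1., 2., 3., 1., 2., 3., 1., 1., 2., 2.],
    'V4': [1, 2, 1, 2, 1, 1, 2, 2, 1, 2],
})

print(non_aliased_columns(df2, 'V1'))
# ['V2', 'V3', 'V4']

# A column that is an exact multiple of V2 carries no new information.
df3 = df2.assign(V5=2 * df2['V2'])
print(non_aliased_columns(df3, 'V1'))
# ['V2', 'V3', 'V4']
```

The rank test relies on NumPy's default tolerance, so columns that are only nearly collinear may be classified differently than by R. For results that must match lm exactly, the pyper approach above is the safer choice.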
