Jerome Smith-Uldall

Reputation: 19

rxDataStep in RevoScaleR package crashing

I am trying to create a new factor column on an .xdf data set with the rxDataStep function in RevoScaleR:

rxDataStep(nyc_lab1
         , nyc_lab1
         , transforms = list(RatecodeID_desc = factor(RatecodeID, levels=RatecodeID_Levels, labels=RatecodeID_Labels))
         , overwrite=T
         )

where nyc_lab1 is a pointer to a .xdf file. I know that the file is fine because I imported it into a data table and successfully created the new factor column there.

However, I get the following error message:

Error in doTryCatch(return(expr), name, parentenv, handler) : 
  ERROR: The sample data set for the analysis has no variables.

What could be wrong?

Upvotes: 0

Views: 516

Answers (1)

Hong Ooi

Reputation: 57686

First, RevoScaleR has some warts when it comes to replacing data. In particular, overwriting the input file with the output can sometimes cause rxDataStep to fail for unknown reasons.

Even if it works, you probably shouldn't do it anyway. If there is a mistake in your code, you risk destroying your data. Instead, write to a new file each time, and only delete the old file once you've verified you no longer need it.

Second, any object you reference that isn't part of the dataset itself has to be passed in via the transformObjects argument. See ?rxTransform. Basically, the rx* functions are meant to be portable to distributed computing contexts, where the R session that runs the code isn't the same as your local session. In this scenario, you can't assume that objects in your global environment will exist in the session where the code executes.

Try something like this:

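# Write to a new .xdf file rather than overwriting the input
nyc_lab2 <- RxXdfData("nyc_lab2.xdf")
nyc_lab2 <- rxDataStep(nyc_lab1, nyc_lab2,
    transforms=list(
         RatecodeID_desc=factor(RatecodeID, levels=.levs, labels=.labs)
    ),
    # Objects from the local session must be passed in explicitly
    transformObjects=list(
         .levs=RatecodeID_Levels,
         .labs=RatecodeID_Labels
    )
)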

Or, you could use dplyrXdf, which will handle all this file management business for you:

nyc_lab2 <- nyc_lab1 %>% factorise(RatecodeID)
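
To see what the transforms expression in the rxDataStep call computes, here is a minimal base R sketch that needs no RevoScaleR. The contents of RatecodeID_Levels and RatecodeID_Labels are not shown in the question, so the values below are made up for illustration, as are the small data frame standing in for one chunk of the .xdf file and the helper make_desc. The helper receives the lookup vectors as explicit arguments, which mimics how transformObjects supplies .levs and .labs instead of relying on the global environment.

```r
# Hypothetical lookup vectors; the question does not show the real ones.
RatecodeID_Levels <- 1:6
RatecodeID_Labels <- c("Standard", "JFK", "Newark",
                       "Nassau/Westchester", "Negotiated", "Group ride")

# Stand-in for one chunk of the .xdf file (made-up values).
chunk <- data.frame(RatecodeID = c(1, 2, 5, 1, 99))

# Same expression as in `transforms`. The lookup vectors arrive as
# arguments, not as globals, mirroring what transformObjects does.
make_desc <- function(RatecodeID, .levs, .labs) {
    factor(RatecodeID, levels = .levs, labels = .labs)
}

chunk$RatecodeID_desc <- make_desc(chunk$RatecodeID,
                                   .levs = RatecodeID_Levels,
                                   .labs = RatecodeID_Labels)
print(chunk)
```

Any RatecodeID value that is not in the levels vector (99 here) comes out as NA, so it is worth checking for unexpected NAs in the new column before deleting the original file.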

Upvotes: 2
