SpiritusPrana
SpiritusPrana

Reputation: 480

Use of substr() on DataFrame column in SparkR

I am using SparkR and want to use the substr() command to isolate the last character of a string that is contained in a column. I can get substr() to work if I set the StartPosition and EndPosition to a constant:

substr(sdfIris$Species, 8, 8)

But when I try to set these parameters using a value sourced from the DataFrame:

sdfIris <- createDataFrame(sqlContext, iris)
sdfIris$Len <- length(sdfIris$Species)
sdfIris$Last <- substr(sdfIris$Species, sdfIris$Len, sdfIris$Len)

Error in as.integer(start - 1) : cannot coerce type 'S4' to vector of type 'integer'

It seems that the result being returned from sdfIris$Len is perhaps a one-cell DataFrame, and the parameter needs an integer.

I have tried collect(sdfIris$Len), but:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"Column"’

This seems incongruous. substr() seems to see sdfIris$Len as a DataFrame, but collect() seems to see it as a Column.

I have already identified a work-around by using registerTempTable and using SparkSQL's substr to isolate the last character, but I was hoping to avoid the unnecessary steps of switching to SQL.

How can I use SparkR substr() on a DataFrame column with dynamic Start and Finish parameters?

Upvotes: 2

Views: 1378

Answers (1)

zero323
zero323

Reputation: 330413

It is not optimal but you can use expr:

df <- createDataFrame(
  sqlContext,
  data.frame(s=c("foo", "bar", "foobar"), from=c(1, 2, 0), to=c(2, 3, 5))
)

select(df, expr("substr(s, from, to)")) %>% head()

##   substr(s,from,to)
## 1                fo
## 2                ar
## 3             fooba

or selectExpr:

selectExpr(df, "substr(s, from, to)") %>% head()

##   substr(s,from,to)
## 1                fo
## 2                ar
## 3             fooba

as well as equivalent SQL query.

Upvotes: 3

Related Questions