Reputation: 1039
I would be very surprised if this kind of problem could not be solved with sparklyr:
iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
...
aDataFrame %>% mutate(newValue=gsub("-","",d))
...
}
I receive this error:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun
But with this line:
aDataFrame %>% mutate(newValue=toupper("hello"))
things work. Some help?
Upvotes: 3
Views: 6402
Reputation: 18585
It may be worth adding that the available documentation states:
Hive Functions
Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Language Reference UDF page provides the list of available functions.
As stated in the documentation, a viable solution should be achievable with regexp_replace:
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
sparklyr approach
Considering the above, it should be possible to combine a sparklyr pipeline with regexp_replace to achieve an effect equivalent to applying gsub on the desired column. Tested code removing the - character from the variable d within sparklyr could be built as follows:
aDataFrame %>%
mutate(clnD = regexp_replace(d, "-", "")) %>%
# ...
where class(aDataFrame) returns: "tbl_spark" ...
Upvotes: 11
Reputation: 568
I would strongly recommend you read the sparklyr documentation before proceeding. In particular, you're going to want to read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, only a very limited subset of R functions is available for use on sparklyr dataframes, and gsub is not one of those functions (but toupper is). If you really need gsub, you're going to have to collect the data into a local dataframe, then gsub it (you can still use mutate), then copy_to back to Spark.
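A rough sketch of that collect / gsub / copy_to round trip is below. The function names strip_dashes and gsub_round_trip, the column name d, and the table name "cleaned_tbl" are made up for illustration; sc is assumed to be an existing sparklyr connection. Keep in mind this only makes sense when the data fits in memory on the driver.

```r
# Local step: plain R on an ordinary data frame, so gsub() is available.
strip_dashes <- function(local_df, column) {
  local_df[[column]] <- gsub("-", "", local_df[[column]], fixed = TRUE)
  local_df
}

# Round trip: Spark -> local R -> Spark.
# Assumes `sc` is an open sparklyr connection and `spark_tbl` is a tbl_spark.
# `name` is a hypothetical name for the table written back to Spark.
gsub_round_trip <- function(sc, spark_tbl, column, name = "cleaned_tbl") {
  local_df <- dplyr::collect(spark_tbl)        # pull all rows to the driver
  local_df <- strip_dashes(local_df, column)   # gsub runs locally here
  dplyr::copy_to(sc, local_df, name = name, overwrite = TRUE)
}

# The local step can be checked without Spark:
dates <- data.frame(d = c("2017-01-31", "2017-02-01"), stringsAsFactors = FALSE)
strip_dashes(dates, "d")$d
# [1] "20170131" "20170201"
```

With a live connection, usage would look something like gsub_round_trip(sc, aDataFrame, "d"). For anything large, staying inside Spark with regexp_replace avoids moving the data at all.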
Upvotes: 3