Levi Brackman
Levi Brackman

Reputation: 355

functions for tidyr's gather and unite using

I have a data frame as follows:

library(tidyverse)
m <- matrix(rep(1:28,each = 10), ncol = 28)
colnames(m) <- c("co1","col2", LETTERS)
df <- as_tibble(m)
df

newdf<-df %>%
unite("newcol", c(col1, col2), sep = " ", remove=F)  %>%
gather("col4", "col5", A:Z)

I am trying to write the following and have it work in sparklyr:

dataframe %>%
  unite(newcol, col1, col2, sep = " ", remove=F)  %>%
  gather("col4", "col5", A:Z)

Unite: Where newcol is the new column name that contains the united values of col1 and col2.

Gather: Where A:Z are the keys that are to be placed in the new col4. col5 is the new column that will contain the variable currently in columns A:Z

Does anyone have a user-defined function of a wrapper that will do the gather and unite function within the the spraklyr package perhaps using SQL?

Upvotes: 1

Views: 865

Answers (1)

Jaime Caffarel
Jaime Caffarel

Reputation: 2469

You can do it with SQL and the DBI library.

Assuming that you have your data.frame copied in Spark. (note that you have a typo in your colnames function call, you have set the name of the first column to "co1", not "col1").

copy_to(sc, df)

Now, you can create a new Spark table with your desired columns with this:

DBI::dbSendQuery(sc, "CREATE TABLE new_df AS 
SELECT CONCAT(t1.col1, t1.col2) AS newcol, t1.col1, t1.col2, t2.col4, t2.col5 
FROM df t1 
            LATERAL VIEW EXPLODE (map('A', A, 'B', B, 'C', C, 'D', D, 'E', E, 'F', F, 'G', G, 
            'H', H, 'I', I, 'J', J, 'K', K, 'L', L, 'M', M, 'N', N, 'O', O, 'P', P, 'Q', Q, 
            'R', R, 'S', S, 'T', T, 'U', U, 'V', V, 'W', W, 'X', X, 'Y', Y, 'Z', Z)) t2 as col4, col5 ORDER BY col4 DESC")

Now, you can get the reference to this new Spark table and use it in Sparklyr:

ref_new_df <- tbl(sc, "new_df")

I hope this helps.

EDITED to show the result

> copy_to(sc, df)
> DBI::dbSendQuery(sc, "CREATE TABLE new_df AS 
+ SELECT CONCAT(t1.col1, t1.col2) AS newcol, t1.col1, t1.col2, t2.col4, t2.col5 
+ FROM df t1 
+             LATERAL VIEW EXPLODE (map('A', A, 'B', B, 'C', C, 'D', D, 'E', E, 'F', F, 'G', G, 
+             'H', H, 'I', I, 'J', J, 'K', K, 'L', L, 'M', M, 'N', N, 'O', O, 'P', P, 'Q', Q, 
+             'R', R, 'S', S, 'T', T, 'U', U, 'V', V, 'W', W, 'X', X, 'Y', Y, 'Z', Z)) t2 as col4, col5 ORDER BY col4 DESC")
<DBISparkResult>
  SQL  CREATE TABLE new_df AS 
SELECT CONCAT(t1.col1, t1.col2) AS newcol, t1.col1, t1.col2, t2.col4, t2.col5 
FROM df t1 
            LATERAL VIEW EXPLODE (map('A', A, 'B', B, 'C', C, 'D', D, 'E', E, 'F', F, 'G', G, 
            'H', H, 'I', I, 'J', J, 'K', K, 'L', L, 'M', M, 'N', N, 'O', O, 'P', P, 'Q', Q, 
            'R', R, 'S', S, 'T', T, 'U', U, 'V', V, 'W', W, 'X', X, 'Y', Y, 'Z', Z)) t2 as col4, col5 ORDER BY col4 DESC
  ROWS Fetched: 0 [complete]
       Changed: 0

Upvotes: 3

Related Questions