Extract elements from Spark array column using SparklyR "select"

Question

I have a Spark dataframe in a SparklyR interface, and I'm trying to extract elements from an array column.

df <- copy_to(sc, data.frame(A=c(1,2),B=c(3,4)))            ## BUILD DATAFRAME
dfnew <- df %>% mutate(C=Array(A,B)) %>% select(C)          ## CREATE ARRAY COL


> dfnew                                                     ## VIEW DATAFRAME
# Source: spark<?> [?? x 1]
  C        
  <list>   
1 <dbl [2]>
2 <dbl [2]>


dfnew %>% sdf_schema()                                      ## VERIFY COLUMN TYPE IS ARRAY
$C$name
[1] "C"

$C$type
[1] "ArrayType(DoubleType,true)"

I can extract an element with "mutate"...

dfnew %>% mutate(myfirst_element=C[[1]]) 

# Source: spark<?> [?? x 2]
  C         myfirst_element
  <list>              <dbl>
1 <dbl [2]>               3
2 <dbl [2]>               4

But I want to extract an element on the fly with "select". However, all attempts just return the full column:

> dfnew %>% select("C"[1]) 
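# Source: spark<?> [?? x 1]
  C        
  <list>   
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]]) 
# Source: spark<?> [?? x 1]
  C        
  <list>   
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]][1]) 
# Source: spark<?> [?? x 1]
  C        
  <list>   
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]][[1]]) 
# Source: spark<?> [?? x 1]
  C        
  <list>   
1 <dbl [2]>
2 <dbl [2]>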

I've also tried using "sdf_select", without success:

> dfnew %>% sdf_select("C"[[1]][1])
# Source: spark<?> [?? x 1]
  C        
  <list>   
1 <dbl [2]>
2 <dbl [2]>

In PySpark you can access elements explicitly, e.g. col("C")[1]; in Scala you can use getItem or element_at; and in SparkR you can also use element_at. Does anyone know a solution in a SparklyR setting? Thanks in advance for any help.
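
A likely reason every attempt above returns the whole column: the indexing is applied to the R string "C", not to the Spark column. Indexing a length-one character vector in R gives back the same string, so each call appears to be equivalent to select("C"). A minimal base-R check of that claim, needing no Spark connection (the variable names are made up for illustration):

```r
# Each expression below is what was passed to select() in the attempts.
# Indexing a length-one character vector returns the same string.
attempts <- list(
  "C"[1],
  "C"[[1]],
  "C"[[1]][1],
  "C"[[1]][[1]]
)

# Compare every attempt with the plain column name "C".
all_plain_name <- all(vapply(attempts, identical, logical(1), "C"))
print(all_plain_name)
```

If all_plain_name is TRUE, select() never sees an element index at all, only the name of column C.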

Marek Fiołka · Accepted Answer

The following solution came to mind.

library(tidyverse)

df = tibble(group = 1:5) %>%
  mutate(C = map(group, ~array(c(1,2),c(3,4)))) 

df
# # A tibble: 5 x 2
#   group C            
#   <int> <list>       
# 1     1 <dbl [3 x 4]>
# 2     2 <dbl [3 x 4]>
# 3     3 <dbl [3 x 4]>
# 4     4 <dbl [3 x 4]>
# 5     5 <dbl [3 x 4]>

df$C
# [[1]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# [3,]    1    2    1    2
# 
# [[2]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# [3,]    1    2    1    2
# 
# [[3]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# [3,]    1    2    1    2
# 
# [[4]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# [3,]    1    2    1    2
# 
# [[5]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# [3,]    1    2    1    2



df %>% pull(C) %>% map(~.x[1,])
# [[1]]
# [1] 1 2 1 2
# 
# [[2]]
# [1] 1 2 1 2
# 
# [[3]]
# [1] 1 2 1 2
# 
# [[4]]
# [1] 1 2 1 2
# 
# [[5]]
# [1] 1 2 1 2

df %>% pull(C) %>% map(~.x[,2])
# [[1]]
# [1] 2 1 2
# 
# [[2]]
# [1] 2 1 2
# 
# [[3]]
# [1] 2 1 2
# 
# [[4]]
# [1] 2 1 2
# 
# [[5]]
# [1] 2 1 2

df %>% pull(C) %>% map(~.x[1:2,])
# [[1]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# 
# [[2]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# 
# [[3]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# 
# [[4]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1
# 
# [[5]]
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    1    2
# [2,]    2    1    2    1

I think that is what you are looking for. The same indexing works on arrays of any size.
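
The code above runs on a local tibble. For the Spark table in the question, one option worth trying is transmute() instead of select(): select() only picks existing columns by name, while transmute() evaluates an expression and keeps just the result. The sketch below is a hedged illustration, not a verified recipe. It assumes a local Spark installation reachable with spark_connect(master = "local"); the table name "df" and the column names second_element and first_element are made up. Two indexing details matter here: C[[1]] is translated to the Spark SQL expression C[1], which is zero-based (this is why the mutate() output in the question shows 3 and 4, the B values), whereas element_at() is passed through to Spark SQL (Spark 2.4 or later) and is one-based.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Same data as in the question; the table name "df" is an arbitrary choice.
df <- copy_to(sc, data.frame(A = c(1, 2), B = c(3, 4)),
              name = "df", overwrite = TRUE)
dfnew <- df %>% mutate(C = Array(A, B)) %>% select(C)

# transmute() computes the new column and drops C in one step.
# C[[1]] becomes Spark SQL C[1] (zero-based), i.e. the second element.
second <- dfnew %>% transmute(second_element = C[[1]])

# element_at() is handed to Spark SQL unchanged and is one-based,
# so index 1L refers to the first element (the A values).
first <- dfnew %>% transmute(first_element = element_at(C, 1L))

# Bring the small results back into R for inspection.
second_local <- collect(second)
first_local <- collect(first)

spark_disconnect(sc)
```

Writing the index as 1L rather than 1 keeps it an integer literal in the generated SQL, which element_at() expects.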
