How do I modify part of a column in a Spark data frame

Question

I am trying to modify part of a column from a Spark data frame. The row selection is based on the vector (in R env) ID.X. The replacement is another vector (in R env) Role. I have tried the following:

> sdf.bigset %>% filter(`_id` %in% ID.X) %>% 
  mutate(data_role= Role )

This crashes my R session.

And the following

> head(DT.XSamples)
                        _id   Role 
1: 5996e9e12a2aa6315127ed0e  Training                 
2: 5996e9e12a2aa6315127ed0f  Training                  
3: 5996e9e12a2aa6315127ed10  Training  

> setkey(DT.XSamples,`_id`)

> Lookup.XyDaRo <- function(x){
  unlist(DT.XSamples[x,Role])
}

> sdf.bigset %>% filter(`_id` %in% ID.X) %>% rowwise()%>%
  mutate(data_role= Lookup.XyDaRo(`_id`) )

As well as the following

> Fn.lookup.XyDaRo <- function(id,role){
  ifelse(is.na(role), unlist(DT.XSamples[id,Role] ),  role )
}

> sdf.bigset%>% rowwise() %>%
  mutate(data_role= Fn.lookup.XyDaRo(`_id`,data_role))

In both cases I get:

Error: is.data.frame(data) is not TRUE

sdf.bigset is a Spark data frame. DT.XSamples is a data table living in R.

Any idea what I am doing wrong, or how it should be properly done?

zero323 · Accepted Answer

Let's say sdf.bigset looks like this:

sdf.bigset <- copy_to(sc, data.frame(`id` = 1:10, data_role = "Unknown"))

and XSamples (standing in for your DT.XSamples) is defined as:

XSamples <- data.frame(
  `id` = c(3, 5, 9), role = c("Training", "Dev", "Secret")
)

Copy XSamples to Spark:

sdf.XSamples <- copy_to(sc, XSamples)

left_join and coalesce:

left_join(sdf.bigset, sdf.XSamples, by="id") %>% 
  mutate(data_role = coalesce(role, data_role))

# Source:   lazy query [?? x 3]
# Database: spark_connection
      id data_role role    
   <int> <chr>     <chr>   
 1     1 Unknown   NA      
 2     2 Unknown   NA      
 3     3 Training  Training
 4     4 Unknown   NA      
 5     5 Dev       Dev     
 6     6 Unknown   NA      
 7     7 Unknown   NA      
 8     8 Unknown   NA      
 9     9 Secret    Secret  
10    10 Unknown   NA

Finally, drop role with a negative select, i.e. select(-role).
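Putting the three steps together (join, coalesce, drop the helper column), a minimal sketch might look like the following. The helper name update_roles is hypothetical, and the demo uses plain local data frames so that it runs without a Spark connection. The same dplyr verbs are the ones applied to the Spark tables above, where big and lookup would be the tables returned by copy_to.

```r
library(dplyr)

# Hypothetical helper: overwrite data_role wherever the lookup table has a role,
# keep the existing value otherwise, then drop the temporary role column.
update_roles <- function(big, lookup) {
  big %>%
    left_join(lookup, by = "id") %>%                   # unmatched rows get role = NA
    mutate(data_role = coalesce(role, data_role)) %>%  # prefer role when it is present
    select(-role)                                      # negative select removes the helper
}

# Local stand-ins for sdf.bigset and sdf.XSamples (assumption: same column names).
big <- data.frame(id = 1:10, data_role = "Unknown", stringsAsFactors = FALSE)
lookup <- data.frame(
  id = c(3L, 5L, 9L),
  role = c("Training", "Dev", "Secret"),
  stringsAsFactors = FALSE
)

result <- update_roles(big, lookup)
```

With Spark tables the call would be update_roles(sdf.bigset, sdf.XSamples). The result stays lazy until it is collected or written out.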

Regarding your code:

Vector replacement won't work because a Spark DataFrame is closer to a relation (in the relational algebra sense) than to an R data frame. Row order is in general not defined, so positional operations like this are not implemented.
The data.table variant won't work because you cannot execute plain R code against a Spark DataFrame; the only exception is the (incredibly inefficient) spark_apply.
