Creating new rows in a dataframe using column values

Question

I'm trying to format a messy dataframe in R by creating new rows from column values. A snip of the data looks like this.

id  producer  pcountry  collaborator  ccountry   val
1    J&J       USA        Pfizer       USA       25

2    Biodiem   AUS        PhaseBio     USA       65
                          GeneScience  China     

3    Shire     Ireland       N/A        N/A      54

4    Sanofi    France        N/A        N/A      64

Essentially, I want to use the values in the last two columns to create new rows in the dataframe. So far, I have this code, using the splitstackshape package.

df2 <- cSplit(df, 4, "
", "long")

This does the job for the entries that have more than one value in the collaborator column (like row 2 above). Using my code gives me this: `

id  producer  pcountry  collaborator  ccountry   val
1    J&J       USA        Pfizer       USA       25

2    Biodiem   AUS        PhaseBio     USA       65
                                       China     

3    Biodiem   AUS        Genescience  USA       65
                                       China

4    Shire     Ireland       N/A        N/A      54

5    Sanofi    France        N/A        N/A      64

However, there are more things that I want to do with the data that I am struggling with. I want the values of the collaborator column to match the values of the ccountry column, so row 3 here would have the value China in the ccountry column, while row 2 would have USA. I tried adding both columns in the code, like so df2 <- cSplit(df, c(4,5), " ", "long"), but that only makes a big mess.

Finally, because the code only makes new entries using the new line separator, it ignores the ones that only have 1 value (like row 1), because they don't have new lines. I want those to be included as well.

Is there any way to change this code to do these 2 additional steps, or would I have to write a function for that?

EDIT: This is the data snippet

     id producer pcountry collaborator              ccountry         val
                                          
1     1 J&J      USA      Pfizer                    USA               25
2     2 Biodiem  AUS      "PhaseBio
Genescience" "USA
China"    65
3     3 Shire    Ireland  NA                        NA                54
4     4 Sanofi   France   NA                        NA                64

structure(list(id = c(1, 2, 3, 4), producer = c("J&J", "Biodiem", 
"Shire", "Sanofi"), pcountry = c("USA", "AUS", "Ireland", "France"
), collaborator = c("Pfizer", "PhaseBio
Genescience", NA, 
NA), ccountry = c("USA", "USA
China", NA, NA), val = c(25, 
65, 54, 64)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", 
"data.frame"))

and this is the expected result

     id producer pcountry collaborator ccountry   val
                       
1     1 J&J      USA      NA           NA          25
2     2 J&J      USA      Pfizer       USA         25
3     3 Biodiem  AUS      NA           NA          65
4     4 Biodiem  AUS      PhaseBio     USA         65
5     5 Biodiem  AUS      Genescience  China       65
6     6 Shire    Ireland  NA           NA          54
7     7 Sanofi   France   NA           NA          64

structure(list(id = c(1, 2, 3, 4, 5, 6), producer = c("J&J", 
"J&J", "Biodiem", "Biodiem", "Biodiem", "Shire"), pcountry = c("USA", 
"USA", "AUS", "AUS", "AUS", "Ireland"), collaborator = c(NA, 
"Pfizer", NA, "PhaseBio", "Genescience", NA), ccountry = c(NA, 
"USA", NA, "USA", "China", NA), val = c(25, 25, 65, 65, 65, 54
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

iod · Accepted Answer

It's super easy with tidyr:

require(tidyr)
separate_rows(df, collaborator,ccountry, sep="
")

# A tibble: 5 x 6
     id producer pcountry collaborator ccountry   val
                       
1     1 J&J      USA      Pfizer       USA         25
2     2 Biodiem  AUS      PhaseBio     USA         65
3     2 Biodiem  AUS      Genescience  China       65
4     3 Shire    Ireland  NA           NA          54
5     4 Sanofi   France   NA           NA          64

If you want all those extra rows with NAs for collaborator and ccountry, you can do this:

require(tidyr)
require(dplyr)
df %>% mutate(collaborator=ifelse(is.na(collaborator), NA, paste0("
",collaborator)), 
    ccountry=ifelse(is.na(ccountry), NA, paste0("
",ccountry))) %>% # Create extra rows before non NA rows
  separate_rows(collaborator,ccountry, sep="
") %>% 
  mutate(collaborator=ifelse(collaborator=="",NA,collaborator), 
    ccountry=ifelse(ccountry=="", NA, ccountry)) # change empty strings to NAs
# A tibble: 7 x 6
     id producer pcountry collaborator ccountry   val
                       
1     1 J&J      USA      NA           NA          25
2     1 J&J      USA      Pfizer       USA         25
3     2 Biodiem  AUS      NA           NA          65
4     2 Biodiem  AUS      PhaseBio     USA         65
5     2 Biodiem  AUS      Genescience  China       65
6     3 Shire    Ireland  NA           NA          54
7     4 Sanofi   France   NA           NA          64

Creating new rows in a dataframe using column values

Answers (2)

Related Questions