Reputation: 133
I'm trying to format a messy dataframe in R by creating new rows from column values. A snip of the data looks like this.
id producer pcountry collaborator ccountry val
1 J&J USA Pfizer USA 25
2 Biodiem AUS PhaseBio USA 65
GeneScience China
3 Shire Ireland N/A N/A 54
4 Sanofi France N/A N/A 64
Essentially, I want to use the values in the last two columns to create new rows in the dataframe. So far, I have this code, using the splitstackshape
package.
df2 <- cSplit(df, 4, "\r", "long")
This does the job for the entries that have more than one value in the collaborator column (like row 2 above). Using my code gives me this: `
id producer pcountry collaborator ccountry val
1 J&J USA Pfizer USA 25
2 Biodiem AUS PhaseBio USA 65
China
3 Biodiem AUS Genescience USA 65
China
4 Shire Ireland N/A N/A 54
5 Sanofi France N/A N/A 64
However, there are more things that I want to do with the data that I am struggling with. I want the values of the collaborator column to match the values of the ccountry column, so row 3 here would have the value China
in the ccountry
column, while row 2 would have USA
. I tried adding both columns in the code, like so df2 <- cSplit(df, c(4,5), "\r", "long")
, but that only makes a big mess.
Finally, because the code only makes new entries using the new line separator, it ignores the ones that only have 1 value (like row 1), because they don't have new lines. I want those to be included as well.
Is there any way to change this code to do these 2 additional steps, or would I have to write a function for that?
EDIT: This is the data snippet
id producer pcountry collaborator ccountry val
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1 J&J USA Pfizer USA 25
2 2 Biodiem AUS "PhaseBio\r\nGenescience" "USA\r\nChina" 65
3 3 Shire Ireland NA NA 54
4 4 Sanofi France NA NA 64
structure(list(id = c(1, 2, 3, 4), producer = c("J&J", "Biodiem",
"Shire", "Sanofi"), pcountry = c("USA", "AUS", "Ireland", "France"
), collaborator = c("Pfizer", "PhaseBio\r\nGenescience", NA,
NA), ccountry = c("USA", "USA\r\nChina", NA, NA), val = c(25,
65, 54, 64)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
and this is the expected result
id producer pcountry collaborator ccountry val
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1 J&J USA NA NA 25
2 2 J&J USA Pfizer USA 25
3 3 Biodiem AUS NA NA 65
4 4 Biodiem AUS PhaseBio USA 65
5 5 Biodiem AUS Genescience China 65
6 6 Shire Ireland NA NA 54
7 7 Sanofi France NA NA 64
structure(list(id = c(1, 2, 3, 4, 5, 6), producer = c("J&J",
"J&J", "Biodiem", "Biodiem", "Biodiem", "Shire"), pcountry = c("USA",
"USA", "AUS", "AUS", "AUS", "Ireland"), collaborator = c(NA,
"Pfizer", NA, "PhaseBio", "Genescience", NA), ccountry = c(NA,
"USA", NA, "USA", "China", NA), val = c(25, 25, 65, 65, 65, 54
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Upvotes: 2
Views: 87
Reputation: 7592
It's super easy with tidyr
:
require(tidyr)
separate_rows(df, collaborator,ccountry, sep="\r\n")
# A tibble: 5 x 6
id producer pcountry collaborator ccountry val
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1 J&J USA Pfizer USA 25
2 2 Biodiem AUS PhaseBio USA 65
3 2 Biodiem AUS Genescience China 65
4 3 Shire Ireland NA NA 54
5 4 Sanofi France NA NA 64
If you want all those extra rows with NAs for collaborator and ccountry, you can do this:
require(tidyr)
require(dplyr)
df %>% mutate(collaborator=ifelse(is.na(collaborator), NA, paste0("\r\n",collaborator)),
ccountry=ifelse(is.na(ccountry), NA, paste0("\r\n",ccountry))) %>% # Create extra rows before non NA rows
separate_rows(collaborator,ccountry, sep="\r\n") %>%
mutate(collaborator=ifelse(collaborator=="",NA,collaborator),
ccountry=ifelse(ccountry=="", NA, ccountry)) # change empty strings to NAs
# A tibble: 7 x 6
id producer pcountry collaborator ccountry val
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1 J&J USA NA NA 25
2 1 J&J USA Pfizer USA 25
3 2 Biodiem AUS NA NA 65
4 2 Biodiem AUS PhaseBio USA 65
5 2 Biodiem AUS Genescience China 65
6 3 Shire Ireland NA NA 54
7 4 Sanofi France NA NA 64
Upvotes: 1
Reputation: 107767
Consider a base R approach using strsplit
inside a by
grouping process:
# BUILD LIST OF SUBSET DFs WITH EXPANDED ROWS
df_list <- by(df, df[c("id", "producer", "pcountry", "val")], function(sub)
data.frame(id = sub$id[1],
producer = sub$producer[1],
pcountry = sub$pcountry[1],
collaborator = c(unlist(strsplit(as.character(sub$collaborator), "\r\n")[1]), NA),
ccountry = c(unlist(strsplit(as.character(sub$ccountry), "\r\n")[1]), NA),
val = sub$val[1])
)
# APPEND SUBSETS
final_df <- unique(do.call(rbind, df_list))
# RE-ORDER ROWS AND RESET ROW NAMES
final_df <- with(final_df, final_df[order(id, producer, pcountry, val),])
row.names(final_df) <- NULL
final_df
# id producer pcountry collaborator ccountry val
# 1 1 J&J USA Pfizer USA 25
# 2 1 J&J USA <NA> <NA> 25
# 3 2 Biodiem AUS PhaseBio USA 65
# 4 2 Biodiem AUS Genescience China 65
# 5 2 Biodiem AUS <NA> <NA> 65
# 6 3 Shire Ireland <NA> <NA> 54
# 7 4 Sanofi France <NA> <NA> 64
Upvotes: 1