user2300940
user2300940

Reputation: 2385

Split rows in data.frame based on symbol

I would like to split rows that has strings separated by / or ab into new rows. Please see input and output examples below. Maybe we could use strsplit(d$values,split='/', fixed=TRUE), but how do you make them into new rows?

head

> head(d,3)
                    values   ind
54                 miR-197  9846
55                 miR-197  9846
113 miR-221/222/222ab/1928 56204

dput

    structure(list(values = c("miR-197", "miR-197", "miR-221/222/222ab/1928"
), ind = structure(c(4L, 4L, 6L), .Label = c("6482", "4057", 
"60481", "9846", "7414", "56204", "84957", "29924", "6095", "8301", 
"2355", "88455", "23047", "57590", "5829", "162", "3091", "9766", 
"23406", "3646", "22870", "22898", "8775", "8178", "2077", "64115", 
"6158", "5007", "8567", "10019", "26127", "4739", "6678", "27013", 
"6146", "51060", "1997", "847", "4035", "79026", "8192", "5782", 
"1032", "4354", "5791", "2752", "9873", "6386", "5962", "2230", 
"6938", "6727", "7090", "92912", "55784", "409", "23521", "6279", 
"51312", "7357", "2040", "2934", "9219", "2180", "219333", "114908", 
"50807", "90268", "3098", "1974", "56990", "7791", "162989", 
"9159", "7086", "51762", "9318", "23582", "10632", "54815", "1938", 
"3576", "11214", "167227", "156", "2745", "6138", "391", "10933", 
"5501", "3638", "2316", "2869", "10527", "255809"), class = "factor")), row.names = c(54L, 
55L, 113L), class = "data.frame")

expected output

values   ind
miR-197  9846
miR-197  9846
miR-221 56204
222a    56204
222b    56204
1928    56204

Upvotes: 3

Views: 60

Answers (1)

Aaron Hayman
Aaron Hayman

Reputation: 525

Shamelessly stealing Ronak Shah's initial solution. Then splitting the 'ab'.

This solution is not generally applicable to all letters, so attention and modification would need to be used if trying to scale this to a larger problem with letters beyond "a" and "b".

First we add an index column to help with ordering, which will be removed at the end.

df <- tidyr::separate_rows(df, values, sep = "\\/")
df$index <- seq_len(nrow(df))
df_ab <- df[grep('ab',df$values),]
df$values <- gsub("ab","a",df$values)
df_ab$values <- gsub("ab","b",df_ab$values)
df <- rbind(df,df_ab)
df <- df[order(df$index),]
df$index <- NULL

##     values   ind
## 1  miR-197  9846
## 2  miR-197  9846
## 3  miR-221 56204
## 4      222 56204
## 5     222a 56204
## 51    222b 56204
## 6     1928 56204

Upvotes: 2

Related Questions