Reputation: 1122

Remove rows that contain more than one string in a cell in a data frame

I have a data frame with 8 columns and many many rows. I would like to remove the rows contain more than one string in column 6 and 7 and output a data frame with only one string in column 6 and 7

df:

ID  Content_ID  Chromosome  Start   Stop    Reference   Alternate Length
1299675221  backbone    12  99675221    99675221    GG  T   0
1298583685  backbone    12  98583685    98583685    C   T   0
129833474   backbone    12  9833474     9833474     C   T   0
1297722695  backbone    12  97722695    97722695    A   G   0
1297381269  backbone    12  97381269    97381269    T   C   0
1297081605  backbone    12  97081605    97081605    G   AA  0
1297058068  backbone    12  97058068    97058068    T   C   0
1295891848  backbone    12  95891848    95891848    CCTT ATA    0
1294164312  backbone    12  94164312    94164312    T   C   0
12940191    backbone    12  940191      940191      T   C   0

desired output:

ID  Content_ID  Chromosome  Start   Stop    Reference   Alternate   Length
1298583685  backbone    12  98583685    98583685    C   T   0
129833474   backbone    12  9833474     9833474     C   T   0
1297722695  backbone    12  97722695    97722695    A   G   0
1297381269  backbone    12  97381269    97381269    T   C   0
1297058068  backbone    12  97058068    97058068    T   C   0
1294164312  backbone    12  94164312    94164312    T   C   0
12940191    backbone    12  940191      940191      T   C   0

Upvotes: 3

Answers (3)

rafa.pereira

Reputation: 13807

Simple as this using data.table

library(data.table)

setDT(df)
df <- df[ nchar(Reference)==1 & nchar(Alternate)==1]

Upvotes: 1

lmo

Reputation: 38500

Similarly, you can paste the columns together and then keep rows where the number of characters is equal to 3, one for each column and a space.

df[nchar(paste(df$Reference, df$Alternate)) == 3,]
           ID Content_ID Chromosome    Start     Stop Reference Alternate Length
2  1298583685   backbone         12 98583685 98583685         C         T      0
3   129833474   backbone         12  9833474  9833474         C         T      0
4  1297722695   backbone         12 97722695 97722695         A         G      0
5  1297381269   backbone         12 97381269 97381269         T         C      0
7  1297058068   backbone         12 97058068 97058068         T         C      0
9  1294164312   backbone         12 94164312 94164312         T         C      0
10   12940191   backbone         12   940191   940191         T         C      0

Upvotes: 2

akrun

Reputation: 887138

We can loop through the columns 6 and 7 using lapply, check whether the number of characters is 1, use Reduce with & to get a logical vector by comparing the corresponding elements of the list, use it to subset the rows of 'df'

df[Reduce(`&`, lapply(df[6:7], function(x) nchar(x)==1)),]
#        ID Content_ID Chromosome    Start     Stop Reference Alternate Length
#2  1298583685   backbone         12 98583685 98583685         C         T      0
#3   129833474   backbone         12  9833474  9833474         C         T      0
#4  1297722695   backbone         12 97722695 97722695         A         G      0
#5  1297381269   backbone         12 97381269 97381269         T         C      0
#7  1297058068   backbone         12 97058068 97058068         T         C      0
#9  1294164312   backbone         12 94164312 94164312         T         C      0
#10   12940191   backbone         12   940191   940191         T         C      0

Or another option is rowSums

df[!rowSums(nchar(as.matrix(df[6:7]))!=1),]

Upvotes: 3

Remove rows that contain more than one string in a cell in a data frame

Answers (3)

Related Questions