Reputation: 13
I'm working with a data.frame: the row names are probe names, and one column holds information about the region of the gene where each probe lies (1stExon, Body, etc.). But I have a problem:
Gene Gene_Region
cg14736058 PROM1;PROM1;PROM1 TSS200;5'UTR;1stExon
. . 1stExon;1stExon;1stExon;1stExon
. . 1stExon;1stExon;1stExon
. . 1stExon;1stExon;1stExon
. . 1stExon;1stExon;5'UTR;5'UTR;5'UTR;1stExon
. . 1stExon;1stExon
. . 1stExon;1stExon;Body
. . Body;Body
I want the rows where only one region is present. If a region such as "1stExon" is repeated but is the only region in the row, I want that row too. For example, I want the last row because "Body" is the only region there, even though it is repeated, so I count it as a single region. I don't know if I am making myself clear. PS: I don't know in advance how many times a string is repeated.
Upvotes: 0
Views: 58
Reputation: 1764
This should do the trick. First, collapse each string so that it contains only its unique values. If there is only one unique value, the separator ; disappears, so you can then drop the rows that still contain a ;.
# Load Data
df <- structure(list(Gene_Region = c("TSS1500;5'UTR", "TSS1500;TSS1500;TSS1500;TSS1500", "Body", "1stExon;5'UTR", "1stExon;1stExon;1stExon", "Body", "Body;Body;Body;Body;Body" ), UCSC_RefGene_Name = c("USP44;USP44", "COL11A2;COL11A2;COL11A2;COL11A2", "SOX2OT", "CRYGD;CRYGD", "ENPP2;ENPP2;ENPP2", "PGLYRP2", "KCNQ2;KCNQ2;KCNQ2;KCNQ2;KCNQ2" )), .Names = c("Gene_Region", "UCSC_RefGene_Name"), row.names = c("cg13879483", "cg08481075", "cg13294849", "cg22399133", "cg02534163", "cg16206460", "cg13782274"), class = "data.frame")
# Collapse
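df$unique_regions <- sapply(strsplit(df$Gene_Region, ";"),
                            function(x) paste(unique(x), collapse = ";"))
# Keep only the rows with no semicolon left in the unique_regions column.
# grepl() is used rather than -grep(), because df[-grep(...), ] returns zero rows when nothing matches.
df2 <- df[!grepl(";", df$unique_regions, fixed = TRUE), ]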
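As a variation on the same idea, the number of distinct regions per row can be counted directly instead of searching for a leftover separator. This is only a minimal sketch: the data frame `df_example` and its probe names are hypothetical, made up to mirror the Gene_Region column from the question.

```r
# Hypothetical example data (not the asker's real probes)
df_example <- data.frame(
  Gene_Region = c("TSS200;5'UTR;1stExon",
                  "1stExon;1stExon;1stExon",
                  "1stExon;1stExon;Body",
                  "Body;Body"),
  row.names = c("probe1", "probe2", "probe3", "probe4"),
  stringsAsFactors = FALSE
)

# Split each string on ";" and count how many distinct regions it contains
n_regions <- vapply(
  strsplit(df_example$Gene_Region, ";", fixed = TRUE),
  function(x) length(unique(x)),
  integer(1)
)

# Keep the rows with exactly one distinct region, however often it is repeated
single_region <- df_example[n_regions == 1, , drop = FALSE]
```

Here `drop = FALSE` keeps the result a data.frame even when only one column is present, and the count in `n_regions` can also be stored as a column if you want to inspect rows with two or more regions later.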
Upvotes: 2