Reputation: 43
I have a data frame (Annot_Subset) that looks like this:
IlmnID UCSC_RefGene_Group
cg00050873 Body;TSS1500
cg00212031 TSS200
cg00213748
cg00214611 1stExon;5'UTR
cg00455876
cg01707559 TSS200;TSS200;TSS200
cg02004872 1stExon;5'UTR
What I would like to do is clean up the last column as follows:
Replace entries that contain different strings (e.g. Body;TSS1500) with the string "Multiple_locations".
Condense entries in which the same string is repeated (e.g. TSS200;TSS200;TSS200) so that the string appears only once.
Replace empty entries with the string "Intergenic".
To give an example:
IlmnID UCSC_RefGene_Group
cg00050873 Multiple_locations
cg00212031 TSS200
cg00213748 Intergenic
cg00214611 Multiple_locations
cg00455876 Intergenic
cg01707559 TSS200
cg02004872 Multiple_locations
I have written a function that does this, but I was wondering whether there is a more elegant and efficient way to approach the problem, especially since my data frame has 485,000 rows.
This is what I have come up with:
# Split each entry on ";" (the column must be character, not factor)
Gene_Group_Split <- strsplit(Annot_Subset$UCSC_RefGene_Group, ";")

Clean.Gene.Group <- function(x) {
  Gene_Group_Cleaned <- vector(mode = "character", length = length(x))
  for (i in seq_along(x)) {
    if (length(x[[i]]) >= 1) {
      unique_set <- unique(x[[i]])
      if (length(unique_set) == 1) {
        # One distinct value (possibly repeated): keep it once
        Gene_Group_Cleaned[i] <- unique_set
      } else {
        # Several distinct values
        Gene_Group_Cleaned[i] <- "Multiple_locations"
      }
    } else {
      # Empty entry: strsplit() returned character(0)
      Gene_Group_Cleaned[i] <- "Intergenic"
    }
  }
  return(Gene_Group_Cleaned)
}

Gene_Group_2 <- Clean.Gene.Group(Gene_Group_Split)
Upvotes: 1
Views: 109
Reputation: 4024
library(dplyr)
library(stringi)
library(tidyr)
df = read.table(text="
IlmnID UCSC_RefGene_Group
cg00050873 Body;TSS1500
cg00212031 TSS200
cg00213748
cg00214611 1stExon;5'UTR
cg00455876
cg01707559 TSS200;TSS200;TSS200
cg02004872 1stExon;5'UTR", fill=TRUE, header=TRUE, stringsAsFactors=FALSE)

# Classify the distinct values found for one IlmnID
classify = function(string_vector)
  if (length(string_vector) > 1) "Multiple_locations" else
    if (string_vector == "") "Intergenic" else
      string_vector

df %>%
  mutate(UCSC_RefGene_Group =
           UCSC_RefGene_Group %>%
           stri_split_fixed(";")) %>%
  unnest(UCSC_RefGene_Group) %>%
  distinct %>%
  group_by(IlmnID) %>%
  summarize(class = classify(UCSC_RefGene_Group))
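
To see what the classify helper does on its own, it can be called on plain character vectors. This is only a sketch: it restates the helper with explicit braces so the block runs by itself, and it assumes duplicates have already been removed (the job of distinct in the pipeline above) and that an empty entry arrives as "".

```r
# Same logic as the helper above, restated with braces so this block is self-contained
classify <- function(string_vector) {
  if (length(string_vector) > 1) {
    "Multiple_locations"      # more than one distinct value
  } else if (string_vector == "") {
    "Intergenic"              # single empty string
  } else {
    string_vector             # single non-empty value, returned unchanged
  }
}

classify(c("Body", "TSS1500"))  # "Multiple_locations"
classify("")                    # "Intergenic"
classify("TSS200")              # "TSS200"
```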
Upvotes: 0
Reputation: 56149
Try this example:
df <- read.table(text="
IlmnID UCSC_RefGene_Group
cg00050873 Body;TSS1500
cg00212031 TSS200
cg00213748
cg00214611 1stExon;5'UTR
cg00455876
cg01707559 TSS200;TSS200;TSS200
cg02004872 1stExon;5'UTR", fill=TRUE, header=TRUE, stringsAsFactors=FALSE)

df$Type <-
  unlist(
    lapply(df$UCSC_RefGene_Group, function(i){
      x <- unique(unlist(strsplit(i, split = ";")))
      ifelse(length(x) > 1, "Multiple_locations",
             ifelse(length(x) == 0, "Intergenic", x))
    })
  )

#result
df
#       IlmnID   UCSC_RefGene_Group               Type
# 1 cg00050873         Body;TSS1500 Multiple_locations
# 2 cg00212031               TSS200             TSS200
# 3 cg00213748                              Intergenic
# 4 cg00214611        1stExon;5'UTR Multiple_locations
# 5 cg00455876                              Intergenic
# 6 cg01707559 TSS200;TSS200;TSS200             TSS200
# 7 cg02004872        1stExon;5'UTR Multiple_locations
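
With around 485,000 rows, a variant that splits the whole column in one strsplit() call and uses vapply() for a type-stable result may be worth trying. This is a sketch, not a benchmarked solution. The function name classify_groups is made up for illustration, and treating NA like an empty entry is an assumption that may not suit your data.

```r
# Hypothetical helper (name is illustrative): classify ';'-separated entries
classify_groups <- function(groups) {
  groups <- as.character(groups)      # guard against factor columns
  groups[is.na(groups)] <- ""         # assumption: treat NA like an empty entry
  parts <- strsplit(groups, ";", fixed = TRUE)  # one call for the whole column
  vapply(parts, function(p) {
    u <- unique(p)
    if (length(u) == 0L) {
      "Intergenic"                    # "" splits to character(0)
    } else if (length(u) == 1L) {
      u                               # one distinct value, kept once
    } else {
      "Multiple_locations"            # several distinct values
    }
  }, character(1), USE.NAMES = FALSE)
}

df <- data.frame(
  IlmnID = c("cg00050873", "cg00212031", "cg00213748", "cg00214611",
             "cg00455876", "cg01707559", "cg02004872"),
  UCSC_RefGene_Group = c("Body;TSS1500", "TSS200", "", "1stExon;5'UTR",
                         "", "TSS200;TSS200;TSS200", "1stExon;5'UTR"),
  stringsAsFactors = FALSE
)
df$Type <- classify_groups(df$UCSC_RefGene_Group)
```

Unlike unlist(lapply(...)), vapply() raises an error if any element does not yield exactly one string, so a malformed entry fails loudly instead of silently changing the length of the result.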
Upvotes: 2