Reputation: 45
I've looked through the following pages on using regex to isolate a string:
Regular expression to extract text between square brackets
What is a non-capturing group? What does (?:) do?
Split data frame string column into multiple columns
I have a dataframe that contains protein/gene identifiers. In some cases a cell holds two or more of these strings (separated by commas) because of multiple matches from a list. They represent multiple matches from inferred evidence; when the hits cannot be easily discriminated, all of them are put into the column. The first string is the strongest match, and I'm only interested in keeping that one, because the group will likely share the same type of annotation (i.e. type of protein, gene ontology, similar function, etc.). If I split the multiple entries into more rows, it would appear that I have evidence that they all exist in my dataset, but at the empirical level I don't.
My dataframe:
protein
1 sp|P50213|IDH3A_HUMAN
2 sp|Q9BZ95|NSD3_HUMAN
3 sp|Q92616|GCN1_HUMAN
4 sp|Q9NSY1|BMP2K_HUMAN
5 sp|O75643|U520_HUMAN
6 sp|O15357|SHIP2_HUMAN
523 sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|
524 sp|Q96KB5|TOPK_HUMAN
525 sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN
526 sp|O00299|CLIC1_HUMAN
527 sp|P25940|CO5A3_HUMAN
The output I am trying to create:
uniprot gene
P50213 IDH3A
Q9BZ95 NSD3
Q92616 GCN1
P12277 KCRB
I'm trying to use the extract and separate functions to do this:
extract(df, protein, into = c("uniprot", "gene"), regex = c("sp|(.*?)|","(.*?)_"), remove = FALSE)
results in:
Error: is_string(regex) is not TRUE
Trying separate to at least break apart the two in multiple steps:
separate(df, protein, into = c("uniprot", "gene"), sep = "|", remove = FALSE)
results in:
Warning message:
Expected 2 pieces. Additional pieces discarded in 528 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
protein uniprot gene
1 sp|P50213|IDH3A_HUMAN s
2 sp|Q9BZ95|NSD3_HUMAN s
3 sp|Q92616|GCN1_HUMAN s
4 sp|Q9NSY1|BMP2K_HUMAN s
5 sp|O75643|U520_HUMAN s
6 sp|O15357|SHIP2_HUMAN s
What is the best way to use regex in this scenario, and are extract or separate the best way to go about this? Any suggestions would be greatly appreciated. Thanks!
Update based on feedback:
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
df1 <- separate(df, protein, into = "protein", sep = ",")
#i'm only interested in the first match, because science
df2 <- extract(df1, protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)", remove = FALSE)
#create new columns with uniprot code and gene id, no _HUMAN
#df2
# protein uniprot gene
#1 sp|P50213|IDH3A_HUMAN P50213 IDH3A
#2 sp|Q9BZ95|NSD3_HUMAN Q9BZ95 NSD3
#3 sp|Q92616|GCN1_HUMAN Q92616 GCN1
#4 sp|Q9NSY1|BMP2K_HUMAN Q9NSY1 BMP2K
#5 sp|O75643|U520_HUMAN O75643 U520
#6 sp|O15357|SHIP2_HUMAN O15357 SHIP2
#523 sp|P10599|THIO_HUMAN P10599 THIO
#524 sp|Q96KB5|TOPK_HUMAN Q96KB5 TOPK
#525 sp|P12277|KCRB_HUMAN P12277 KCRB
#526 sp|O00299|CLIC1_HUMAN O00299 CLIC1
#and the answer using %>% pipes (this is what I aspire to)
df_filtered <- df %>%
  separate(protein, into = "protein", sep = ",") %>%
  extract(protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)") %>%
  select(uniprot, gene)
#df_filtered
# uniprot gene
#1 P50213 IDH3A
#2 Q9BZ95 NSD3
#3 Q92616 GCN1
#4 Q9NSY1 BMP2K
#5 O75643 U520
#6 O15357 SHIP2
#523 P10599 THIO
#524 Q96KB5 TOPK
#525 P12277 KCRB
#526 O00299 CLIC1
Upvotes: 1
Views: 514
Reputation: 887421
We can capture each piece of the pattern as a group ((...)) in extract. Here, we match sp at the beginning (^) of the string, followed by a | (a regex metacharacter, so it is escaped with \\), followed by one or more characters that are not a | captured as the first group, followed by another |, and then the second set of characters captured as the second group:
library(tidyverse)
extract(df, protein, into = c("uniprot", "gene"),
regex = "^sp\\|([^|]+)\\|([^|]+).*")
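To see what the two groups actually capture, the same pattern can be tried in base R with regexec() and regmatches(), without tidyr. This is only a sketch on a hypothetical vector x copied from the question's data, not a replacement for extract. Note that ([^|]+) keeps the _HUMAN suffix and, for a cell with several comma-separated hits, runs past the comma into the next sp.

```r
# Hypothetical sample taken from the question's data
x <- c("sp|P50213|IDH3A_HUMAN",
       "sp|Q96KB5|TOPK_HUMAN",
       "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN")

# Same pattern as in the extract() call above
pattern <- "^sp\\|([^|]+)\\|([^|]+).*"

# regexec() records the full match plus one position per capture group
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (element 1) and keep the two groups
parts <- t(vapply(m, function(g) g[2:3], character(2)))
colnames(parts) <- c("uniprot", "gene")
parts
```

The \\ before each | matters: left unescaped, | means alternation in a regular expression rather than a literal pipe character.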
If there are multiple instances of 'sp', then separate the rows into long format with separate_rows and then use extract:
df %>%
  separate_rows(protein, sep=",") %>%
  extract(protein, into = c("uniprot", "gene"),
          "^sp\\|([^|]+)\\|([^|]*).*")
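For comparison, the comma split can be sketched in base R with strsplit(). The vector x is a hypothetical sample copied from the question's data, and the second step (keeping only the first hit of each cell) reflects what the question asks for rather than what separate_rows does.

```r
# Hypothetical sample taken from the question's data
x <- c("sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
       "sp|Q96KB5|TOPK_HUMAN",
       "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN")

# fixed = TRUE treats the comma as a literal string, not a regex
pieces <- strsplit(x, ",", fixed = TRUE)

# Long form: one element per identifier, similar in spirit to separate_rows()
long <- unlist(pieces)

# Only the first (strongest) hit of each original cell
first <- vapply(pieces, `[`, character(1), 1)
first
```

Keeping only the first piece avoids creating extra rows for hits that the question says are not independently supported by the data.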
One entry has only two sets of words (sp|THIO_HUMAN|, with no accession). To make it work in that case:
df %>%
  separate_rows(protein, sep=",") %>%
  extract(protein, into = "gene", "([^|]*HUMAN)", remove = FALSE) %>%
  mutate(uniprot = str_extract(protein, "(?<=sp\\|)[^_]+(?=\\|)")) %>%
  select(uniprot, gene)
# uniprot gene
#1 P50213 IDH3A_HUMAN
#2 Q9BZ95 NSD3_HUMAN
#3 Q92616 GCN1_HUMAN
#4 Q9NSY1 BMP2K_HUMAN
#5 O75643 U520_HUMAN
#6 O15357 SHIP2_HUMAN
#7 P10599 THIO_HUMAN
#8 <NA> THIO_HUMAN
#9 Q96KB5 TOPK_HUMAN
#10 P12277 KCRB_HUMAN
#11 P17540 KCRS_HUMAN
#12 P12532 KCRU_HUMAN
#13 O00299 CLIC1_HUMAN
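The <NA> in row 8 comes from the lookaround pattern: it only matches text that sits directly after sp|, contains no underscore, and is directly followed by another |. A minimal base R sketch of the same pattern, using regexpr() with perl = TRUE on a hypothetical two-element vector x, shows both outcomes:

```r
# Hypothetical sample: one regular entry and the entry without an accession
x <- c("sp|P50213|IDH3A_HUMAN", "sp|THIO_HUMAN|")

# Lookbehind for "sp|", then characters other than "_", then lookahead for "|"
pattern <- "(?<=sp\\|)[^_]+(?=\\|)"

# perl = TRUE is needed for lookarounds; regexpr() gives -1 where nothing matches
hit <- regexpr(pattern, x, perl = TRUE)

# Fill matched positions and leave the rest as NA
uniprot <- rep(NA_character_, length(x))
uniprot[hit > 0] <- regmatches(x, hit)
uniprot
```

In the second element, the only text after sp| and before the next | contains an underscore, so nothing matches and the value stays NA.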
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
Upvotes: 1