Reputation: 135
I have a dataset which looks something like this:
long_name x y short_name
Adhesion G protein-coupled receptor E2 (ADGRE2) 10 10 ADGRE2
Adhesion G-protein coupled receptor G2 (ADGRG2) 12 12 ADX2
ADM (ADM) 13 13 ADM
ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38) 14 14 ACH1
What I want to do is create an additional column which will state whether or not the value of short_name is in the value of long_name, to produce a TRUE/FALSE (or present/not) value in a new column.
I saw some advice on here about using the grepl function for looking for a bit of a string within another string. The issue I'm having is trying to iterate it over the whole file.
I have something like:
for (row in 1:length(nrows(combined_proteins))){
long_name = proteins[1]
short_name = proteins[4]
if grepl(short_name, long_name) = TRUE{
proteins$presence = "Present"
else proteins$presence = "Not"
}
}
But this obviously doesn't work and I'm not really sure whether this is even the smartest way to go about it. Any help appreciated.
Upvotes: 0
Views: 42
Reputation: 271
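There are a couple of issues with your for loop. You want to iterate over either 1:nrow() or 1:length(); length(nrow()) will almost always return 1, so the loop body only runs once. Note also that the base R function is nrow(), not nrows(). Your if statement needs parentheses around its condition, and the else belongs after the closing brace of the if block, so it should be if (condition) {value} else {other value}. Finally, inside the loop you need to index by row, otherwise each iteration works on whole columns rather than a single value.
If the name of your data frame is proteins then the following should work.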
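for (row in 1:nrow(proteins)){
  print(proteins$long_name[row])  # progress output; safe to remove
  # pick out the values for the current row only
  long_name = proteins$long_name[row]
  short_name = proteins$short_name[row]
  # grepl() returns TRUE when short_name is found in long_name
  if (grepl(short_name, long_name)){
    proteins$presence[row] = "Present"
  } else {
    proteins$presence[row] = "Not"
  }
}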
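To see why the original loop header only ran once, here is a minimal sketch. The data frame df is a hypothetical stand-in for proteins, and seq_len() is my suggestion rather than something from the code above; it behaves like 1:nrow() but also copes with a zero-row data frame.

```r
# Hypothetical stand-in for `proteins`, only used to count rows.
df <- data.frame(short_name = c("ADGRE2", "ADX2", "ADM", "ACH1"),
                 stringsAsFactors = FALSE)

n_wrong <- length(nrow(df))    # nrow(df) is a single number, so its length is 1
n_right <- nrow(df)            # the actual number of rows

idx_wrong <- 1:n_wrong         # the loop would visit row 1 only
idx_right <- seq_len(n_right)  # one index per row
```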
You can also do the same using the tidyverse packages dplyr and purrr. purrr provides functions for iterating over multiple columns at the same time.
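library(dplyr)  # makes the %>% pipe available
proteins %>%
  dplyr::mutate(short_in_long = purrr::map2_lgl(short_name, long_name, function(x, y){
    grepl(x, y)  # x is the short name (the pattern), y is the long name
  }))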
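If you would rather not load any packages, the same pairwise idea can probably be sketched in base R with mapply(), which walks two vectors in parallel much like map2_lgl(). This is an alternative I am suggesting, not part of the approach above; the data frame is rebuilt from the question's sample rows so the snippet runs on its own.

```r
# Sample rows copied from the question (x and y columns omitted).
proteins <- data.frame(
  long_name  = c("Adhesion G protein-coupled receptor E2 (ADGRE2)",
                 "Adhesion G-protein coupled receptor G2 (ADGRG2)",
                 "ADM (ADM)",
                 "ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38)"),
  short_name = c("ADGRE2", "ADX2", "ADM", "ACH1"),
  stringsAsFactors = FALSE
)

# grepl() is not vectorised over its pattern argument, so call it once per row:
# the first vector supplies the pattern, the second the text to search.
proteins$short_in_long <- mapply(grepl,
                                 proteins$short_name,
                                 proteins$long_name,
                                 USE.NAMES = FALSE)
```

USE.NAMES = FALSE just stops mapply() from labelling the result with the short names, so the new column is a plain logical vector.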
Upvotes: 1
Reputation: 24089
An easy way of solving this is to use the ifelse function and str_detect from the stringr package.
proteins<-read.table(header = TRUE, stringsAsFactors = FALSE, text=
"long_name x y short_name
'Adhesion G protein-coupled receptor E2 (ADGRE2)' 10 10 ADGRE2
'Adhesion G-protein coupled receptor G2 (ADGRG2)' 12 12 ADX2
'ADM (ADM)' 13 13 ADM
'ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38)' 14 14 ACH1"
)
library(stringr)
proteins$presence<- ifelse( str_detect(proteins$long_name, proteins$short_name ) , "Present", "Not")
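One caveat worth checking against your real data: both grepl and str_detect treat the short name as a regular expression by default, so a name containing characters such as + or ( may not match itself literally. The sketch below uses a made-up name (not from the question's data) and base R's fixed = TRUE to illustrate the difference.

```r
# Hypothetical names for illustration only; "+" is a regex quantifier.
long_name  <- "Example receptor (AB+1)"
short_name <- "AB+1"

# As a regex, "AB+1" means: A, one or more B, then 1 -- which is not in the text.
as_regex <- grepl(short_name, long_name)

# With fixed = TRUE the pattern is matched character for character.
as_literal <- grepl(short_name, long_name, fixed = TRUE)
```

With stringr, the equivalent should be to wrap the pattern in fixed(), i.e. str_detect(proteins$long_name, fixed(proteins$short_name)).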
Upvotes: 1