Sabor117

Reputation: 135

Searching for part of string within another string in dataframe

I have a dataset which looks something like this:

long_name x y short_name
Adhesion G protein-coupled receptor E2 (ADGRE2) 10 10 ADGRE2
Adhesion G-protein coupled receptor G2 (ADGRG2) 12 12 ADX2
ADM (ADM) 13 13 ADM
ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38) 14 14 ACH1

What I want to do is create an additional column which will state whether or not the value of short_name is in the value of long_name to produce a TRUE/FALSE (or present/not) value in a new column.

I saw some advice on here about using the grepl function for looking for a bit of a string within another string. The issue I'm having is trying to iterate it over the whole file.

I have something like:

for (row in 1:length(nrows(combined_proteins))){

  long_name = proteins[1]
  short_name = proteins[4]

  if grepl(short_name, long_name) = TRUE{

   proteins$presence = "Present"

   else proteins$presence = "Not"
  }
}

But this obviously doesn't work and I'm not really sure whether this is even the smartest way to go about it. Any help appreciated.

Upvotes: 0

Views: 42

Answers (2)

Beemyfriend

Reputation: 271

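There are a couple of issues with your for loop. You want to iterate over either 1:nrow() or 1:length(), not both: for a data frame, length(nrow()) returns 1, because nrow() gives back a single number, so the loop body runs only once. The if condition needs parentheses, and a comparison uses == rather than =, so the general shape is if (condition) { ... } else { ... }. You also need to index by row inside the loop; otherwise proteins$presence = "Present" overwrites the whole column. If your data frame is named proteins, the following should work.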

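for (row in 1:nrow(proteins)) {

  # Optional: show which row is being processed
  print(proteins$long_name[row])

  # Pull out the two values to compare for this row
  long_name = proteins$long_name[row]
  short_name = proteins$short_name[row]

  # grepl() treats short_name as a regular expression
  if (grepl(short_name, long_name)) {
    proteins$presence[row] = "Present"
  } else {
    proteins$presence[row] = "Not"
  }
}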
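The two loop issues described above can be checked in isolation. This is only a small sketch: `df` is a hypothetical four-row data frame standing in for `proteins`, and the variable names are made up for illustration.

```r
# Hypothetical four-row data frame standing in for `proteins`
df <- data.frame(id = 1:4)

# nrow() returns one number, so its length is 1 and the loop would run once
loop_bad <- 1:length(nrow(df))

# Iterating over the row indices themselves covers every row
loop_good <- seq_len(nrow(df))

# grepl() already returns TRUE/FALSE, so it can go straight into if ()
# without comparing it to TRUE
hit <- grepl("ADM", "ADM (ADM)")
label <- if (hit) "Present" else "Not"
```

`seq_len(nrow(df))` behaves like `1:nrow(df)` for a non-empty data frame, and it also copes with zero rows, where `1:0` would count backwards.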

You can also do the same with the tidyverse packages dplyr and purrr. purrr::map2_lgl() walks through two columns in parallel, calls the function on each pair of values, and returns a logical vector.

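library(dplyr)  # provides %>% and mutate()

proteins <- proteins %>%
  mutate(short_in_long = purrr::map2_lgl(short_name, long_name, function(x, y) {
    grepl(x, y)  # x is the pattern (short_name), y is the string searched (long_name)
  }))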
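The pairwise iteration is needed because grepl() only uses the first element of its pattern argument (with a warning) when it is given several. If you would rather not load any packages, base R's mapply() can do the same pairing. A minimal sketch, assuming the data frame from the question has been rebuilt by hand with only the two columns that matter here:

```r
# Rebuild the two relevant columns from the question's example data
proteins <- data.frame(
  long_name = c("Adhesion G protein-coupled receptor E2 (ADGRE2)",
                "Adhesion G-protein coupled receptor G2 (ADGRG2)",
                "ADM (ADM)",
                "ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38)"),
  short_name = c("ADGRE2", "ADX2", "ADM", "ACH1"),
  stringsAsFactors = FALSE
)

# mapply() calls grepl(pattern, x) once per row, pairing the i-th short_name
# with the i-th long_name; USE.NAMES = FALSE keeps the result unnamed
proteins$short_in_long <- mapply(grepl,
                                 proteins$short_name,
                                 proteins$long_name,
                                 USE.NAMES = FALSE)
```

For this data the new column should come out TRUE for the ADGRE2 and ADM rows and FALSE for the other two.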

Upvotes: 1

Dave2e

Reputation: 24089

An easy way to solve this is to combine the ifelse function with str_detect from the stringr package. str_detect is vectorised over both the strings and the patterns, so no explicit loop is needed.

proteins<-read.table(header = TRUE, stringsAsFactors = FALSE, text=
"long_name x y short_name
'Adhesion G protein-coupled receptor E2 (ADGRE2)' 10 10 ADGRE2
'Adhesion G-protein coupled receptor G2 (ADGRG2)' 12 12 ADX2
'ADM (ADM)' 13 13 ADM
'ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (CD38)' 14 14 ACH1"
)

library(stringr)
proteins$presence <- ifelse(str_detect(proteins$long_name, proteins$short_name), "Present", "Not")
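One caveat worth knowing: str_detect (like grepl) treats the pattern as a regular expression by default, so a short name containing characters such as "." or "(" may match more than intended, or fail to parse. Wrapping the pattern in stringr's fixed() asks for a literal comparison instead. The sketch below uses made-up names that are not in the original data, purely to show the difference.

```r
library(stringr)

# Hypothetical names, not from the question's data; "." is a regex wildcard
long_name  <- c("Protein A21 (A21)", "ADM (ADM)")
short_name <- c("A.1", "ADM")

# Default: pattern is a regex, so "A.1" also matches "A21"
as_regex <- str_detect(long_name, short_name)

# fixed(): pattern is compared literally, so "A.1" no longer matches "A21"
literal <- str_detect(long_name, fixed(short_name))

presence <- ifelse(literal, "Present", "Not")
```

Whether literal matching is what you want depends on the short names in your real file; for the four example rows in the question both approaches should give the same answer.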

Upvotes: 1
