ML33M
ML33M

Reputation: 415

In R, trying to match and return PART of a string residing in a column for all the elements in a list and store the unique strings

I'm stuck in processing multiple sheets of an excel file in which I would like to access the character information in a particular column per worksheet and wish to sort out the unique ones and return them.

I have read in the excel file using read_excel function and sort each excel sheet into an element of a list. 15 sheets/elements in this list so far as tibbles. Each tibble varies in row numbers but they have the same column layouts. Luckily the column storing information of interest is called "well_sample". Each row entry in the "well_sample" is following an ID_Time_Condition_sample_dose format. For example: S1041120_T4-R190118_3S_(-)-IndolactamV_1uM_none.

I am interested in the S1041120 which is the ID (maybe later I will need the time T4-R190118). The trouble is there will be a few thousand rows in that column, perhaps 500 of them have the same ID, different treatments. The other few hundreds of them repeat the same pattern. Or they sometimes are the same ID, different time. So I want to isolate this column in each element of the list, split the strings, match and return only the ID. From there, I figured I could sort out the unique ID by using unique().

Here is what I have:

#build a mock data:

a<- c(rep("S1041120_T4-R190118_3S_ab_1uM_none", 500), 
      rep("S1067120_T4-R190118_3S_ab_1uM_none", 50),
      rep("S5667890_T2-R191218_3S_amnb_10uM_none", 2800))
b<- c(rep("S1023120_T0-R190118_3S_ab_1uM_none", 1000), 
      rep("S6787120_T1-R190118_3S_ab_1uM_none", 1000),
      rep("S237890_T2-R191218_3S_amnb_10uM_none", 100))

mysheets<- list(a,b)

#first set aside an empty list to store IDs
ID.cleaned <- list()

#try to use a loop to go through all elements in the list
for (i in 1:length(mysheets)) {
  #reminder the real list has my interest data in well_sample
  #mysheets[[i]]$well_sample should allow me to access that particular column
  #strsplit() to cut the string by "_"

  ID[[i]]<- strsplit(mysheets[[i]]$well_sample, "_", fixed = TRUE)
 
  #use lapply to apply the str_extract function
  #since the ID is always the first and the numbers after S is always between 4-10
  ID.cleaned[[i]]<-lapply(ID[[i]], function(x){str_extract(x,"^S\\d{4,10}")})
}

I tried to run the code outside the loop line by line to see what each code is doing.

 mysheets[[1]]$well_sample 

indeed gave me a character vector of [1:3350] from the 1st element in the list.

 ID <- strsplit(well_sample_column, "_")

then

  str_extract(ID[[1]],"^S\\d{4,10}")

Things are getting weird here, this gave me a list of [1:3350], I'm effectively blowing up my list. And the second line gave "[1] "S1041120" NA NA NA NA NA " instead of just the S1041120.

After running the codes, I got a ID.cleaned a list of 15 elements, corresponding to the 15 original excel sheets, but each element now contains a sub 3350 or so lists, 550 in a format to "S1041120" NA NA NA NA NA, other 2800 in "S5667890" NA NA NA NA NA

I feel im in a dead loop. How am I going to take the IDs out and sort the unique ones? A little frustrated.

Upvotes: 0

Views: 82

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389135

I think you don't need the lapply over ID's or even strsplit. You can use str_extract directly there.

You can also use lapply instead of for loop, something along the lines of :

lapply(mysheets, function(x) stringr::str_extract(x$well_sample,"^S\\d{4,10}"))

Upvotes: 1

Related Questions