Laura Chipman

Reputation: 45

R: Trimming a string in a data frame after a particular pattern

I am having trouble figuring out how to trim the end off of a string in a data frame.

I want to trim each name down to a "base" name: numbers and letters, a period, then a number. My goal is to trim everything in my data frame to this base name, then sum the values that share the same base. I was thinking it would be possible to trim, then merge and sum the values.

For example, from

    Gene_name   Values
    B0222.5     4
    B0222.6     16
    B0228.7.1   2
    B0228.7.2   12
    B0350.2h.1  30
    B0350.2h.2  2
    B0350.2i    15
    2RSSE.1a    3
    2RSSE.1b    10
    R02F11.11   4

to

    Gene_name   Values
    B0222.5     4
    B0222.6     16
    B0228.7     14
    B0350.2     47
    2RSSE.1     13
    R02F11.11   4

Thank you for any help!

Upvotes: 2

Views: 185

Answers (3)

kbrendle

Reputation: 1

You can also convert Gene_name to a factor and rename its levels.

    # coerce the column to a factor
    df$Gene_name <- as.factor(df$Gene_name)
    # view the levels
    levels(df$Gene_name)
    # rename the level B0228.7.1 to B0228.7
    levels(df$Gene_name)[levels(df$Gene_name) == "B0228.7.1"] <- "B0228.7"

Repeat this for each level that needs to change. Levels that end up with the same name are merged, so rows sharing a base name are treated as one category. The values are not summed automatically, though; you still need to aggregate by the factor, for example with tapply(df$Values, df$Gene_name, sum).
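
Renaming levels one at a time gets tedious with many genes. As a minimal sketch, assuming the data frame is called df (as in the other answers) and that every name starts with letters and digits, a period, then digits, the levels can be rewritten in one step with sub() and then summed with tapply():

```r
df <- data.frame(
  Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1",
                "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
  Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
  stringsAsFactors = FALSE
)

gene <- as.factor(df$Gene_name)

# Keep only the leading "alphanumerics, period, digits" part of each level.
# Assigning duplicate names to levels() merges those levels.
levels(gene) <- sub("^([[:alnum:]]+\\.[[:digit:]]+).*$", "\\1", levels(gene))

# Sum Values within each merged level (the result is a named vector)
sums <- tapply(df$Values, gene, sum)
sums
```

The factor is kept in a separate variable here so the original column stays untouched; that is only a choice for the example.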

Upvotes: 0

Lamia

Reputation: 3875

Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:

    library(dplyr)
    library(stringr)

    df2 <- df %>%
      mutate(Gene_name = str_extract(Gene_name, "[[:alnum:]]+\\.\\d+")) %>%
      group_by(Gene_name) %>%
      summarise(Values = sum(Values))

      Gene_name Values
          <chr>  <int>
    1   2RSSE.1     13
    2   B0222.5      4
    3   B0222.6     16
    4   B0228.7     14
    5   B0350.2     47
    6 R02F11.11      4
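
If you would rather not load dplyr and stringr, roughly the same result can be had in base R. This is a minimal sketch, assuming the data frame from the question is called df and that every name matches the pattern; the names m, out and df2 are just for illustration:

```r
df <- data.frame(
  Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1",
                "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
  Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
  stringsAsFactors = FALSE
)

# Same idea as the pattern above: alphanumerics, a literal period, then digits.
# regexpr() locates the first match in each string.
m <- regexpr("[[:alnum:]]+\\.[[:digit:]]+", df$Gene_name)

# Assumption: every name matches. regmatches() drops non-matching elements,
# so stop early rather than end up with a shorter vector.
stopifnot(all(m > 0))

out <- df
out$Gene_name <- regmatches(df$Gene_name, m)

# Sum Values for each base name
df2 <- aggregate(Values ~ Gene_name, data = out, FUN = sum)
df2
```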

Upvotes: 3

Damiano Fantini

Reputation: 1975

As someone has also suggested, I would extract the base gene names first, and then sum the rows of the original data.frame that share each one:

    df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1", "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
                     Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
                     stringsAsFactors = FALSE)

    # base name: alphanumerics, a period, then one or more digits
    pat <- "^[[:alnum:]]+\\.[[:digit:]]+"
    cap.pos <- regexpr(pat, df$Gene_name)
    # base name of every row, then the unique set of base names
    cap.all <- substr(df$Gene_name, cap.pos, cap.pos + attr(cap.pos, "match.length") - 1)
    cap.gene <- unique(cap.all)

    # compare base names exactly; a substring search with grepl() would
    # also pick up longer names that merely contain nm
    do.call(rbind, lapply(cap.gene, function(nm) {
      sumval <- sum(df$Values[cap.all == nm], na.rm = TRUE)
      data.frame(Gene_name = nm, Value = sumval)
    }))

The result matches your request:

      Gene_name Value
    1   B0222.5     4
    2   B0222.6    16
    3   B0228.7    14
    4   B0350.2    47
    5   2RSSE.1    13
    6 R02F11.11     4
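
One thing to watch when assigning rows to a base name: a substring search such as grepl(nm, x, fixed = TRUE) matches any name that merely contains nm, while comparing the extracted base names does not. A small sketch with hypothetical names (not from your data) shows the difference:

```r
# Hypothetical names, chosen so that one base name is a prefix of another
x <- c("2RSSE.1a", "2RSSE.11", "2RSSE.1b")

# Substring search: "2RSSE.11" contains "2RSSE.1", so it matches as well
sub_hits <- grepl("2RSSE.1", x, fixed = TRUE)

# Exact comparison on the extracted base names keeps the two genes apart
pos <- regexpr("^[[:alnum:]]+\\.[[:digit:]]+", x)
base <- substr(x, pos, pos + attr(pos, "match.length") - 1)
exact_hits <- base == "2RSSE.1"

sub_hits    # TRUE  TRUE TRUE
exact_hits  # TRUE FALSE TRUE
```

With the ten names in the question both approaches give the same sums, so this only matters if your full data set contains such overlapping names.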

Upvotes: 0
