PharmDataSci

Reputation: 117

Removing part of strings within a column

I have a column in a data frame containing a series of identifiers, each a letter followed by 8 digits, e.g. B15006788.

Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?

I know that if there was just one thing I wanted to remove, like the B15, I could do:

sub("B15", "", df$col)

But I’m not sure how to remove a set number of characters/numbers (or even all subsequent characters after B15).

Thanks in advance :)

Upvotes: 1

Views: 78

Answers (1)

Sahir Moosvi

Reputation: 598

Welcome to SO! This is a job for regular expressions (regex). You can use base R, as I show here, or look into the stringr package, which has handy tools that are easier to understand. You can also look up regex rules to help define exactly what you want to match. For what you ask, the following example should help:

testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")

gsub("B15.{2}", "", testStrings)

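gsub is the base R function that replaces every match of a pattern with something else, in a single string or in each element of a character vector (sub replaces only the first match in each element). To test the regex I created the testStrings vector with a few different examples.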

Breaking down the regex, "B15" is the literal text you're looking for. The "." matches any single character, and the "{2}" says how many of those characters to match after "B15" (here, exactly two). You can change the count as you need. If you want to remove everything after "B15", replace the pattern with "B15.*"; the "*" means zero or more of the preceding ".", so ".*" matches everything up to the end of the string.
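To make the difference concrete, here is a small sketch comparing a fixed-count pattern with a match-to-the-end pattern (".*", i.e. any character repeated zero or more times) on the same test vector. The variable names fixed_width and to_end are only illustrative.

```r
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")

# "B15" plus exactly two more characters is removed;
# strings with fewer than two characters after "B15" do not match and are left as-is
fixed_width <- gsub("B15.{2}", "", testStrings)

# "B15" and everything after it is removed
to_end <- gsub("B15.*", "", testStrings)

print(fixed_width)
print(to_end)
```

Note that the first two strings are unchanged by the "{2}" version, because there are not two characters after "B15" to match.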

edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern, like so: "^B15.{2}"
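Applied to the situation in the question, a minimal sketch might look like the following. The data frame and its values are made up for illustration (only the column name col and the B15006788 format come from the question), and it assumes you want the whole cell emptied whenever the identifier starts with "B15".

```r
# Hypothetical example data; real identifiers will differ
df <- data.frame(
  col = c("B15006788", "B16001234", "B15999999", "B17000001"),
  stringsAsFactors = FALSE
)

# "^" anchors the match to the start of the string, and ".*" consumes the rest,
# so any value beginning with "B15" becomes an empty string; other values are untouched
df$col <- sub("^B15.*", "", df$col)

print(df$col)
```

Since the whole string is matched at once, sub and gsub behave the same here. If you would rather have missing values than empty strings, you could assign NA to the matching rows instead.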

https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on the different regex patterns you can write to be more specific.

Upvotes: 1
