wertq
wertq

Reputation: 1

How to remove everything from a row except pattern

I have a dataframe that contains one column separated by ; like this

AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455

What I want is remove everything except the codes that starts with AB....

AB00001;AB00002
AB84375
AB84375;AB84375
AB001

I've tried to separate them with separate(), but I don´t know how to continue. Any suggestions?

Upvotes: 0

Views: 503

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269481

1) Base R Assuming DF shown reproducibly in the Note at the end we prefix each line with a semicolon and then use the gsub with the pattern shown and finally remove the semicolon we added. No packages are used.

transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))

giving:

               V1
1 AB00001;AB00002
2         AB84375
3 AB84375;AB84375
4           AB001

2) dplyr/tidyr This one is longer than the others in this answer but it is straight forward and has no complex regular expressions.

library(dplyr)
library(tidyr)

DF %>%
  mutate(id = 1:n()) %>%
  separate_rows(V1, sep = ";") %>%
  filter(substr(V1, 1, 2) == "AB") %>%
  group_by(id) %>%
  summarize(V1 = paste(V1, collapse = ";")) %>%
  ungroup %>%
  select(-id)

giving:

# A tibble: 4 x 1
  V1             
  <chr>          
1 AB00001;AB00002
2 AB84375        
3 AB84375;AB84375
4 AB001          

3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.

library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";", 
  gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))

giving:

               V1
1 AB00001;AB00002
2         AB84375
3 AB84375;AB84375
4           AB001

Note

Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

Upvotes: 2

Martin Gal
Martin Gal

Reputation: 16978

I thought of using stringr and Daniel O's data:

df %>%
  mutate(data = str_extract_all(data, "AB\\w+"))

which gives us

              data
1 AB00001, AB00002
2          AB84375
3 AB84375, AB84375
4            AB001

Upvotes: 2

Allan Cameron
Allan Cameron

Reputation: 173793

If your data frame is called df and your column is called V1, you could try:

sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375"         "AB84375;AB84375" "AB001" 

This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.

Upvotes: 4

Related Questions