Reputation: 1
I have a dataframe that contains one column separated by ;
like this
AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455
What I want is remove everything except the codes that starts with AB....
AB00001;AB00002
AB84375
AB84375;AB84375
AB001
I've tried to separate them with separate()
, but I don´t know how to continue. Any suggestions?
Upvotes: 0
Views: 503
Reputation: 269481
1) Base R Assuming DF
shown reproducibly in the Note at the end we prefix each line with a semicolon and then use the gsub
with the pattern shown and finally remove the semicolon we added. No packages are used.
transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
2) dplyr/tidyr This one is longer than the others in this answer but it is straight forward and has no complex regular expressions.
library(dplyr)
library(tidyr)
DF %>%
mutate(id = 1:n()) %>%
separate_rows(V1, sep = ";") %>%
filter(substr(V1, 1, 2) == "AB") %>%
group_by(id) %>%
summarize(V1 = paste(V1, collapse = ";")) %>%
ungroup %>%
select(-id)
giving:
# A tibble: 4 x 1
V1
<chr>
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.
library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";",
gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)
Upvotes: 2
Reputation: 16978
I thought of using stringr
and Daniel O's data:
df %>%
mutate(data = str_extract_all(data, "AB\\w+"))
which gives us
data
1 AB00001, AB00002
2 AB84375
3 AB84375, AB84375
4 AB001
Upvotes: 2
Reputation: 173793
If your data frame is called df
and your column is called V1
, you could try:
sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375" "AB84375;AB84375" "AB001"
This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.
Upvotes: 4