Sebastian Ettner

Reputation: 53

Delete duplicate elements in String in R

I've got some problems deleting duplicate elements in a string. My data look similar to this:


idvisit     path
1           1,16,23,59
2           2,14,14,19
3           5,19,23,19
4           10,10
5           23,23,27,29,23

I have one column containing a unique ID and one column containing a path of web page navigation. The path column contains some cases where a page was simply reloaded and was therefore tracked twice or more. The pages are separated by commas and the column is stored as a factor. I don't want the same page to appear several times in a row, so the data should look like this:


idvisit     path
1           1,16,23,59
2           2,14,19
3           5,19,23,19
4           10
5           23,27,29,23

Repeated pages that sit directly next to each other should be removed. I know how to delete one specific repeated number using regular expressions, but I have about 20,000 different pages and can't do this for each of them. Does anyone have a solution or a hint for my problem?

Thanks, Sebastian

Upvotes: 3

Views: 3137

Answers (2)

David Leal

Reputation: 6749

Using the stringr package and its function str_replace_all, I think you get what you want with the regular expression ([0-9]+),\\1 and the replacement \\1 (the \ special character has to be escaped inside an R string, hence the double backslash):

library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"

The function is vectorised, so you can apply it to a whole character vector: x <- c("5,19,23,19", "10,10", "2,14,14,19") then:

str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10"         "2,14,19"

or using sapply:

result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))

Then:

> result
  5,19,23,19        10,10   2,14,14,19 
"5,19,23,19"         "10"    "2,14,19" 

Notes:

The first line of the printed result shows the names that sapply attaches to the output (the original input strings):

> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"

If you don't want to see them (it does not affect the result), just do:

attributes(result) <- NULL

Then,

> result
[1] "5,19,23,19" "10"         "2,14,19"   
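As a side note, the names can also be dropped without resetting all attributes. The following is a minimal sketch that assumes base R only: gsub with perl = TRUE stands in for str_replace_all, and dedupe_pair is a hypothetical helper name introduced for illustration.

```r
# Same example vector as above
x <- c("5,19,23,19", "10,10", "2,14,14,19")

# Hypothetical helper: base R gsub used in place of stringr::str_replace_all
dedupe_pair <- function(s) gsub("([0-9]+),\\1", "\\1", s, perl = TRUE)

# sapply names its result after the input strings by default
named <- sapply(x, dedupe_pair)

# Option 1: strip only the names afterwards
plain_a <- unname(named)

# Option 2: ask sapply not to add names in the first place
plain_b <- sapply(x, dedupe_pair, USE.NAMES = FALSE)

print(plain_a)
print(identical(plain_a, plain_b))
```

Both options leave a plain character vector, which is usually what you want before assigning the result back to a data frame column.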

Explanation of the regular expression used: ([0-9]+),\\1

  1. ([0-9]+): opens capture group 1, delimited by (), which matches one or more digits.
  2. ,: then comes the delimiter, a comma (spaces could be allowed here too, but the original example only uses a comma).
  3. \\1: then comes a backreference to group 1, i.e. exactly the same digits again (the repeated number). If they are not there, the pattern does not match.

If the pattern matches, the whole match is replaced with the content of \\1, i.e. a single copy of the number.

How to handle a number that is repeated more than once, for example 2,14,14,14,19:

Just use this regular expression instead: ([0-9]+)(,\\1)+. It matches a number followed by one or more repetitions of the delimiter and that same number. You can try other possibilities on regex101.com (in my opinion it is more user-friendly than other online regular expression checkers).
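One caveat worth checking before running this on real data: without word boundaries the backreference can also match the first digits of a longer number, so a path such as 1,16,23,59 would lose its leading "1,". The sketch below is one possible way to guard against that, assuming base R's gsub with perl = TRUE in place of stringr; the \\b anchors and the helper name dedupe_runs are additions for illustration, not part of the pattern above.

```r
# Pattern from the answer, and a variant anchored on word boundaries (assumption)
unbounded <- "([0-9]+)(,\\1)+"
bounded   <- "\\b([0-9]+)(,\\1)+\\b"

# Hypothetical helper: collapse consecutive repeats of a whole number
dedupe_runs <- function(s, pattern = bounded) {
  gsub(pattern, "\\1", s, perl = TRUE)
}

paths <- c("1,16,23,59", "2,14,14,14,19", "10,10", "23,23,27,29,23")

# Without boundaries the first path is damaged: "1,1" inside "1,16" is matched
print(dedupe_runs(paths, unbounded))

# With boundaries only complete, repeated page numbers are collapsed
print(dedupe_runs(paths))
```

The same bounded pattern should also work as the pattern argument of str_replace_all, since stringr's regex engine supports \\b and backreferences.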

I hope this works for you. It is a flexible solution; you just need to adapt the pattern to your data.

Upvotes: 1

akrun

Reputation: 887028

We can use tidyverse. Use separate_rows to split the 'path' column at the delimiter (,), which converts the data to long format. Then, grouped by 'idvisit', we paste together the values of the run-length encoding (rle), which keeps one entry per run of identical consecutive pages.

library(tidyverse)
separate_rows(df1, path) %>%
       group_by(idvisit) %>%
       summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
#  idvisit        path
#    <int>       <chr>
#1       1  1,16,23,59
#2       2     2,14,19
#3       3  5,19,23,19
#4       4          10
#5       5 23,27,29,23
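To see why rle does the job, it can help to look at what it returns for a single path. This is a small illustrative sketch using the last row of the example data; the variable names pages, runs and collapsed are chosen here for illustration.

```r
# One visit's path, already split into individual pages
pages <- c("23", "23", "27", "29", "23")

# rle() compresses consecutive repeats into (value, length) pairs
runs <- rle(pages)

print(runs$values)   # one entry per run of identical neighbours
print(runs$lengths)  # how long each run was

# Pasting only the values drops the consecutive duplicates,
# while the later, non-adjacent "23" is kept
collapsed <- paste(runs$values, collapse = ",")
print(collapsed)
```

Because rle only merges neighbours, a page that is revisited later in the path stays in the result, which matches the desired output in the question.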

Or a base R option is

df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))

NOTE: If the 'path' column is of class factor (as in the question), convert it to character before passing it to strsplit, i.e. strsplit(as.character(df1$path), ",")
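Putting the base R option and the factor conversion together, a self-contained sketch could look like the following. The data frame is rebuilt from the example in the question; using vapply instead of sapply and fixed = TRUE in strsplit are choices made here, not part of the original answer.

```r
# Example data from the question, with 'path' stored as a factor
df1 <- data.frame(
  idvisit = 1:5,
  path = factor(c("1,16,23,59", "2,14,14,19", "5,19,23,19",
                  "10,10", "23,23,27,29,23"))
)

# Convert the factor to character, split each path at the commas,
# keep one value per run of identical neighbours, and paste back together
df1$path <- vapply(
  strsplit(as.character(df1$path), ",", fixed = TRUE),
  function(pages) paste(rle(pages)$values, collapse = ","),
  character(1)
)

print(df1)
```

vapply with character(1) guarantees a plain character vector of the same length as the input, so the assignment back into df1 cannot silently change shape.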

Upvotes: 5
