Reputation: 53
I've got some problems filtering for duplicate elements in a string. My data looks similar to this:
idvisit  path
1        1,16,23,59,16
2        2,14,19,14
3        5,19,23
4        10,21
5        23,27,29,23
I have a column containing a unique ID and a column containing the navigation path through web pages. The path column contains cases where a page was accessed twice or more, with some different pages between those accesses. I want to filter() the rows where a page occurs twice or more and at least one other page lies between the two accesses, so the data should look like this:
idvisit  path
1        1,16,23,59,16
2        2,14,19,14
5        23,27,29,23
I just want to keep only the rows that match these conditions. I really don't know how to handle a string with a variable number of different values.
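A reproducible version of the sample data (the answers below assume a data frame named df1):
df1 <- data.frame(
  idvisit = 1:5,
  path = c("1,16,23,59,16", "2,14,19,14", "5,19,23", "10,21", "23,27,29,23"),
  stringsAsFactors = FALSE
)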
Upvotes: 0
Views: 145
Reputation: 886938
We can try splitting each path and comparing the total number of elements with the number of distinct ones:
library(data.table)
# split each path into its individual page IDs
lst <- strsplit(df1$path, ",")
# keep rows where the total count differs from the distinct count, i.e. a page repeats
df1[lengths(lst) != sapply(lst, uniqueN),]
# idvisit path
#1 1 1,16,23,59,16
#2 2 2,14,19,14
#5 5 23,27,29,23
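The intermediate values make the comparison easy to see:
lengths(lst)         # 5 4 3 2 4  -- total pages per visit
sapply(lst, uniqueN) # 4 3 3 2 3  -- distinct pages per visit
Rows where the two counts differ are the ones with a repeated page.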
Or an option using the tidyverse:
library(tidyverse)
separate_rows(df1, path) %>%          # one row per visited page
  group_by(idvisit) %>%
  filter(n_distinct(path) != n()) %>% # keep visits containing a repeated page
  summarise(path = toString(path))    # collapse back to one row per visit
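Note that toString() joins with ", " (comma plus space), so the rebuilt paths gain a space after each comma. To reproduce the original formatting exactly, the last step could be:
summarise(path = paste(path, collapse = ","))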
Upvotes: 1
Reputation: 23101
You could try regular expressions too, with grepl. The word boundaries stop the backreference from matching just part of a number (e.g. '1' inside '13'), and the two required commas guarantee at least one other page between the repeated occurrences:
df1[grepl('\\b([0-9]+),.*,\\1\\b', as.character(df1$path)),]
# idvisit path
#1 1 1,16,23,59,16
#2 2 2,14,19,14
#5 5 23,27,29,23
Upvotes: 0
Reputation: 51582
You can filter based on the number of elements in each string: a path with duplicated entries has more elements in total than distinct ones, i.e.
df1[sapply(strsplit(as.character(df1$path), ','),
           function(i) length(unique(i)) != length(i)), ]
# idvisit path
#1 1 1,16,23,59,16
#2 2 2,14,19,14
#5 5 23,27,29,23
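A slightly shorter variant uses base R's anyDuplicated(), which returns the index of the first repeated element (0 if there is none):
df1[sapply(strsplit(as.character(df1$path), ','), anyDuplicated) > 0, ]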
Upvotes: 1