Reputation: 679
I have a really large dataset, and I am trying to check whether any list in a column contains the same string more than once. Please consider this data frame:
df1 <- structure(list(id = 1:3,
book_id = c("[\"19167120\",\"book\", \"237494310\",\"195166798\",\"book\",\"book.a\"]",
"[\"19167120\",\"237494310\",\"story\",\"book\",\"19167120\"]", "[]")),
.Names = c("id", "book_id"),
class = "data.frame",
row.names = c(NA, -3L))
which looks like this:
id book_id
1 1 ["19167120","book", "237494310","195166798","book","book.a"]
2 2 ["19167120","237494310","story","book","19167120"]
3 3 []
What I am trying to do is check whether any list value in book_id contains the same string more than once; for example, here the string value "19167120"
is duplicated in the second row.
I then want to extract the duplicates from each cell and then remove them from each cell.
Output: two separate data frames:
id book_id duplicate
1: 1 ["19167120", "book", "237494310", "195166798", "book", "book.a"] "book"
2: 2 ["19167120", "237494310", "story", "book", "19167120"] "19167120"
3: 3 [] 0
id book_id
1: 1 ["19167120", "book", "237494310", "195166798", "book.a"]
2: 2 ["19167120", "237494310", "story", "book"]
3: 3 []
I know I should use anyDuplicated()
and unique()
in order to get my answer, but I have experimented with them and couldn't resolve the problem.
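On a single cell, the split-then-check idea I'm after can be sketched in base R like this (variable names here are just for illustration):

```r
# One cell from the example; strip brackets/quotes, then split on commas
s <- '["19167120","237494310","story","book","19167120"]'
elems <- strsplit(gsub('[]["]', "", s), ",")[[1]]

dupes <- unique(elems[duplicated(elems)])  # values occurring more than once
keep  <- unique(elems)                     # each value kept once
dupes  # "19167120"
keep   # "19167120" "237494310" "story" "book"
```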
Edit: the first suggestion by Gregor produces the result below, but I would appreciate it if anyone could help me get the output I described above:
id book_id
1: 1 "19167120"
2: 1 "237494310"
3: 1 "195166798"
4: 2 "19167120"
5: 2 "237494310"
6: 2 "19167120"
> unique(df1)
id book_id
1: 1 "19167120"
2: 1 "237494310"
3: 1 "195166798"
4: 2 "19167120"
5: 2 "237494310"
> duplicated(df1)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
Upvotes: 1
Views: 354
Reputation: 99331
You could try stringi:
library(stringi)
library(data.table)
x <- stri_extract_all_regex(df1$book_id, "\\d+", omit_no_match = TRUE)
dupes <- vapply(x, function(a) {
paste(a[duplicated(a)], collapse = ", ")
}, "")
dt2 <- data.table(df1, dupes)
doBookId <- function(x) {
u <- stri_unique(x)
st <- if(length(u)) toString(shQuote(u, "cmd")) else character(1L)
stri_c("[", st, "]")
}
dt2[, book_id := vapply(x, doBookId, "")][]
# id book_id dupes
# 1: 1 ["19167120", "237494310", "195166798"]
# 2: 2 ["19167120", "237494310"] 19167120
# 3: 3 []
Update: Given your edits, I think we only need a few adjustments. The main one is to switch from stri_extract_all_regex
to stri_extract_all_words:
x <- stri_extract_all_words(df1$book_id)
dupes <- vapply(x, function(a) toString(a[duplicated(a)]), "")
dt2 <- data.table(df1, dupes)
dt2
# id book_id dupes
# 1: 1 ["19167120","book", "237494310","195166798","book","book.a"] book
# 2: 2 ["19167120","237494310","story","book","19167120"] 19167120
# 3: 3 []
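The switch matters because "\\d+" extracts only digit runs and silently drops entries such as "book". A quick base-R illustration of the difference (regmatches/gregexpr here stand in for the stringi calls, and "[[:alnum:].]+" is only a rough approximation of ICU word extraction):

```r
s <- '["19167120","book", "237494310","195166798","book","book.a"]'
regmatches(s, gregexpr("\\d+", s))[[1]]           # digit runs only: 3 values
regmatches(s, gregexpr("[[:alnum:].]+", s))[[1]]  # words too:       6 values
```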
doBookId <- function(x) {
st <- if(length(x) >= 1L && !is.na(x))
toString(shQuote(unique(x), "cmd"))
else ""
stri_c("[", st, "]")
}
dt3 <- dt2[, dupes := NULL]
dt3[, book_id := vapply(x, doBookId, "")][]
# id book_id
# 1: 1 ["19167120", "book", "237494310", "195166798", "book.a"]
# 2: 2 ["19167120", "237494310", "story", "book"]
# 3: 3 []
Upvotes: 3
Reputation: 193517
Here's an alternative approach, related to the idea of starting with a "long" dataset and proceeding from there.
This is your long dataset.
library(splitstackshape)
x <- cSplit(df1, "book_id", ",", "long")[
  , book_id := gsub("[][]", "", book_id)]
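If splitstackshape is unavailable, roughly the same long shape can be built in base R (a sketch only; unlike cSplit, it keeps the surrounding quotes and simply drops the empty row 3 instead of carrying it as NA):

```r
# df1 as defined in the question
df1 <- data.frame(
  id = 1:3,
  book_id = c('["19167120","book", "237494310","195166798","book","book.a"]',
              '["19167120","237494310","story","book","19167120"]',
              '[]'),
  stringsAsFactors = FALSE
)

parts <- strsplit(gsub("[][]", "", df1$book_id), ",")             # drop brackets, split
parts <- lapply(parts, function(p) trimws(p[nzchar(trimws(p))]))  # trim, drop empties
long  <- data.frame(id      = rep(df1$id, lengths(parts)),
                    book_id = unlist(parts),
                    stringsAsFactors = FALSE)
```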
Here, we add a "duped" column with the duplicated values:
x[, duped := paste(unique(book_id[duplicated(book_id)]),
                   collapse = ", "), by = id]
Now we can easily create your first desired output:
dupedX <- x[, list(book_id = sprintf("[%s]", paste(book_id, collapse = ", ")),
duped = paste(unique(duped), collapse = ", ")), by = id]
dupedX
# id book_id duped
# 1: 1 ["19167120", "237494310", "195166798"] NA
# 2: 2 ["19167120", "237494310", "19167120"] "19167120"
# 3: 3 [] NA
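The sprintf("[%s]", paste(..., collapse = ", ")) idiom used to rebuild the bracketed strings works on any character vector, for example:

```r
vals <- c("19167120", "237494310", "195166798")
sprintf("[%s]", paste(vals, collapse = ", "))
# [1] "[19167120, 237494310, 195166798]"
```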
And your second:
uniqueX <- x[, list(book_id = sprintf(
"[%s]", paste(unique(book_id), collapse = ", "))), by = id]
uniqueX
# id book_id
# 1: 1 ["19167120", "237494310", "195166798"]
# 2: 2 ["19167120", "237494310"]
# 3: 3 []
Upvotes: 3