Reputation: 17
I have a data.table with three columns, one of which holds a key:value list of varying length. I want to reshape the table so that each row holds a single key, filtered on the value.
For example, suppose I want all rows whose value is <= 2, with each key on its own row:
input_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}"),
c=c(1))
The desired output table is then:
tbl_output <- data.table::data.table(a=c("AA",
"AA"),b=c("ha:llo","wor:ld"), c=c(1,1), s=c(1,2))
I tried the following function:
data_table_clean <- function(dt){
dt[ ,"b" := data.table::tstrsplit(b, ',', fixed = T),by=c(a, c)]
dt[,c('b', 's'):= data.table::tstrsplit(b, ':', fixed=TRUE)]
return(dt[s <=2,])
}
This produces the following error:
"Error in eval(expr, envir, enclos) : object 'a' not found"
Any suggestions are welcome, of course.
The keys actually have the following form:
input2_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":3}"),
c=c(1))
and accordingly the output table should be:
tbl2_output <- data.table::data.table(a=c("AA",
"AA"),b=c("99:1d:3u:7y:89:67","99:1D:34:YY:T6:Y6"),
c=c(1,1), s=c(1,2))
Thank you!
data_table_clean <- function(dt){
res <- dt[, data.table::tstrsplit(unlist(strsplit(gsub('[{}"]', '', b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE),
by = .(a, c)][V2 > -100]
data.table::setnames(res, 3:4, c("b", "s"))
res
}
When I run this, I get the following error:
Error in .subset(x, j) : invalid subscript type 'list'
Upvotes: 1
Views: 1137
Reputation: 193497
Since it seems like you are working with a JSON object, why not use something that parses the JSON, for example, the "jsonlite" package?
With that, you can write a simple function like this:
myFun <- function(invec) {
require(jsonlite)
x <- fromJSON(invec)
list(b = names(x), s = unlist(x))
}
Now, applied to your dataset, you would get:
input_tbl[, myFun(b), by = .(a, c)]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
# 3: AA 1 doog:bye 3
And, for the subsetting:
input_tbl[, myFun(b), by = .(a, c)][s <= 2]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
You could probably also rewrite the myFun function to add a "threshold" argument that lets you subset within the function itself.
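As a rough sketch of that idea: the name myFun2 and the threshold argument below are made up for illustration and are not part of the answer above. Like myFun, it assumes each (a, c) group contains a single row, so that fromJSON receives one JSON string at a time.

```r
library(data.table)
library(jsonlite)

# Hypothetical variant of myFun with a `threshold` argument
myFun2 <- function(invec, threshold = Inf) {
  x <- unlist(fromJSON(invec))  # named vector: key -> value
  keep <- x <= threshold        # keep only values at or below the threshold
  list(b = names(x)[keep], s = unname(x[keep]))
}

input_tbl <- data.table(
  a = "AA",
  b = "{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}",
  c = 1
)

res <- input_tbl[, myFun2(b, threshold = 2), by = .(a, c)]
res
```

With the default threshold = Inf, nothing is filtered out, so the function behaves like the original myFun.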
Upvotes: 1
Reputation: 886938
One option is to extract the characters that we need in the final output. We use str_extract_all to do that after grouping by 'a' and 'c'. The output is a list, which we unlist; we then put the non-numeric and numeric elements into two columns and subset the rows with the condition s < 3.
library(stringr)
library(data.table)
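input_tbl[, {
# pull out each key (letters:letters) and each value (digits), in order
tmp <- unlist(str_extract_all(b, "[A-Za-z]+:[A-Za-z]+|\\d+"))
# odd positions are keys, even positions are values; convert values to numeric
list(b=tmp[c(TRUE, FALSE)], s=as.numeric(tmp[c(FALSE, TRUE)]))
}, by = .(a,c)][s<3]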
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Or, if we are using strsplit/tstrsplit: grouped by 'a' and 'c', we remove the curly brackets and quotes ([{}"]) with gsub, split on , (strsplit), unlist the output, and then use tstrsplit to split on a : that is followed by a digit. The subsetting is similar to the above.
# type.convert=TRUE makes V2 numeric, so V2<3 is a numeric comparison
res <- input_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '', b), ',', fixed=TRUE)),
":(?=\\d)", perl=TRUE, type.convert=TRUE), .(a,c)][V2<3]
setnames(res, 3:4, c("b", "s"))
res
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
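To see what the lookahead split does on its own, here is a minimal base R sketch. The vector x is a made-up stand-in for the pieces that remain after the gsub and the comma split; it is not taken from the answer.

```r
# Hypothetical stand-in for the pieces left after removing {}" and splitting on ","
x <- c("ha:llo:1", "wor:ld:2", "doog:bye:3")

# ":(?=\\d)" matches a colon only when the next character is a digit,
# so the colon inside each key is left alone
parts <- strsplit(x, ":(?=\\d)", perl = TRUE)

# first piece of each split is the key, second piece is the value
keys <- vapply(parts, `[`, character(1), 1)
vals <- as.numeric(vapply(parts, `[`, character(1), 2))
```

This only works while the keys themselves never contain a colon followed by a digit.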
For the updated dataset, we can do the tstrsplit on the last delimiter (:) only:
# type.convert=TRUE makes V2 numeric, so V2 < 3 is a numeric comparison
res1 <- input2_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '', b), ',', fixed=TRUE)),
":(?=[^:]+$)", perl=TRUE, type.convert=TRUE),
by = .(a, c)][V2 < 3]
setnames(res1, 3:4, c("b", "s"))
res1
# a c b s
# 1: AA 1 99:1d:3u:7y:89:67 1
# 2: AA 1 99:1D:34:YY:T6:Y6 2
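The difference between the two patterns can be checked on a single string. The following base R sketch uses x2, a made-up example built from one of the keys above with its value appended; it is only meant to illustrate why the second pattern is needed for keys of this form.

```r
# Hypothetical example: one key from input2_tbl followed by ":" and its value
x2 <- "99:1d:3u:7y:89:67:1"

# Splitting on every colon that precedes a digit also cuts inside the key
by_digit <- strsplit(x2, ":(?=\\d)", perl = TRUE)[[1]]

# ":(?=[^:]+$)" matches a colon only when no further colon follows it,
# i.e. the last colon in the string, so the key stays intact
by_last <- strsplit(x2, ":(?=[^:]+$)", perl = TRUE)[[1]]

length(by_digit)  # more than two pieces
by_last           # key and value
```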
Upvotes: 2