Reputation: 493
I have two dataframes and want to map both and give a binary value 1 if present else 0.
1st DF
id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
111.txt NA NA NA NA NA NA NA NA NA NA NA
112.txt NA NA NA NA NA NA NA NA NA NA NA
113.txt NA NA NA NA NA NA NA NA NA NA NA
114.txt NA NA NA NA NA NA NA NA NA NA NA
115.txt NA NA NA NA NA NA NA NA NA NA NA
116.txt NA NA NA NA NA NA NA NA NA NA NA
2nd DF
id cats
111.cats 1,7,1
112.cats 1,1,2|1,3,2
113.cats 1,10,1|1,6,2
114.cats 1,4,2
115.cats 1,5,1
116.cats 1,1,2|1,8,1
In 2nd DF$cats
first row have 1,7,1
where 1
and 7
combines and makes 1_7
column and on this column binary value 1 is placed and on remaining column 0
is placed and last 1
number goes to the 1_freq
column,
and if any row has more than 1 category like this 1,10,1|1,6,2
where 1,10,1
goes to 1_10
column and 1,6,2
goes to 1_6
column and the frequency of both category sums up and goes to 1_freq
column.
DF Should look like this
id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
111.txt 0 0 0 0 0 0 1 0 0 0 1
112.txt 1 0 1 0 0 0 0 0 0 0 4
113.txt 0 0 0 0 0 1 0 0 0 1 3
114.txt 0 0 0 1 0 0 0 0 0 0 2
115.txt 0 0 0 0 1 0 0 0 0 0 1
116.txt 1 0 0 0 0 0 0 1 0 0 3
Hope question is clear. Thankyou
Upvotes: 1
Views: 86
Reputation: 42544
Although the question is tagged dplyr, I was curious how a data.table answer would look like.
As df1
is filled with NA
except for the id
column and the id
columns only differ in the trailing part (txt
vs cats
) the answer below suggests to create df1
completely from the data contained in df2
:
library(data.table)
library(magrittr)
long <- setDT(df2)[, strsplit(cats, "[|]"), by = id][
, c(.(id = id), tstrsplit(V1, ","))][
, V3 := factor(V3, levels = 1:10)]
df1 <- dcast(long, id ~ V3, function(x) pmax(1, length(x)),
value.var = "V3", drop = FALSE, fill = 0)[
long[, sum(as.integer(V4)), by = id], on = "id", freq := V1][
, id := stringr::str_replace(id, "cats$", "txt")][
, setnames(.SD, names(.SD)[-1], paste0("1_", names(.SD)[-1]))]
df1
id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq 1: 111.txt 0 0 0 0 0 0 1 0 0 0 1 2: 112.txt 1 0 1 0 0 0 0 0 0 0 4 3: 113.txt 0 0 0 0 0 1 0 0 0 1 3 4: 114.txt 0 0 0 1 0 0 0 0 0 0 2 5: 115.txt 0 0 0 0 1 0 0 0 0 0 1 6: 116.txt 1 0 0 0 0 0 0 1 0 0 3
After coersion to data.table, df2
is reshaped from a "stringified" wide format into long
form by splitting the cats
column at "|" first and then by splitting the comma-separated parts into separate columns V2
to V4
.
Then V3
is turned from character to factor to preserve the order of columns when dcast()
is called to reshape from long to wide format again. As the OP has requested to show a 1
if at least one combination exists, the custom function definition function(x) pmax(1, length(x))
has to be used here instead of simply length
. In an update join, the sum of frequencies is appended as column freq
. Finally, "cats" is replaced by "txt" in the id
column and the column names (except the id
column) are prefixed by "1_".
df2 <- data.table::fread("id cats
111.cats 1,7,1
112.cats 1,1,2|1,3,2
113.cats 1,10,1|1,6,2
114.cats 1,4,2
115.cats 1,5,1
116.cats 1,1,2|1,8,1", data.table = FALSE)
Upvotes: 1
Reputation: 887138
Here is an option using tidyverse
. We expand the rows of the dataset by splitting at |
of the 'cats' column, then separate
the 'cats' into two column by splitting at the last ,
, grouped by 'id', get the sum
of the 'freq' column, extract the numbers at the end of the 'cats', convert it to factor
with levels
specified, create a column of 1s ('val'), spread
it to 'wide' format
library(tidyverse)
o1 <- df2 %>%
separate_rows(cats, sep = "[|]") %>%
separate(cats, into = c('cats', 'freq'),
sep=",(?=[^,]+$)", convert = TRUE) %>%
group_by(id) %>%
mutate(freq = sum(freq),
cats = factor(str_extract(cats, "\\d+$"), levels = 1:10),
val = 1) %>%
spread(cats, val, fill = 0) %>%
rename_at(-1, ~ paste0('1_', .))
Now, we assign values for the columns that are common with the initial dataset ('df1')
df1[is.na(df1)] <- 0
df1[names(o1)[-1]] <- o1[-1]
df1
# id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
#1 111.txt 0 0 0 0 0 0 1 0 0 0 1
#2 112.txt 1 0 1 0 0 0 0 0 0 0 4
#3 113.txt 0 0 0 0 0 1 0 0 0 1 3
#4 114.txt 0 0 0 1 0 0 0 0 0 0 2
#5 115.txt 0 0 0 0 1 0 0 0 0 0 1
#6 116.txt 1 0 0 0 0 0 0 1 0 0 3
df1 <- structure(list(id = c("111.txt", "112.txt", "113.txt", "114.txt",
"115.txt", "116.txt"), `1_1` = c(NA, NA, NA, NA, NA, NA), `1_2` = c(NA,
NA, NA, NA, NA, NA), `1_3` = c(NA, NA, NA, NA, NA, NA), `1_4` = c(NA,
NA, NA, NA, NA, NA), `1_5` = c(NA, NA, NA, NA, NA, NA), `1_6` = c(NA,
NA, NA, NA, NA, NA), `1_7` = c(NA, NA, NA, NA, NA, NA), `1_8` = c(NA,
NA, NA, NA, NA, NA), `1_9` = c(NA, NA, NA, NA, NA, NA), `1_10` = c(NA,
NA, NA, NA, NA, NA), `1_freq` = c(NA, NA, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(id = c("111.cats", "112.cats", "113.cats", "114.cats",
"115.cats", "116.cats"), cats = c("1,7,1", "1,1,2|1,3,2", "1,10,1|1,6,2",
"1,4,2", "1,5,1", "1,1,2|1,8,1")), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 2