james joyce
james joyce

Reputation: 493

Map one Dataframe to a second Dataframe

I have two dataframes and want to map both and give a binary value 1 if present else 0.

1st DF

id       1_1   1_2   1_3   1_4   1_5   1_6   1_7   1_8   1_9   1_10  1_freq
111.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
112.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
113.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
114.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
115.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
116.txt  NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

2nd DF

id                 cats
111.cats           1,7,1
112.cats           1,1,2|1,3,2
113.cats           1,10,1|1,6,2
114.cats           1,4,2
115.cats           1,5,1
116.cats           1,1,2|1,8,1

In 2nd DF$cats first row have 1,7,1 where 1 and 7 combines and makes 1_7 column and on this column binary value 1 is placed and on remaining column 0 is placed and last 1 number goes to the 1_freq column, and if any row has more than 1 category like this 1,10,1|1,6,2 where 1,10,1 goes to 1_10 column and 1,6,2 goes to 1_6 column and the frequency of both category sums up and goes to 1_freq column.

DF Should look like this

id       1_1   1_2   1_3   1_4   1_5   1_6   1_7   1_8   1_9   1_10  1_freq
111.txt  0     0     0     0     0     0     1     0     0     0     1
112.txt  1     0     1     0     0     0     0     0     0     0     4
113.txt  0     0     0     0     0     1     0     0     0     1     3
114.txt  0     0     0     1     0     0     0     0     0     0     2
115.txt  0     0     0     0     1     0     0     0     0     0     1
116.txt  1     0     0     0     0     0     0     1     0     0     3

Hope question is clear. Thankyou

Upvotes: 1

Views: 86

Answers (2)

Uwe
Uwe

Reputation: 42544

Although the question is tagged , I was curious how a answer would look like.

As df1 is filled with NA except for the id column and the id columns only differ in the trailing part (txt vs cats) the answer below suggests to create df1 completely from the data contained in df2:

library(data.table)
library(magrittr)
long <- setDT(df2)[, strsplit(cats, "[|]"), by = id][
  , c(.(id = id), tstrsplit(V1, ","))][
    , V3 := factor(V3, levels = 1:10)]
df1 <- dcast(long, id ~ V3, function(x) pmax(1, length(x)), 
             value.var = "V3", drop = FALSE, fill = 0)[
               long[, sum(as.integer(V4)), by = id], on = "id", freq := V1][
                 , id := stringr::str_replace(id, "cats$", "txt")][
                   , setnames(.SD, names(.SD)[-1], paste0("1_", names(.SD)[-1]))]
df1
        id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
1: 111.txt   0   0   0   0   0   0   1   0   0    0      1
2: 112.txt   1   0   1   0   0   0   0   0   0    0      4
3: 113.txt   0   0   0   0   0   1   0   0   0    1      3
4: 114.txt   0   0   0   1   0   0   0   0   0    0      2
5: 115.txt   0   0   0   0   1   0   0   0   0    0      1
6: 116.txt   1   0   0   0   0   0   0   1   0    0      3

Explanation

After coersion to data.table, df2 is reshaped from a "stringified" wide format into long form by splitting the cats column at "|" first and then by splitting the comma-separated parts into separate columns V2 to V4.

Then V3 is turned from character to factor to preserve the order of columns when dcast() is called to reshape from long to wide format again. As the OP has requested to show a 1 if at least one combination exists, the custom function definition function(x) pmax(1, length(x)) has to be used here instead of simply length. In an update join, the sum of frequencies is appended as column freq. Finally, "cats" is replaced by "txt" in the id column and the column names (except the id column) are prefixed by "1_".

Data

df2 <- data.table::fread("id                 cats
111.cats           1,7,1
112.cats           1,1,2|1,3,2
113.cats           1,10,1|1,6,2
114.cats           1,4,2
115.cats           1,5,1
116.cats           1,1,2|1,8,1", data.table = FALSE)

Upvotes: 1

akrun
akrun

Reputation: 887138

Here is an option using tidyverse. We expand the rows of the dataset by splitting at | of the 'cats' column, then separate the 'cats' into two column by splitting at the last , , grouped by 'id', get the sum of the 'freq' column, extract the numbers at the end of the 'cats', convert it to factor with levels specified, create a column of 1s ('val'), spread it to 'wide' format

library(tidyverse)
o1 <- df2 %>% 
       separate_rows(cats, sep = "[|]") %>% 
       separate(cats, into = c('cats', 'freq'), 
           sep=",(?=[^,]+$)", convert = TRUE) %>%
       group_by(id) %>%
       mutate(freq = sum(freq), 
              cats = factor(str_extract(cats, "\\d+$"), levels = 1:10), 
              val = 1)  %>% 
       spread(cats, val, fill = 0) %>% 
       rename_at(-1, ~ paste0('1_', .))

Now, we assign values for the columns that are common with the initial dataset ('df1')

df1[is.na(df1)] <- 0
df1[names(o1)[-1]] <- o1[-1]
df1
#       id 1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10 1_freq
#1 111.txt   0   0   0   0   0   0   1   0   0    0      1
#2 112.txt   1   0   1   0   0   0   0   0   0    0      4
#3 113.txt   0   0   0   0   0   1   0   0   0    1      3
#4 114.txt   0   0   0   1   0   0   0   0   0    0      2
#5 115.txt   0   0   0   0   1   0   0   0   0    0      1
#6 116.txt   1   0   0   0   0   0   0   1   0    0      3

data

df1 <- structure(list(id = c("111.txt", "112.txt", "113.txt", "114.txt", 
"115.txt", "116.txt"), `1_1` = c(NA, NA, NA, NA, NA, NA), `1_2` = c(NA, 
NA, NA, NA, NA, NA), `1_3` = c(NA, NA, NA, NA, NA, NA), `1_4` = c(NA, 
NA, NA, NA, NA, NA), `1_5` = c(NA, NA, NA, NA, NA, NA), `1_6` = c(NA, 
NA, NA, NA, NA, NA), `1_7` = c(NA, NA, NA, NA, NA, NA), `1_8` = c(NA, 
NA, NA, NA, NA, NA), `1_9` = c(NA, NA, NA, NA, NA, NA), `1_10` = c(NA, 
NA, NA, NA, NA, NA), `1_freq` = c(NA, NA, NA, NA, NA, NA)),
    class = "data.frame", row.names = c(NA, 
-6L))

df2 <- structure(list(id = c("111.cats", "112.cats", "113.cats", "114.cats", 
"115.cats", "116.cats"), cats = c("1,7,1", "1,1,2|1,3,2", "1,10,1|1,6,2", 
"1,4,2", "1,5,1", "1,1,2|1,8,1")), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 2

Related Questions