Jen
Jen

Reputation: 35

Group and index columns when at least one object in vector matches with at least one object in other vector r dplyr

I am trying to find a way to group rows and assign an index to each group without using a loop. The difficulty is that the grouping variable num has no unique identifier; num is a vector of numbers (defined as character). I want to group all rows that match at least by any of the numbers in the vector. The vectors are of different lengths and include up to 20 numbers. Let me give an example:

I have a dataframe like this:

df <- data.frame(id = c(1:5), num = c('111;222', '333;111;444', '000;88888;1', '9999;111', '1'))

I split num by the separator ; such that every row becomes a vector:

library(dplyr)
df <- df %>%
  mutate(num = str_split(num, ';'))

I want to index all rows for which at least one number in the vector matches with the same index. The result should look as follows:

id     num               group_index
1  c('111','222')             1
2  c('333','111','444')       1
3  c('000','88888','1')       2
4  c('9999','444')            1
5  '1'                        2

The example illustrates another difficulty: Group 1 is identified by '111' as well as '444', even though row 1 is part of group 1 and does not contain '444' in num.

If num was just a string, I would do the following

df <- group_by(num) %>%
  mutate(group_index = group_indices(.,num))

Now, I thought I should perhaps start by identifying groups. A first approach, which does not work is:

df <- df %>%
 group_by_if(num, any(num, str_c(num, collapse = '|')) == T)

I know that I could start by writing a loop. However, R is not very efficient with loops, so I would prefer a solution without a loop – if there is any? Any hints would help!

Upvotes: 2

Views: 176

Answers (1)

Wimpel
Wimpel

Reputation: 27732

Ok, this answer can probably be shortened (probably a lot), but I think the use of igraph keeps every nice and visible for visual inspection of the number of groups.

library( data.table )
library( igraph )
#make df a data.table
setDT(df)
#split num-column to v1, v2, ... ,vn
df[, paste0("v", 1:length( tstrsplit(df$num, ";"))) := tstrsplit( num, ";")]
#    id         num   v1    v2   v3
# 1:  1     111;222  111   222 <NA>
# 2:  2 333;111;444  333   111  444
# 3:  3 000;88888;1  000 88888    1
# 4:  4    9999;111 9999   111 <NA>
# 5:  5           1    1  <NA> <NA>

#now melt to long format
df.melt <- melt(df, id.vars = "id", measure.vars = patterns("^v[0-9]"), value.name = "from" )
#create links
df.melt[, to := shift( from, type = "lead"), by = .(id)][]
#drop inomplete rows
df.melt <- df.melt[ complete.cases(df.melt), ]
#    id variable  from    to
# 1:  1       v1   111   222
# 2:  2       v1   333   111
# 3:  3       v1   000 88888
# 4:  4       v1  9999   111
# 5:  2       v2   111   444
# 6:  3       v2 88888     1

g = graph_from_data_frame( df.melt[ , .(from, to)])
# plot(g)

enter image description here

looks like we have two separate groups to work with. let's find out which node (number) belongs to which group, and use this info on the original df

dt.lookup <- as.data.table( components(g)$membership, keep.rownames = TRUE )
#       V1 V2
# 1:   111  1
# 2:   333  1
# 3:   000  2
# 4:  9999  1
# 5: 88888  2
# 6:   222  1
# 7:   444  1
# 8:     1  2


#go back to the molten data of the original df
df.melt <- melt(df, id.vars = "id", measure.vars = patterns("^v[0-9]"))
df.melt <- df.melt[ complete.cases(df.melt), ]
#perform update join to get the groupnumber
df.melt[ dt.lookup, group := i.V2, on = .(value = V1) ]
#    id variable value group
# 1:  1       v1   111     1
# 2:  2       v1   333     1
# 3:  3       v1   000     2
# 4:  4       v1  9999     1
# 5:  5       v1     1     2
# 6:  1       v2   222     1
# 7:  2       v2   111     1
# 8:  3       v2 88888     2
# 9:  4       v2   111     1
# 10: 2       v3   444     1
# 11: 3       v3     1     2

#summarise to go back to oroiginal df form
df.melt[, .(num = paste0( value, collapse = ";"),
            group = paste0( unique(group), collapse = ",")),
        by = .(id) ][]

final output

#    id         num group
# 1:  1     111;222     1
# 2:  2 333;111;444     1
# 3:  3 000;88888;1     2
# 4:  4    9999;111     1
# 5:  5           1     2

Upvotes: 1

Related Questions