Reputation: 35
I am trying to find a way to group rows and assign an index to each group without using a loop. The difficulty is that the grouping variable num has no unique identifier; num is a vector of numbers (defined as character). I want to group all rows that match at least by any of the numbers in the vector. The vectors are of different lengths and include up to 20 numbers. Let me give an example:
I have a dataframe like this:
df <- data.frame(id = c(1:5), num = c('111;222', '333;111;444', '000;88888;1', '9999;111', '1'))
I split num by the separator ; such that every row becomes a vector:
library(dplyr)
df <- df %>%
mutate(num = str_split(num, ';'))
I want to index all rows for which at least one number in the vector matches with the same index. The result should look as follows:
id num group_index
1 c('111','222') 1
2 c('333','111','444') 1
3 c('000','88888','1') 2
4 c('9999','444') 1
5 '1' 2
The example illustrates another difficulty: Group 1 is identified by '111' as well as '444', even though row 1 is part of group 1 and does not contain '444' in num.
If num was just a string, I would do the following
df <- group_by(num) %>%
mutate(group_index = group_indices(.,num))
Now, I thought I should perhaps start by identifying groups. A first approach, which does not work is:
df <- df %>%
group_by_if(num, any(num, str_c(num, collapse = '|')) == T)
I know that I could start by writing a loop. However, R is not very efficient with loops, so I would prefer a solution without a loop – if there is any? Any hints would help!
Upvotes: 2
Views: 176
Reputation: 27732
Ok, this answer can probably be shortened (probably a lot), but I think the use of igraph
keeps every nice and visible for visual inspection of the number of groups.
library( data.table )
library( igraph )
#make df a data.table
setDT(df)
#split num-column to v1, v2, ... ,vn
df[, paste0("v", 1:length( tstrsplit(df$num, ";"))) := tstrsplit( num, ";")]
# id num v1 v2 v3
# 1: 1 111;222 111 222 <NA>
# 2: 2 333;111;444 333 111 444
# 3: 3 000;88888;1 000 88888 1
# 4: 4 9999;111 9999 111 <NA>
# 5: 5 1 1 <NA> <NA>
#now melt to long format
df.melt <- melt(df, id.vars = "id", measure.vars = patterns("^v[0-9]"), value.name = "from" )
#create links
df.melt[, to := shift( from, type = "lead"), by = .(id)][]
#drop inomplete rows
df.melt <- df.melt[ complete.cases(df.melt), ]
# id variable from to
# 1: 1 v1 111 222
# 2: 2 v1 333 111
# 3: 3 v1 000 88888
# 4: 4 v1 9999 111
# 5: 2 v2 111 444
# 6: 3 v2 88888 1
g = graph_from_data_frame( df.melt[ , .(from, to)])
# plot(g)
looks like we have two separate groups to work with. let's find out which node (number) belongs to which group, and use this info on the original df
dt.lookup <- as.data.table( components(g)$membership, keep.rownames = TRUE )
# V1 V2
# 1: 111 1
# 2: 333 1
# 3: 000 2
# 4: 9999 1
# 5: 88888 2
# 6: 222 1
# 7: 444 1
# 8: 1 2
#go back to the molten data of the original df
df.melt <- melt(df, id.vars = "id", measure.vars = patterns("^v[0-9]"))
df.melt <- df.melt[ complete.cases(df.melt), ]
#perform update join to get the groupnumber
df.melt[ dt.lookup, group := i.V2, on = .(value = V1) ]
# id variable value group
# 1: 1 v1 111 1
# 2: 2 v1 333 1
# 3: 3 v1 000 2
# 4: 4 v1 9999 1
# 5: 5 v1 1 2
# 6: 1 v2 222 1
# 7: 2 v2 111 1
# 8: 3 v2 88888 2
# 9: 4 v2 111 1
# 10: 2 v3 444 1
# 11: 3 v3 1 2
#summarise to go back to oroiginal df form
df.melt[, .(num = paste0( value, collapse = ";"),
group = paste0( unique(group), collapse = ",")),
by = .(id) ][]
final output
# id num group
# 1: 1 111;222 1
# 2: 2 333;111;444 1
# 3: 3 000;88888;1 2
# 4: 4 9999;111 1
# 5: 5 1 2
Upvotes: 1