Reputation: 2800
Although it might seem it has been previously solved in other questions, my problem is a little bit more challenging.
Let's start with an example I copied/pasted from a Calc sheet:
This is the minimal reproducible example required:
Label <- c("Catalog codes:" , "Themes:", "Size:", "Score:", "Buy Now:",
"Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
"Catalog codes:", "Themes:", "Size:", "Score:",
"Series:", "Themes:", "Size:", "Score:", "Related items:",
"Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- as.data.frame(Label)
Part of the R dataframe I have contains a column with such column (Label
) and many many rows.
The point here is that a group of rows belongs to one category (let's say Group 1
and so on). You can identify the different groups in pink and white background in the previous image.
Although there is an inner order of the labels in each group, not all the groups contain the same labels.
However, the starting and ending labels in each group remains the same, depending on the label which is present. You can see that Catalog codes:
and Series:
start each group, whereas Buy Now:
, Score:
and Related items:
end each group.
I would like to create a second column in this dataframe which could identify the patterns or combinations of these ending/starting labels, and later on, categorize them. The result might be similar to this image:
Upvotes: 0
Views: 269
Reputation: 1098
If you search for the starting labels and the ending labels with grepl()
, you can shift the ending label for one row and see where the starting and ending labels match and use that to create your group id with cumsum()
. This ensures that you always group everything between the first starting label of a group and the last ending label of a group together since there can be multiple labels in a group.
Label <- c("Catalog codes:" , "Themes:", "Size:", "Score:", "Buy Now:",
"Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
"Catalog codes:", "Themes:", "Size:", "Score:",
"Series:", "Themes:", "Size:", "Score:", "Related items:",
"Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- as.data.frame(Label)
example$Group <- paste("Group",
cumsum(
grepl("Catalog codes:|Series:", example$Label) * c(TRUE, head(grepl("Buy Now:|Score:|Related items:", example$Label), -1))
)
)
# Result
Label Group
1 Catalog codes: Group 1
2 Themes: Group 1
3 Size: Group 1
4 Score: Group 1
5 Buy Now: Group 1
6 Series: Group 2
7 Catalog codes: Group 2
8 Themes: Group 2
9 Related items: Group 2
10 Buy Now: Group 2
11 Catalog codes: Group 3
12 Themes: Group 3
13 Size: Group 3
14 Score: Group 3
15 Series: Group 4
16 Themes: Group 4
17 Size: Group 4
18 Score: Group 4
19 Related items: Group 4
20 Catalog codes: Group 5
21 Themes Group 5
22 Size: Group 5
23 Score: Group 5
24 Related items: Group 5
25 Buy Now: Group 5
Upvotes: 2
Reputation: 27732
This answer does not complete match your desired output... see output below... Please clarify why row 6 does not have it's 'own' group, as asked in comments.
library( data.table )
setDT(example)[, Group := paste0( "Group ",
cumsum( grepl( "^Catalog codes|^Series", Label ) )
) ]
# Label Group
# 1: Catalog codes: Group 1
# 2: Themes: Group 1
# 3: Size: Group 1
# 4: Score: Group 1
# 5: Buy Now: Group 1
# 6: Series: Group 2 <-- !!
# 7: Catalog codes: Group 3 <-- !!
# 8: Themes: Group 3
# 9: Related items: Group 3
# 10: Buy Now: Group 3
# 11: Catalog codes: Group 4
# 12: Themes: Group 4
# 13: Size: Group 4
# 14: Score: Group 4
# 15: Series: Group 5
# 16: Themes: Group 5
# 17: Size: Group 5
# 18: Score: Group 5
# 19: Related items: Group 5
# 20: Catalog codes: Group 6
# 21: Themes Group 6
# 22: Size: Group 6
# 23: Score: Group 6
# 24: Related items: Group 6
# 25: Buy Now: Group 6
# Label Group
Upvotes: 0