antecessor
antecessor

Reputation: 2800

Create a new categorical column based on conditions of values in other columns in a dataframe in R

Although it might seem it has been previously solved in other questions, my problem is a little bit more challenging.

Let's start with an example I copied/pasted from a Calc sheet:

enter image description here

This is the minimal reproducible example required:

Label <- c("Catalog codes:" , "Themes:", "Size:", "Score:", "Buy Now:",
           "Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
           "Catalog codes:", "Themes:", "Size:", "Score:",
           "Series:", "Themes:", "Size:", "Score:", "Related items:",
           "Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- as.data.frame(Label)

Part of the R dataframe I have contains a column with such column (Label) and many many rows.

The point here is that a group of rows belongs to one category (let's say Group 1 and so on). You can identify the different groups in pink and white background in the previous image.

Although there is an inner order of the labels in each group, not all the groups contain the same labels.

However, the starting and ending labels in each group remains the same, depending on the label which is present. You can see that Catalog codes: and Series: start each group, whereas Buy Now:, Score: and Related items: end each group.

I would like to create a second column in this dataframe which could identify the patterns or combinations of these ending/starting labels, and later on, categorize them. The result might be similar to this image:

enter image description here

Upvotes: 0

Views: 269

Answers (2)

Gilean0709
Gilean0709

Reputation: 1098

If you search for the starting labels and the ending labels with grepl(), you can shift the ending label for one row and see where the starting and ending labels match and use that to create your group id with cumsum(). This ensures that you always group everything between the first starting label of a group and the last ending label of a group together since there can be multiple labels in a group.

Label <- c("Catalog codes:" , "Themes:", "Size:", "Score:", "Buy Now:",
           "Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
           "Catalog codes:", "Themes:", "Size:", "Score:",
           "Series:", "Themes:", "Size:", "Score:", "Related items:",
           "Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- as.data.frame(Label)


example$Group <- paste("Group", 
                       cumsum(
                         grepl("Catalog codes:|Series:", example$Label) * c(TRUE, head(grepl("Buy Now:|Score:|Related items:", example$Label), -1))
                       )
)

# Result
            Label   Group
1  Catalog codes: Group 1
2         Themes: Group 1
3           Size: Group 1
4          Score: Group 1
5        Buy Now: Group 1
6         Series: Group 2
7  Catalog codes: Group 2
8         Themes: Group 2
9  Related items: Group 2
10       Buy Now: Group 2
11 Catalog codes: Group 3
12        Themes: Group 3
13          Size: Group 3
14         Score: Group 3
15        Series: Group 4
16        Themes: Group 4
17          Size: Group 4
18         Score: Group 4
19 Related items: Group 4
20 Catalog codes: Group 5
21         Themes Group 5
22          Size: Group 5
23         Score: Group 5
24 Related items: Group 5
25       Buy Now: Group 5

Upvotes: 2

Wimpel
Wimpel

Reputation: 27732

This answer does not complete match your desired output... see output below... Please clarify why row 6 does not have it's 'own' group, as asked in comments.

library( data.table )
setDT(example)[, Group := paste0( "Group ", 
                                  cumsum( grepl( "^Catalog codes|^Series", Label ) ) 
                                  ) ]
#              Label   Group
#  1: Catalog codes: Group 1
#  2:        Themes: Group 1
#  3:          Size: Group 1
#  4:         Score: Group 1
#  5:       Buy Now: Group 1
#  6:        Series: Group 2   <-- !!
#  7: Catalog codes: Group 3   <-- !!
#  8:        Themes: Group 3
#  9: Related items: Group 3
# 10:       Buy Now: Group 3
# 11: Catalog codes: Group 4
# 12:        Themes: Group 4
# 13:          Size: Group 4
# 14:         Score: Group 4
# 15:        Series: Group 5
# 16:        Themes: Group 5
# 17:          Size: Group 5
# 18:         Score: Group 5
# 19: Related items: Group 5
# 20: Catalog codes: Group 6
# 21:         Themes Group 6
# 22:          Size: Group 6
# 23:         Score: Group 6
# 24: Related items: Group 6
# 25:       Buy Now: Group 6
# Label   Group

Upvotes: 0

Related Questions