Sandy

Reputation: 1148

R how to extract n-grams based rows

I have a dataframe df:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108   -8.00  Easy       Easy      Easy         Easy
3207    3.00  Hard       Easy      Match        Match
3350    5.78  Hard       Easy      Hard         Hard
3961    10.00 Easy       Easy      Hard         Hard


1. userID is factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match'
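For anyone who wants to reproduce this, the data frame can be rebuilt roughly as follows. This is a minimal sketch using only the four rows shown above; the helper name `task_levels` is my own and not part of the original data.

```r
# Rebuild the example data frame (values copied from the table in the question).
# `task_levels` is an illustrative name for the three possible task values.
task_levels <- c("Easy", "Hard", "Match")

df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = factor(c("Easy", "Hard", "Hard", "Easy"),  levels = task_levels),
  Task_Beta    = factor(c("Easy", "Easy", "Easy", "Easy"),  levels = task_levels),
  Task_Charlie = factor(c("Easy", "Match", "Hard", "Hard"), levels = task_levels),
  Task_Delta   = factor(c("Easy", "Match", "Hard", "Hard"), levels = task_levels)
)

str(df)
```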

I want to see whether there is an association between the transitions across the task columns (Task_Alpha, Task_Beta, Task_Charlie, Task_Delta) and Score.

My hypothesis is that the 2-gram (bi-gram) sequence Hard Hard could be associated with a higher score, whereas the sequence Easy Easy would be related to a lower score.

In this example I have only considered 2-grams, but in my actual code I want to try longer sequences as well. Just for reference, apart from the repeated pairs (Easy Easy, Hard Hard, Match Match), the possible bi-grams are:

Easy Hard
Hard Easy
Easy Match
Match Easy
Hard Match
Match Hard
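The full set of possible n-grams, repeated pairs such as Easy Easy included, can also be enumerated rather than written out by hand. A small sketch, assuming the three levels listed earlier; `possible_ngrams` is a hypothetical helper name:

```r
# Enumerate every ordered sequence of length n over the given levels.
# `possible_ngrams` is an illustrative helper, not from any package.
possible_ngrams <- function(levels, n) {
  # expand.grid() builds all combinations; one column per position in the n-gram
  grid <- expand.grid(rep(list(levels), n), stringsAsFactors = FALSE)
  # paste the positions of each row together with a space
  do.call(paste, grid)
}

task_levels <- c("Easy", "Hard", "Match")

bigrams  <- possible_ngrams(task_levels, 2)
trigrams <- possible_ngrams(task_levels, 3)

length(bigrams)   # 3^2 ordered pairs
length(trigrams)  # 3^3 ordered triples
```

With three levels this gives 9 bi-grams in total: the six mixed pairs plus the three repeated ones.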

Question: As a first step, the output I need looks something like this:

Task1  Task2  Score
Easy   Easy   -8.00
Easy   Easy   -8.00
Easy   Easy   -8.00
Hard   Easy    3.00
Easy   Match   3.00
Match  Match   3.00
Hard   Easy    5.78
Easy   Hard    5.78
Hard   Hard    5.78
Easy   Easy   10.00
Easy   Hard   10.00
Hard   Hard   10.00

Upvotes: 1

Views: 347

Answers (2)

Sandy

Reputation: 1148

I have been able to solve this problem as below:

Step 1: I concatenated the task columns into a single string:

# paste() uses the factor labels, so each row becomes e.g. "Hard-Easy-Match-Match"
df$all = paste(df$Task_Alpha,
               df$Task_Beta,
               df$Task_Charlie,
               df$Task_Delta,
               sep="-")

The data frame now looks like this:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta all
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard
3961   10.00  Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard

Step 2: To get a more general solution, I used an n-gram based approach, which splits the concatenated string into n-grams of any size I want:

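library(tidytext)
library(dplyr)

df = as_tibble(df)
# unnest_tokens() lowercases tokens by default;
# drop = FALSE keeps the original `all` column next to the new `bigram` column
df_test = df %>%
   unnest_tokens(bigram, all, token = "ngrams", n = 2, drop = FALSE)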

This gives me the output:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta all                   bigram
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match hard easy
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match easy match
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match match match
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   hard easy
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   easy hard
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   hard hard
3961   10.00  Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy easy
3961   10.00  Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy hard
3961   10.00  Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   hard hard

Step 3: This solution meets my requirements and still works when I increase the size of the n-grams. For example, for 3-grams I only need to change n:

df = as_tibble(df)
# same call as before, only n changes; drop = FALSE keeps the `all` column
df_test = df %>%
   unnest_tokens(trigram, all, token = "ngrams", n = 3, drop = FALSE)

This yields:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta all                   trigram
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy easy
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy easy
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match hard easy match
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match easy match match
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   hard easy hard
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   easy hard hard
3961   10.00  Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy easy hard
3961   10.00  Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy hard hard
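For comparison, the same sliding-window idea can be sketched in base R without tidytext, which also keeps the original capitalization. This is only an illustration under the assumption that the task columns are already in the order of the transitions; `extract_ngrams` is a hypothetical helper name, not a function from any package.

```r
# Example data rebuilt from the question (task columns become factors).
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = c("Easy", "Hard", "Hard", "Easy"),
  Task_Beta    = c("Easy", "Easy", "Easy", "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard"),
  Task_Delta   = c("Easy", "Match", "Hard", "Hard"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: slide a window of width n across the task columns of each row.
extract_ngrams <- function(df, task_cols, n) {
  stopifnot(n >= 1, n <= length(task_cols))
  tasks  <- as.matrix(df[task_cols])             # character matrix of factor labels
  starts <- seq_len(length(task_cols) - n + 1)   # first column of each window
  rows <- lapply(seq_len(nrow(df)), function(i) {
    grams <- vapply(starts,
                    function(s) paste(tasks[i, s:(s + n - 1)], collapse = " "),
                    character(1))
    # one output row per window, repeating the user's ID and Score
    data.frame(userID = df$userID[i], Score = df$Score[i], ngram = grams,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

task_cols <- c("Task_Alpha", "Task_Beta", "Task_Charlie", "Task_Delta")
bigrams_df  <- extract_ngrams(df, task_cols, n = 2)
trigrams_df <- extract_ngrams(df, task_cols, n = 3)
```

Here `bigrams_df` has three rows per user and `trigrams_df` has two, matching the row counts in the tables above.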

Upvotes: 0

prosoitos

Reputation: 7347

First, you need to convert all your factors to characters (otherwise, in the next step, instead of using the values of your factors, R will use their indices).
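To illustrate what "indices" means here, a factor stores integer codes that point into its sorted levels, and those codes are what you get when the labels are lost. A tiny sketch with a made-up factor `f`:

```r
# `f` is a throwaway example factor; levels are sorted alphabetically by default
f <- factor(c("Hard", "Easy", "Match"))

levels(f)        # "Easy" "Hard" "Match"
as.integer(f)    # the internal codes: 2 1 3
as.character(f)  # the labels: "Hard" "Easy" "Match"
```

As far as I recall, R 4.1.0 and later combine factors with c() into a factor rather than bare integer codes, so the conversion may matter less there, but it is harmless either way.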

One option with dplyr:

library(dplyr)

df <- df %>% mutate_if(is.factor, as.character)

Then you can do:

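# Stack the three adjacent column pairs (3-4, 4-5, 5-6) on top of each other,
# repeat Score once per pair, then sort so each user's rows sit together.
# Note: df[, 3] returns a plain vector only for a base data.frame, not a tibble.
data.frame(Task1 = c(df[, 3], df[, 4], df[, 5]),
           Task2 = c(df[, 4], df[, 5], df[, 6]),
           Score = rep(df[, 2], 3)) %>%
  arrange(Score)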

Output:

   Task1 Task2 Score
1   Easy  Easy -8.00
2   Easy  Easy -8.00
3   Easy  Easy -8.00
4   Hard  Easy  3.00
5   Easy Match  3.00
6  Match Match  3.00
7   Hard  Easy  5.78
8   Easy  Hard  5.78
9   Hard  Hard  5.78
10  Easy  Easy 10.00
11  Easy  Hard 10.00
12  Hard  Hard 10.00
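Once the data is in this long shape, one possible next step toward the association the question asks about is a per-bigram summary of Score. This is only a rough sketch and not part of the answer above: the `long` data frame below simply retypes the output shown, and with four users the numbers are purely illustrative.

```r
# `long` retypes the 12-row output above; the name is illustrative.
long <- data.frame(
  Task1 = c("Easy", "Easy", "Easy", "Hard", "Easy", "Match",
            "Hard", "Easy", "Hard", "Easy", "Easy", "Hard"),
  Task2 = c("Easy", "Easy", "Easy", "Easy", "Match", "Match",
            "Easy", "Hard", "Hard", "Easy", "Hard", "Hard"),
  Score = c(-8, -8, -8, 3, 3, 3, 5.78, 5.78, 5.78, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Label each row with its bigram, then average Score within each bigram.
long$bigram <- paste(long$Task1, long$Task2)
mean_by_bigram <- aggregate(Score ~ bigram, data = long, FUN = mean)

mean_by_bigram
```

Keep in mind that each user contributes several rows with the same Score, so these means are descriptive only.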

Upvotes: 1
