Yves

Reputation: 556

How to detect pattern and frequency in a column of characters, using R?

I have a data frame that records an "activity chain" for each person. It looks like this (a snippet of the data is at the bottom of the question):

head(agents)
   id                                                                                                                                                                leg_activity
1   9                                                                                      home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home
2  10 home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home
3  11                                                                                                                                                      home, work, adpt, home
4  96                                                                                                                                home, car, work, car, home, work, adpt, home
5  97                              home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home
6 101                                       home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home

I want to detect patterns in where adpt occurs. The simplest approach is the count() function, which returns a frequency table of whole chains. Unfortunately, that result is misleading.

This is how that looks:

x                                 freq
home, adpt, work, adpt, home      2071
home, adpt, shop, adpt, home      653
home, adpt, education, adpt, home 545
home, pt, work, adpt, home        492
home, adpt, work, pt, home        468
home, adpt, work, home            283

The problem with this approach is that I can't detect patterns in longer activity chains; for example:

 home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home

This chain begins with a very frequent activity chain, but because further activities follow it, count() treats the whole cell as a different value and the frequent part is not counted.

Is there a way to count that also takes into account what happens inside each cell? It would be interesting to have a table that shows all possible combinations and their frequencies, like this:

x                                freq
home, adpt, home                 10
home, adpt, home, pt, work, home 4
home, pt, work, home             2

Thank you a lot for the help!

the data:

structure(list(id = c(9L, 10L, 11L, 96L, 97L, 101L, 103L, 248L, 
499L, 1044L, 1215L, 1238L, 1458L, 1569L, 1615L, 1626L, 1734L, 
1735L, 1790L, 1912L, 9040L, 14858L, 14859L, 14967L, 15011L, 15012L, 
15015L, 15045L, 15050L, 15058L, 15060L, 15086L, 15088L, 15094L, 
15109L, 15113L, 15152L, 15157L, 15192L, 15193L, 15222L, 15230L, 
15231L, 15234L, 15235L, 15237L, 15256L, 15257L, 15258L, 15269L, 
15270L, 15318L, 15319L, 15338L, 15369L, 15371L, 15396L, 15397L, 
15399L, 15404L, 15505L, 15506L, 15515L, 15516L, 15525L, 15542L, 
15593L, 15602L, 15608L, 15643L, 15667L, 15727L, 15728L, 15729L, 
15752L, 15775L, 15808L, 15851L, 15869L, 15881L, 15882L, 15960L, 
15962L, 15966L, 16058L, 16107L, 16174L, 16229L, 16237L, 16238L, 
16291L, 16333L, 16416L, 16418L, 16449L, 16450L, 16451L, 16491L, 
16506L, 16508L), leg_activity = c("home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home", 
"home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home", 
"home, work, adpt, home", "home, car, work, car, home, work, adpt, home", 
"home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home", 
"home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home", 
"home, adpt, work, adpt, home, walk, other, pt, home", "home, adpt, work, walk, home, adpt, work, walk, home", 
"home, adpt, leisure, adpt, home, bike, outside, bike, home", 
"home, pt, work, adpt, home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, outside, car, work, car, work, car, home", 
"home, work, leisure, adpt, home", "home, outside, pt, home, adpt, leisure, adpt, home", 
"home, car_passenger, leisure, walk, work, walk, leisure, walk, work, adpt, home, walk, home", 
"home, adpt, work, walk, work, walk, work, pt, home", "home, car, work, pt, leisure, adpt, work, car, home, car, home", 
"home, adpt, other, adpt, home, car, home", "home, adpt, other, adpt, home", 
"home, education, walk, shop, walk, education, pt, outside, home, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, adpt, work, pt, leisure, adpt, work, adpt, work, adpt, home, adpt, other, walk, home", 
"home, adpt, work, adpt, home, adpt, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, outside, car_passenger, leisure, car_passenger, home, car_passenger, home", 
"home, adpt, other, adpt, home, car, work, car, home", "home, adpt, education, adpt, leisure, adpt, home, walk, leisure, walk, home", 
"home, car_passenger, other, pt, home, walk, other, walk, home, car_passenger, other, walk, home, adpt, other, adpt, home", 
"home, work, pt, work, adpt, work, adpt, home", "home, adpt, leisure, adpt, home, car, shop, car, other, car, home", 
"home, adpt, work, adpt, home, walk, other, adpt, home", "home, adpt, work, adpt, home, car_passenger, leisure, car_passenger, home", 
"home, car, other, car, home, adpt, shop, adpt, home", "home, pt, work, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, walk, education, adpt, home, walk, education, walk, home, bike, leisure, bike, home", 
"home, adpt, shop, adpt, home, car, home", "home, adpt, leisure, walk, leisure, walk, leisure, adpt, home", 
"home, adpt, shop, pt, home, adpt, other, adpt, home", "home, adpt, other, adpt, home, car_passenger, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, shop, car_passenger, home", 
"home, adpt, other, adpt, work, adpt, home", "home, adpt, work, adpt, home, adpt, other, walk, shop, walk, home, car, outside, car, outside, car, outside, car, home", 
"home, adpt, other, adpt, home", "home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, pt, work, adpt, work, adpt, work, adpt, work, adpt, home, adpt, work, adpt, home", 
"home, walk, other, car_passenger, education, walk, home, car_passenger, education, adpt, home", 
"home, walk, shop, walk, home, walk, leisure, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, shop, walk, home, walk, leisure, walk, home, walk, home", 
"home, adpt, leisure, adpt, home", "home, walk, leisure, walk, home, adpt, other, adpt, shop, walk, leisure, walk, home", 
"home, pt, leisure, adpt, home, pt, outside, pt, home, bike, leisure, bike, home", 
"home, pt, outside, pt, home, walk, home, walk, other, adpt, shop, pt, home, car_passenger, leisure, adpt, home", 
"home, adpt, work, adpt, home, adpt, shop, adpt, work, adpt, home", 
"home, adpt, shop, adpt, other, walk, home", "home, walk, other, walk, home, walk, home, adpt, other, adpt, home, adpt, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, adpt, leisure, pt, home", "home, leisure, adpt, home", 
"home, adpt, leisure, pt, shop, walk, home, walk, shop, walk, home", 
"home, car, outside, car, outside, leisure, car, outside, car, outside, car, home, adpt, other, adpt, home", 
"home, adpt, work, adpt, shop, walk, home", "home, adpt, other, walk, work, adpt, home, adpt, other, adpt, work, adpt, home, adpt, leisure, adpt, home", 
"home, adpt, leisure, adpt, home, car, shop, car, home", "home, walk, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, walk, leisure, walk, home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, leisure, pt, shop, adpt, home, adpt, leisure, walk, home", 
"home, walk, other, walk, leisure, walk, home, car, leisure, car, home, walk, leisure, adpt, home", 
"home, adpt, work, adpt, home", "home, walk, leisure, walk, home, adpt, leisure, adpt, home, adpt, leisure, walk, home", 
"home, walk, home, walk, shop, walk, home, walk, leisure, walk, home, adpt, other, adpt, home", 
"home, car_passenger, outside, car_passenger, outside, car_passenger, home, adpt, other, adpt, home", 
"home, walk, education, adpt, home", "home, adpt, education, walk, home, bike, education, bike, home", 
"home, adpt, other, adpt, home, adpt, shop, pt, home", "home, adpt, other, adpt, shop, walk, home, adpt, leisure, car_passenger, home", 
"home, adpt, work, adpt, other, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, car, work, adpt, leisure, adpt, work, car, home", 
"home, adpt, shop, adpt, home, car, other, car, home, car_passenger, outside, car_passenger, home", 
"home, adpt, work, pt, home, car, shop, car, home", "home, walk, other, adpt, work, adpt, shop, adpt, shop, adpt, home", 
"home, adpt, leisure, adpt, shop, adpt, leisure, pt, home", "home, adpt, leisure, adpt, shop, adpt, home", 
"home, car, outside, car, outside, car, outside, car, outside, car, home, adpt, education, pt, home", 
"home, adpt, work, adpt, home", "home, adpt, shop, adpt, home", 
"home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, adpt, other, adpt, other, walk, leisure, adpt, other, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, other, car, home", 
"home, car, work, car, shop, car, home, adpt, work, adpt, home, car, home", 
"home, walk, other, walk, education, adpt, home, adpt, education, walk, home, walk, home", 
"home, adpt, shop, walk, leisure, adpt, home", "home, adpt, shop, walk, home, adpt, work, adpt, home", 
"home, adpt, leisure, adpt, shop, walk, home", "home, walk, other, adpt, shop, walk, home, walk, other, walk, home, walk, other, walk, other, adpt, home", 
"home, adpt, education, walk, home, walk, education, walk, home, walk, home", 
"home, bike, education, bike, home, adpt, education, adpt, home, walk, home"
)), row.names = c(NA, 100L), class = "data.frame")

Upvotes: 1

Views: 191

Answers (1)

Ahorn

Reputation: 3876

I'm not quite sure what exactly you want to do, but I understand that you are interested in detecting patterns in the occurrences of the activity adpt. This is a common task in NLP; below is a solution using the tidytext package. I split the leg_activity column into so-called n-grams, i.e. I break the text into sequences of consecutive words. A sequence of two consecutive words is called a bi-gram, three consecutive words a tri-gram, and so on. When we then count these n-grams, we learn which activities most often precede adpt and which most often come after it.
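To make the idea concrete, here is a minimal base R sketch of what splitting one chain into n-grams means. The helper name `ngrams` is made up for this illustration and is not part of tidytext; it only assumes that activities are separated by ", " as in the question's data.

```r
# One activity chain taken from the question's data
chain <- "home, adpt, work, adpt, home"

# Split the chain into individual activities
tokens <- strsplit(chain, ", ", fixed = TRUE)[[1]]

# Hypothetical helper (not from any package): every run of n consecutive tokens
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- ngrams(tokens, 2)
trigrams <- ngrams(tokens, 3)

print(bigrams)   # "home adpt" "adpt work" "work adpt" "adpt home"
print(trigrams)  # "home adpt work" "adpt work adpt" "work adpt home"
```

Counting these n-grams over all rows is what the tidytext pipeline does in one step.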

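Here is how to do it for bi-grams:

library(dplyr)     # %>%, filter(), count()
library(tidytext)  # unnest_tokens()
library(stringr)   # str_detect()

df %>%  # df: the data frame from the question (called agents there)
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%
  filter(str_detect(bigram, "adpt")) %>%
  count(bigram, sort = TRUE)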

           bigram   n
1       home adpt 100
2       adpt home  97
3       work adpt  51
4       adpt work  48
5    leisure adpt  27
6      adpt other  26
7      other adpt  26
8    adpt leisure  24
9       adpt shop  22
10      shop adpt  13
11 adpt education  10
12 education adpt  10

So adpt is most often preceded by "home", and "home" is also what most often comes directly after "adpt". If we are interested in three activities occurring consecutively and including "adpt", we can do the same for tri-grams:

df %>%
  unnest_tokens(trigram, leg_activity, token = "ngrams", n = 3) %>%  # n = 3 instead of 2
  filter(str_detect(trigram, "adpt")) %>%
  count(trigram, sort = TRUE)

                        trigram  n
1                work adpt home 42
2                adpt work adpt 40
3                home adpt work 36
4               home adpt other 22
5               adpt other adpt 21
6             home adpt leisure 20
7             leisure adpt home 19
8               other adpt home 18
9             adpt leisure adpt 16
10               adpt home adpt 15
11               home adpt shop 12
12                adpt home car 11
13               adpt home walk 11
14               adpt shop adpt 11
15          home adpt education 10
16          education adpt home  9
[list continues]

This list is considerably longer, since there are now more possible combinations. Here is a link to a good tutorial on n-grams if you want to learn more. Is this what you wanted to do?
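If the goal is instead to count how many chains contain a given sub-chain anywhere inside them, as in the longer example from the question, a minimal base R sketch could look like this. It is a different technique from the n-gram counts above: plain fixed-string matching with grepl(). The helper name `contains_subchain` and the chosen patterns are made up for illustration; the four chains are copied from the question. Both chain and pattern are padded with the separator so that only whole activities match; otherwise "pt, work" would also match inside "adpt, work".

```r
# Hypothetical helper: TRUE for each chain that contains `pattern`
# as a run of whole, consecutive activities.
contains_subchain <- function(chains, pattern, sep = ", ") {
  padded_chains  <- paste0(sep, chains, sep)   # ", home, adpt, ..., home, "
  padded_pattern <- paste0(sep, pattern, sep)  # ", pt, work, "
  grepl(padded_pattern, padded_chains, fixed = TRUE)
}

# A few chains copied from the question
chains <- c(
  "home, adpt, work, adpt, home",
  "home, adpt, work, adpt, home, walk, home",
  "home, pt, work, adpt, home",
  "home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home"
)

# Sub-chains to look for (chosen only for illustration)
patterns <- c("home, adpt, work, adpt, home", "pt, work", "adpt, home")

# For each pattern, count the chains that contain it at least once
freq <- data.frame(
  x    = patterns,
  freq = vapply(patterns,
                function(p) sum(contains_subchain(chains, p)),
                integer(1), USE.NAMES = FALSE)
)
print(freq)
```

Note that this counts chains, not occurrences: a chain in which the pattern appears twice still contributes 1. With the full data, `chains` would be the leg_activity column.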

Upvotes: 1
