Reputation: 556
I have a df which shows an "activity chain" of people, which looks like this (snippet at the bottom of the question):
head(agents)
id leg_activity
1 9 home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home
2 10 home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home
3 11 home, work, adpt, home
4 96 home, car, work, car, home, work, adpt, home
5 97 home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home
6 101 home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home
What I'm interested in is detecting patterns in the occurrences of adpt. The simplest way is to use the count() function, which gives me a frequency table as output. Unfortunately, this result is misleading.
This is how that looks:
x freq
home, adpt, work, adpt, home 2071
home, adpt, shop, adpt, home 653
home, adpt, education, adpt, home 545
home, pt, work, adpt, home 492
home, adpt, work, pt, home 468
home, adpt, work, home 283
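For illustration, a minimal sketch of how such a whole-string frequency table comes about. It uses made-up toy chains rather than the real data, and base R's table() in place of count(); the object name toy is hypothetical.

```r
# Toy chains, made up for illustration (not the real data)
toy <- data.frame(
  id = 1:4,
  leg_activity = c(
    "home, adpt, work, adpt, home",
    "home, adpt, work, adpt, home",
    "home, adpt, shop, adpt, home",
    "home, adpt, work, adpt, home, car, shop, car, home"
  ),
  stringsAsFactors = FALSE
)

# Count identical whole strings; each cell is treated as one indivisible value
freq <- as.data.frame(table(x = toy$leg_activity),
                      responseName = "freq",
                      stringsAsFactors = FALSE)

# Most frequent chains first
freq <- freq[order(-freq$freq), ]
print(freq)
```

Every distinct string gets its own row, so the fourth toy chain is counted separately from the first two.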
The problem with this approach is that I can't detect patterns in longer activity chains; for example:
home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home
This chain begins with a pattern that is very frequent, but because further activities follow it, the count() function does not count it toward that pattern.
Is there a way to count that also takes into account what happens inside each cell? Ideally, I would get a table that shows all possible combinations and their frequencies, like this:
x freq
home, adpt, home 10
home, adpt, home, pt, work, home 4
home, pt, work, home 2
Thank you a lot for the help!
the data:
structure(list(id = c(9L, 10L, 11L, 96L, 97L, 101L, 103L, 248L,
499L, 1044L, 1215L, 1238L, 1458L, 1569L, 1615L, 1626L, 1734L,
1735L, 1790L, 1912L, 9040L, 14858L, 14859L, 14967L, 15011L, 15012L,
15015L, 15045L, 15050L, 15058L, 15060L, 15086L, 15088L, 15094L,
15109L, 15113L, 15152L, 15157L, 15192L, 15193L, 15222L, 15230L,
15231L, 15234L, 15235L, 15237L, 15256L, 15257L, 15258L, 15269L,
15270L, 15318L, 15319L, 15338L, 15369L, 15371L, 15396L, 15397L,
15399L, 15404L, 15505L, 15506L, 15515L, 15516L, 15525L, 15542L,
15593L, 15602L, 15608L, 15643L, 15667L, 15727L, 15728L, 15729L,
15752L, 15775L, 15808L, 15851L, 15869L, 15881L, 15882L, 15960L,
15962L, 15966L, 16058L, 16107L, 16174L, 16229L, 16237L, 16238L,
16291L, 16333L, 16416L, 16418L, 16449L, 16450L, 16451L, 16491L,
16506L, 16508L), leg_activity = c("home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home",
"home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home",
"home, work, adpt, home", "home, car, work, car, home, work, adpt, home",
"home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home",
"home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home",
"home, adpt, work, adpt, home, walk, other, pt, home", "home, adpt, work, walk, home, adpt, work, walk, home",
"home, adpt, leisure, adpt, home, bike, outside, bike, home",
"home, pt, work, adpt, home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, outside, car, work, car, work, car, home",
"home, work, leisure, adpt, home", "home, outside, pt, home, adpt, leisure, adpt, home",
"home, car_passenger, leisure, walk, work, walk, leisure, walk, work, adpt, home, walk, home",
"home, adpt, work, walk, work, walk, work, pt, home", "home, car, work, pt, leisure, adpt, work, car, home, car, home",
"home, adpt, other, adpt, home, car, home", "home, adpt, other, adpt, home",
"home, education, walk, shop, walk, education, pt, outside, home, adpt, leisure, adpt, home",
"home, adpt, work, adpt, home, walk, home", "home, adpt, work, pt, leisure, adpt, work, adpt, work, adpt, home, adpt, other, walk, home",
"home, adpt, work, adpt, home, adpt, work, adpt, home, walk, leisure, walk, home",
"home, adpt, work, adpt, home, work, adpt, home, walk, leisure, walk, home",
"home, adpt, work, adpt, home, car_passenger, outside, car_passenger, leisure, car_passenger, home, car_passenger, home",
"home, adpt, other, adpt, home, car, work, car, home", "home, adpt, education, adpt, leisure, adpt, home, walk, leisure, walk, home",
"home, car_passenger, other, pt, home, walk, other, walk, home, car_passenger, other, walk, home, adpt, other, adpt, home",
"home, work, pt, work, adpt, work, adpt, home", "home, adpt, leisure, adpt, home, car, shop, car, other, car, home",
"home, adpt, work, adpt, home, walk, other, adpt, home", "home, adpt, work, adpt, home, car_passenger, leisure, car_passenger, home",
"home, car, other, car, home, adpt, shop, adpt, home", "home, pt, work, adpt, home",
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home",
"home, walk, education, adpt, home, walk, education, walk, home, bike, leisure, bike, home",
"home, adpt, shop, adpt, home, car, home", "home, adpt, leisure, walk, leisure, walk, leisure, adpt, home",
"home, adpt, shop, pt, home, adpt, other, adpt, home", "home, adpt, other, adpt, home, car_passenger, leisure, walk, home",
"home, adpt, work, adpt, home, car_passenger, shop, car_passenger, home",
"home, adpt, other, adpt, work, adpt, home", "home, adpt, work, adpt, home, adpt, other, walk, shop, walk, home, car, outside, car, outside, car, outside, car, home",
"home, adpt, other, adpt, home", "home, adpt, education, adpt, home, adpt, education, adpt, home",
"home, pt, work, adpt, work, adpt, work, adpt, work, adpt, home, adpt, work, adpt, home",
"home, walk, other, car_passenger, education, walk, home, car_passenger, education, adpt, home",
"home, walk, shop, walk, home, walk, leisure, adpt, leisure, adpt, home",
"home, adpt, work, adpt, home, walk, shop, walk, home, walk, leisure, walk, home, walk, home",
"home, adpt, leisure, adpt, home", "home, walk, leisure, walk, home, adpt, other, adpt, shop, walk, leisure, walk, home",
"home, pt, leisure, adpt, home, pt, outside, pt, home, bike, leisure, bike, home",
"home, pt, outside, pt, home, walk, home, walk, other, adpt, shop, pt, home, car_passenger, leisure, adpt, home",
"home, adpt, work, adpt, home, adpt, shop, adpt, work, adpt, home",
"home, adpt, shop, adpt, other, walk, home", "home, walk, other, walk, home, walk, home, adpt, other, adpt, home, adpt, shop, adpt, home, car, other, car, home, adpt, other, adpt, home",
"home, adpt, leisure, pt, home", "home, leisure, adpt, home",
"home, adpt, leisure, pt, shop, walk, home, walk, shop, walk, home",
"home, car, outside, car, outside, leisure, car, outside, car, outside, car, home, adpt, other, adpt, home",
"home, adpt, work, adpt, shop, walk, home", "home, adpt, other, walk, work, adpt, home, adpt, other, adpt, work, adpt, home, adpt, leisure, adpt, home",
"home, adpt, leisure, adpt, home, car, shop, car, home", "home, walk, shop, adpt, home, car, other, car, home, adpt, other, adpt, home",
"home, walk, leisure, walk, home, adpt, work, adpt, home", "home, adpt, work, adpt, home",
"home, adpt, leisure, pt, shop, adpt, home, adpt, leisure, walk, home",
"home, walk, other, walk, leisure, walk, home, car, leisure, car, home, walk, leisure, adpt, home",
"home, adpt, work, adpt, home", "home, walk, leisure, walk, home, adpt, leisure, adpt, home, adpt, leisure, walk, home",
"home, walk, home, walk, shop, walk, home, walk, leisure, walk, home, adpt, other, adpt, home",
"home, car_passenger, outside, car_passenger, outside, car_passenger, home, adpt, other, adpt, home",
"home, walk, education, adpt, home", "home, adpt, education, walk, home, bike, education, bike, home",
"home, adpt, other, adpt, home, adpt, shop, pt, home", "home, adpt, other, adpt, shop, walk, home, adpt, leisure, car_passenger, home",
"home, adpt, work, adpt, other, adpt, home", "home, adpt, work, adpt, home",
"home, adpt, work, adpt, home, walk, home", "home, car, work, adpt, leisure, adpt, work, car, home",
"home, adpt, shop, adpt, home, car, other, car, home, car_passenger, outside, car_passenger, home",
"home, adpt, work, pt, home, car, shop, car, home", "home, walk, other, adpt, work, adpt, shop, adpt, shop, adpt, home",
"home, adpt, leisure, adpt, shop, adpt, leisure, pt, home", "home, adpt, leisure, adpt, shop, adpt, home",
"home, car, outside, car, outside, car, outside, car, outside, car, home, adpt, education, pt, home",
"home, adpt, work, adpt, home", "home, adpt, shop, adpt, home",
"home, adpt, education, adpt, home, adpt, education, adpt, home",
"home, adpt, other, adpt, other, walk, leisure, adpt, other, adpt, home",
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, other, car, home",
"home, car, work, car, shop, car, home, adpt, work, adpt, home, car, home",
"home, walk, other, walk, education, adpt, home, adpt, education, walk, home, walk, home",
"home, adpt, shop, walk, leisure, adpt, home", "home, adpt, shop, walk, home, adpt, work, adpt, home",
"home, adpt, leisure, adpt, shop, walk, home", "home, walk, other, adpt, shop, walk, home, walk, other, walk, home, walk, other, walk, other, adpt, home",
"home, adpt, education, walk, home, walk, education, walk, home, walk, home",
"home, bike, education, bike, home, adpt, education, adpt, home, walk, home"
)), row.names = c(NA, 100L), class = "data.frame")
Upvotes: 1
Views: 191
Reputation: 3876
I'm not quite sure exactly what you want to do, but I understand that you are interested in detecting patterns in the occurrences of the activity adpt. This is often done in NLP; below is a solution using the tidytext package. I split the leg_activity column into what are called n-grams, i.e. I break the text up into consecutive sequences of words. A sequence of two consecutive words is called a bi-gram, three consecutive words a tri-gram, etc. When we then count these n-grams, we learn which activities most often precede adpt and which most often come after adpt.
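To make the idea concrete, here is a minimal base R sketch of what the bi-grams and tri-grams of a single chain look like. The helper name ngrams is made up for illustration and is not part of tidytext, which handles this step for you.

```r
# One chain taken from the question's data
chain <- "home, adpt, work, adpt, home"

# Split the comma-separated string into single activities
acts <- strsplit(chain, ",\\s*")[[1]]

# Hypothetical helper: all runs of n consecutive activities
ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts,
         function(i) paste(x[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- ngrams(acts, 2)
trigrams <- ngrams(acts, 3)
print(bigrams)
print(trigrams)
```

A chain of five activities yields four bi-grams and three tri-grams, because the window slides forward one activity at a time.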
Here is how to do it for bi-grams:
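library(dplyr); library(stringr); library(tidytext)  # %>%, filter, count; str_detect; unnest_tokens
# df is assumed to be the data frame from the question (printed there as agents)
df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%  # one row per pair of consecutive activities
  filter(str_detect(bigram, "adpt")) %>%  # keep only pairs that contain adpt
  count(bigram, sort = TRUE)  # most frequent pairs first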
bigram n
1 home adpt 100
2 adpt home 97
3 work adpt 51
4 adpt work 48
5 leisure adpt 27
6 adpt other 26
7 other adpt 26
8 adpt leisure 24
9 adpt shop 22
10 shop adpt 13
11 adpt education 10
12 education adpt 10
So adpt is most often preceded by "home", and "home" is also what most often comes directly after "adpt". If we are interested in three activities occurring consecutively and including "adpt", we can do the same for tri-grams:
# Same pipeline with n = 3; the output column keeps the name bigram from the bi-gram example
df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 3) %>%  # n is the only thing that changed
  filter(str_detect(bigram, "adpt")) %>%
  count(bigram, sort = TRUE)
bigram n
1 work adpt home 42
2 adpt work adpt 40
3 home adpt work 36
4 home adpt other 22
5 adpt other adpt 21
6 home adpt leisure 20
7 leisure adpt home 19
8 other adpt home 18
9 adpt leisure adpt 16
10 adpt home adpt 15
11 home adpt shop 12
12 adpt home car 11
13 adpt home walk 11
14 adpt shop adpt 11
15 home adpt education 10
16 education adpt home 9
[list continues]
This list is considerably longer, since there are now more possible combinations. Here is a link to a good tutorial on n-grams if you want to learn more. Is this what you wanted to do?
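If the goal is closer to the table sketched in the question, one possible reading (an assumption on my part, not something the question confirms) is to cut each chain into home-to-home tours and count those. A minimal base R sketch with made-up toy chains; the helper name split_tours is hypothetical.

```r
# Toy chains, made up for illustration (not the real data)
toy <- c(
  "home, adpt, work, adpt, home",
  "home, adpt, work, adpt, home, car, shop, car, home",
  "home, adpt, shop, adpt, home, adpt, work, adpt, home"
)

# Hypothetical helper: cut one chain into tours that start and end at home
split_tours <- function(chain) {
  acts <- strsplit(chain, ",\\s*")[[1]]
  home <- which(acts == "home")
  if (length(home) < 2) return(character(0))
  vapply(seq_len(length(home) - 1),
         function(i) paste(acts[home[i]:home[i + 1]], collapse = ", "),
         character(1))
}

# All tours from all chains, then only those that involve adpt
tours <- unlist(lapply(toy, split_tours))
adpt_tours <- tours[grepl("adpt", tours, fixed = TRUE)]

# Frequency of each distinct tour, most frequent first
freq <- sort(table(adpt_tours), decreasing = TRUE)
print(freq)
```

With this reading, a frequent tour at the start of a long chain is counted no matter what follows it. Patterns that span several tours would need a different split, or the n-gram approach with a larger n.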
Upvotes: 1