Reputation: 556
I have a data frame where each row of the column 'leg_activity' is a string of comma-separated words:
structure(list(id = c("100", "100060", "100073", "100098", "100102",
"100104", "100125", "100128", "100149", "100217", "100220", "100271",
"100464", "100465", "100520", "100607", "100653", "100745", "100757",
"100760"), leg_activity = c("home", "home, car, work, car, leisure, car, other, car, leisure, car, work, car, shop, car, home",
"home, walk, leisure, walk, leisure, walk, home", "home, car, other, car, shop, car, other, car, home",
"home, car, work, car, home, car, home", "home", "home, walk, education, walk, home",
"home, car, other, car, work, car, shop, car, shop, car, home",
"home, car, shop, car, work, car, home", "home, bike, leisure, bike, home",
"home, walk, shop, walk, home", "home, pt, leisure, car, leisure, pt, home",
"home, car, education, car, home", "home, car, leisure, car, home",
"home, walk, home, walk, shop, walk, home", "home, pt, work, walk, leisure, walk, work, pt, home",
"home, pt, leisure, walk, leisure, walk, home", "home, walk, home, bike, shop, bike, home",
"home, pt, work, pt, home, walk, work, walk, home", "home")), row.names = c(2L,
15L, 20L, 24L, 31L, 33L, 40L, 43L, 48L, 70L, 73L, 93L, 147L,
148L, 156L, 174L, 188L, 213L, 214L, 220L), class = "data.frame")
In each string, I want to extract the word that appears directly before the word "work". "work" can appear multiple times, and the preceding word needs to be extracted or counted for each occurrence. For example, for id 100060 the words before "work" are "car" and "car". Eventually, I am interested in counting how frequently each word appears before "work" over the entire df.
What I have tried:
library(dplyr)
library(tidyr)
library(stringr)

df %>%
  separate_rows(leg_activity, sep = "work, ") %>%
  group_by(id) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = n, values_from = leg_activity)
Clearly, this does not lead to the desired result; it only spreads the strings into columns. So perhaps another approach is more appropriate.
Thank you very much for your help!
Upvotes: 1
Views: 1143
Reputation: 5788
Base R one liner:
table(unlist(regmatches(df$leg_activity,
  gregexpr("\\w+(?=, work\\b)", df$leg_activity, perl = TRUE))))
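Applied to the question's df, this should reproduce the counts reported in the other answers:
#  car   pt walk 
#    5    2    2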
Upvotes: 1
Reputation: 67778
First, a slightly smaller data set to make it easier to track the result of the code:
d = data.frame(id = 1:3, leg = c("home",
"work, R, eat, work",
"eat, work, R, work"), stringsAsFactors = FALSE)
Split the strings (strsplit) on ", ". Loop over the resulting list (lapply). Get the indices of "work" (which(x == "work")) and take the previous index (- 1). Use pmax to get an empty vector if "work" is the first word. Index the words (x[<the-index>]). Unlist and count the items (table(unlist(...))).
table(unlist(lapply(strsplit(d$leg, ", "), function(x) x[pmax(0, which(x == "work") - 1)])))
# eat   R 
#   2   1
Given "Eventually, I am interested to count how frequently which word appears before work over the entire df.", it seems like grouping is unnecessary.
Upvotes: 2
Reputation: 4358
library(stringr)
WantedStrings <- sub(", work","",str_extract_all(df$leg_activity, "\\w+, work",simplify=T))
WantedStrings <- WantedStrings[WantedStrings != ""]
table(WantedStrings)
WantedStrings
  car    pt  walk 
    5     2     2
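As a possible shortening (a sketch): stringr's ICU regex supports a lookahead, which avoids the separate sub() step:
table(unlist(str_extract_all(df$leg_activity, "\\w+(?=, work)")))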
Upvotes: 1
Reputation: 30474
You could use separate_rows with just a comma as the separator to get your words on different rows. Then, after grouping by id, you can filter the rows where the following (lead) row is "work".
library(dplyr)
library(tidyr)

df %>%
  separate_rows(leg_activity, sep = ",") %>%
  mutate(leg_activity = trimws(leg_activity)) %>%
  group_by(id) %>%
  filter(lead(leg_activity) == "work") %>%
  summarise(count = n())
Output
# A tibble: 6 x 2
  id     count
  <chr>  <int>
1 100060     2
2 100102     1
3 100128     1
4 100149     1
5 100607     2
6 100757     2
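If the per-word frequency over the whole df is wanted (what the question ultimately asks for), the same pipeline could end with a count of the preceding words instead of a per-id summary; a sketch:
df %>%
  separate_rows(leg_activity, sep = ",") %>%
  mutate(leg_activity = trimws(leg_activity)) %>%
  group_by(id) %>%
  filter(lead(leg_activity) == "work") %>%
  ungroup() %>%
  count(leg_activity)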
Upvotes: 2