Reputation: 209
I'm trying to figure out how I can use the rollapply
function (from the Zoo
package) to find sequences of most common strings within a dataset, but I also need to do group certain variables (e.g. date, row, etc.)
Before I go any further, it's worth noting that this query builds on a question that I previously posted here : How can I find most common sequences (of strings) in my data using Tableau?
The solution offered there works really well, but I now want to apply it to a different dataset which provides some new challenges! Here's an example of the data that I'm working with in this new dataset:
structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith",
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show",
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death",
"Bollywood: The World's Biggest Film Industry", "One Hot Summer",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day",
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer",
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock",
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock",
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty",
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States",
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends",
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama",
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport",
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy",
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary",
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport",
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi",
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary",
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama",
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary",
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured",
"Featured", "Featured", "Featured", "This Weekend's Football",
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured",
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10",
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured",
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets",
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux",
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1",
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2",
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3",
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4",
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1",
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2",
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5",
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3",
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA,
-56L), class = "data.frame")
Apologies but I'm not quite sure about best practice for sharing data. Hope the above works. It should look something like this:
Title Programme_Genre Programme_Category date column row
1 Dragons Den Entertainment Featured 13/08/2018 1 1
2 One Hot Summer Documentary Featured 13/08/2018 2 1
3 Keeping Faith Drama Featured 13/08/2018 3 1
4 Cuckoo New Series Comedy Featured 13/08/2018 4 1
5 Match of the Day Sport This Weekends... 13/08/2018 1 2
6 Sportscene Sport This Weekends... 13/08/2018 2 2
What I want to do is to use the rollapply
function similar to how it was suggested in my previous question (see link above) but only on looking for sequences that appear on the same date and across a certain range of columns. For example, I want to know what the most common sequence of genre ("Programme_Genre") is but I only want the rollapply
function to do this across columns 1-4 for each row on each date. I'm sure I'm not explaining this very well (I don't come from a data science background in case you hadn't guessed) so I'm more than happy to elaborate if necessary. Thanks in advance!
Upvotes: 3
Views: 333
Reputation: 26218
In this case also, I suggest you a similar strategy suggested in linked question.
Firstly load the libraries
library(tidyverse)
library(runner)
Strategy for say n=3
n <- 3
data %>%
group_by(date) %>%
mutate(l_seq = runner(x = Programme_Genre,
k = n,
function(x) ifelse(length(x) == n, list(x), list(rep(NA, n)))
)
) %>%
ungroup() %>%
group_split(date) %>%
map_df(., ~ map_df(.x$l_seq, ~setNames(.x, paste0('Col', seq_len(n)))) %>%
mutate(date = .x$date) %>%
na.omit() %>%
group_by_all() %>%
summarise(m = n(), .groups = 'drop') %>%
filter(m == max(m) & m > 1)
)
# A tibble: 2 x 5
Col1 Col2 Col3 date m
<chr> <chr> <chr> <chr> <int>
1 Sport Sport Sport 13/08/2018 3
2 Sci-Fi Sci-Fi Sci-Fi 14/08/2018 2
Needless to say m
is the column giving you maximum count of sequence on that particular date
say if n=4
, the above syntax gives you following results
# A tibble: 1 x 6
Col1 Col2 Col3 Col4 date m
<chr> <chr> <chr> <chr> <chr> <int>
1 Sport Sport Sport Sport 13/08/2018 2
There is no sequence of length more than 1 for length 5
in the sample
Upvotes: 1
Reputation: 5898
With tidyverse, zoo and lubridate, try:
library(tidyverse)
library(zoo)
library(lubridate)
df %>%
mutate(date = lubridate::dmy(date)) %>% # Optional. Properly parses date as Date class. Makes sorting easier.
filter(column <= 4) %>% # Step 1. Exclude observations with `column` values above 4.
group_split(row, date) %>% # Step 2. Splits the DF into smaller DFs representing row and date groups.
# Step 3 (below). Loops the solution to the previous question, gets a DF, and assigns the date and row signals to each observation.
map_df(.x = . ,
.f = ~(rollapply(data = .x$Programme_Genre , 3, c) %>%
as_tibble() %>%
mutate(date = unique(.x$date), row = unique(.x$row)))) %>%
group_by_all() %>%
tally() %>%
arrange(date, row, n)
# A tibble: 26 x 6
# Groups: V1, V2, V3, date [26]
V1 V2 V3 date row n
<chr> <chr> <chr> <date> <chr> <int>
1 Documentary Drama New SeriesComedy 2018-08-13 1 1
2 Entertainment Documentary Drama 2018-08-13 1 1
3 Sport Sport Sport 2018-08-13 2 2
4 Drama Entertainment Documentary 2018-08-13 3 1
5 Sport Drama Entertainment 2018-08-13 3 1
6 Comedy Drama Comedy 2018-08-13 4 1
7 Drama Comedy Documentary 2018-08-13 4 1
8 Crime Drama Documentary Documentary 2018-08-14 1 1
9 Documentary Documentary Documentary 2018-08-14 1 1
10 Comedy Drama Comedy 2018-08-14 2 1
# ... with 16 more rows
Upvotes: 1