Reputation: 416
I am trying to calculate the average amount of money actors are generating for movies. In my dataset this is defined by gross
as shown in a snippet below:
movie_title actor_1_name actor_2_name actor_3_name ... gross
x Christoph Waltz Johnny Depp Emma Stone x 309404152
x Mark Addy Christopher Lee Naomi Watts x 73058679
x Will Smith Tom Hanks *NA* x 179020854
x Tony Curran Jeremy Renner Matt Damon x 93417865
x Chris Evans Eva Green *NA* x 85313679
..etc
First thing I did was transform the three actor_X_name
columns into one single column of unique actor names with the following line:
actors <- unique(data.frame(actor_name = c(df[,"actor_1_name"], df[,"actor_2_name"], df[,"actor_3_name"])))
The above returns a new data frame with a single column of unique actor names from the original dataset.
How can I best interate through these ~4900 observations, looking up the value in the original data frame across actor_X_name
's (3 conditions)? The goal being to fetch the gross
of the matching row so I can calculate on that.
Upvotes: 1
Views: 67
Reputation: 7848
Given your input:
df <- tibble::tribble(~title, ~actor_1_name, ~actor_2_name, ~actor_3_name, ~gross,
"A", "Christoph Waltz", "Johnny Depp", "Emma Stone", 309404152,
"B", "Mark Addy", "Christopher Lee", "Naomi Watts", 73058679 ,
"C", "Will Smith", "Tom Hanks", NA, 179020854,
"D", "Tony Curran", "Jeremy Renner", "Matt Damon", 93417865,
"E", "Chris Evans", "Eva Green", NA, 85313679)
df
#> # A tibble: 5 x 5
#> title actor_1_name actor_2_name actor_3_name gross
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 A Christoph Waltz Johnny Depp Emma Stone 309404152
#> 2 B Mark Addy Christopher Lee Naomi Watts 73058679
#> 3 C Will Smith Tom Hanks NA 179020854
#> 4 D Tony Curran Jeremy Renner Matt Damon 93417865
#> 5 E Chris Evans Eva Green NA 85313679
You can achieve your goal with one single tidyr
function: pivot_longer
library(tidyr)
df %>% pivot_longer(matches("actor_\\d_name"))
#> # A tibble: 15 x 4
#> title gross name value
#> <chr> <dbl> <chr> <chr>
#> 1 A 309404152 actor_1_name Christoph Waltz
#> 2 A 309404152 actor_2_name Johnny Depp
#> 3 A 309404152 actor_3_name Emma Stone
#> 4 B 73058679 actor_1_name Mark Addy
#> 5 B 73058679 actor_2_name Christopher Lee
#> 6 B 73058679 actor_3_name Naomi Watts
#> 7 C 179020854 actor_1_name Will Smith
#> 8 C 179020854 actor_2_name Tom Hanks
#> 9 C 179020854 actor_3_name NA
#> 10 D 93417865 actor_1_name Tony Curran
#> 11 D 93417865 actor_2_name Jeremy Renner
#> 12 D 93417865 actor_3_name Matt Damon
#> 13 E 85313679 actor_1_name Chris Evans
#> 14 E 85313679 actor_2_name Eva Green
#> 15 E 85313679 actor_3_name NA
With matches
you're selecting all columns written in that way.
The action you perform with pivot_longer
is to reshape your data so that the columns selected by matches
become one and the content of the other columns is repeated for each row.
title
and gross
will work as unique identifier of each row.
If you need to calculate the mean gross for each actor, you can use:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(matches("actor_\\d_name"), values_to = "actor_name") %>%
filter(!is.na(actor_name)) %>%
group_by(actor_name) %>%
summarise(mean_gross = mean(gross), .groups = "drop")
#> # A tibble: 14 x 2
#> actor_name mean_gross
#> <chr> <dbl>
#> 1 Chris Evans 85313679
#> 2 Christoph Waltz 309404152
#> 3 Christopher Lee 73058679
#> 4 Emma Stone 309404152
#> 5 Eva Green 85313679
#> 6 Jeremy Renner 93417865
#> 7 Johnny Depp 309404152
#> 8 Mark Addy 73058679
#> 9 Matt Damon 93417865
#> 10 Naomi Watts 73058679
#> 11 Tom Hanks 179020854
#> 12 Tony Curran 93417865
#> 13 Will Smith 179020854
Note that I filter
ed out the NAs, because they have no meaning.
Also in pivot_longer
I defined that name of the new column I created, so to be more readable and consistent with your intents.
Upvotes: 3