Brayn
Brayn

Reputation: 416

Look up value in data frame stored over multiple columns

I am trying to calculate the average amount of money actors are generating for movies. In my dataset this is defined by gross as shown in a snippet below:

movie_title  actor_1_name     actor_2_name     actor_3_name  ...  gross
x            Christoph Waltz  Johnny Depp      Emma Stone    x    309404152
x            Mark Addy        Christopher Lee  Naomi Watts   x    73058679 
x            Will Smith       Tom Hanks        *NA*          x    179020854
x            Tony Curran      Jeremy Renner    Matt Damon    x    93417865
x            Chris Evans      Eva Green        *NA*          x    85313679
..etc  

First thing I did was transform the three actor_X_name columns into one single column of unique actor names with the following line:

actors <- unique(data.frame(actor_name = c(df[,"actor_1_name"], df[,"actor_2_name"], df[,"actor_3_name"])))

The above returns a new data frame with a single column of unique actor names from the original dataset.

How can I best interate through these ~4900 observations, looking up the value in the original data frame across actor_X_name's (3 conditions)? The goal being to fetch the gross of the matching row so I can calculate on that.

Upvotes: 1

Views: 67

Answers (1)

Edo
Edo

Reputation: 7848

Given your input:

df <- tibble::tribble(~title, ~actor_1_name, ~actor_2_name, ~actor_3_name, ~gross,
                      "A", "Christoph Waltz",  "Johnny Depp",      "Emma Stone",    309404152,
                      "B", "Mark Addy",        "Christopher Lee",  "Naomi Watts",   73058679 ,
                      "C", "Will Smith",       "Tom Hanks",        NA,            179020854,
                      "D", "Tony Curran",      "Jeremy Renner",    "Matt Damon",    93417865,
                      "E", "Chris Evans",      "Eva Green",        NA,           85313679)
df
#> # A tibble: 5 x 5
#>   title actor_1_name    actor_2_name    actor_3_name     gross
#>   <chr> <chr>           <chr>           <chr>            <dbl>
#> 1 A     Christoph Waltz Johnny Depp     Emma Stone   309404152
#> 2 B     Mark Addy       Christopher Lee Naomi Watts   73058679
#> 3 C     Will Smith      Tom Hanks       NA           179020854
#> 4 D     Tony Curran     Jeremy Renner   Matt Damon    93417865
#> 5 E     Chris Evans     Eva Green       NA            85313679

You can achieve your goal with one single tidyr function: pivot_longer

library(tidyr)
df %>% pivot_longer(matches("actor_\\d_name"))
#> # A tibble: 15 x 4
#>    title     gross name         value          
#>    <chr>     <dbl> <chr>        <chr>          
#>  1 A     309404152 actor_1_name Christoph Waltz
#>  2 A     309404152 actor_2_name Johnny Depp    
#>  3 A     309404152 actor_3_name Emma Stone     
#>  4 B      73058679 actor_1_name Mark Addy      
#>  5 B      73058679 actor_2_name Christopher Lee
#>  6 B      73058679 actor_3_name Naomi Watts    
#>  7 C     179020854 actor_1_name Will Smith     
#>  8 C     179020854 actor_2_name Tom Hanks      
#>  9 C     179020854 actor_3_name NA             
#> 10 D      93417865 actor_1_name Tony Curran    
#> 11 D      93417865 actor_2_name Jeremy Renner  
#> 12 D      93417865 actor_3_name Matt Damon     
#> 13 E      85313679 actor_1_name Chris Evans    
#> 14 E      85313679 actor_2_name Eva Green      
#> 15 E      85313679 actor_3_name NA             

With matches you're selecting all columns written in that way.

The action you perform with pivot_longer is to reshape your data so that the columns selected by matches become one and the content of the other columns is repeated for each row.

title and gross will work as unique identifier of each row.


If you need to calculate the mean gross for each actor, you can use:

library(tidyr)
library(dplyr)

df %>% 
 pivot_longer(matches("actor_\\d_name"), values_to = "actor_name") %>% 
 filter(!is.na(actor_name)) %>% 
 group_by(actor_name) %>% 
 summarise(mean_gross = mean(gross), .groups = "drop")
#> # A tibble: 14 x 2
#>    actor_name      mean_gross
#>    <chr>                <dbl>
#>  1 Chris Evans      85313679 
#>  2 Christoph Waltz 309404152 
#>  3 Christopher Lee  73058679 
#>  4 Emma Stone      309404152 
#>  5 Eva Green        85313679 
#>  6 Jeremy Renner    93417865 
#>  7 Johnny Depp     309404152 
#>  8 Mark Addy        73058679 
#>  9 Matt Damon       93417865 
#> 10 Naomi Watts      73058679 
#> 11 Tom Hanks       179020854 
#> 12 Tony Curran      93417865 
#> 13 Will Smith      179020854 

Note that I filtered out the NAs, because they have no meaning. Also in pivot_longer I defined that name of the new column I created, so to be more readable and consistent with your intents.

Upvotes: 3

Related Questions