M O R E L
M O R E L

Reputation: 17

Question about my computation (Using R with dplyr and nyflights13 to figure out number of seat miles by carrier)

I understand the problem and showed all my work. I'm working through the modern dive data science book (https://moderndive.com/3-wrangling.html#joins book), and got stuck on (LC3.20) at the end of chapter 3.Using the nycflights13 package on R and dplyr, I'm supposed to generate a tibble that has only two columns, airline name and seat miles. Seat miles is just seats * miles. I understand the problem and I thought my code was going to output the correct result, however my seat miles are different for each airline carrier than in the solution. Can someone please help me to figure out why my code went wrong. Additionally, I do understand the books solution, I just don't know why my solution is wrong. I posted all my work.

   #seat miles = miles*seats

View(flights) #distance and identifiers year and tail num and carrier
View(airlines) # names and indentifiers carrier
View(planes) #seats and identifiers year and tail num


#join names to flights 

named_flights <- flights %>%
  inner_join(airlines, by = 'carrier')

named_flights #same number of rows, all good
flights

#join seats to named_flights

named_seat_flights <- named_flights %>%
  inner_join(planes, by = c('tailnum'))
named_seat_flights #noticed 52,596 rows are missing
#when joining tailnum to named_flights

table(is.na(select(named_flights, 'tailnum')))
#2512 rows that has NA values for tailnum in named_flights

table(is.na(select(planes, 'tailnum')))
#no tailnum data is missing from planes dataset
#and since a given plane (with a given tailnum)
#can take off multiple times per year
#we can conclude that the 52,596 missing rows
#are a result of the missing tailnum data in flights (also named_flights)


named_seat_miles_by_airline_name <- named_seat_flights %>%
  group_by(name) %>%
  summarise(seat_miles = sum(seats, na.rm =  T)*sum(distance,na.rm = T)) %>%
  rename(airline_name = name) %>%
  arrange(desc(seat_miles))
named_seat_miles_by_airline_name #not correct

View(named_seat_miles_by_airline_name)  



flights %>%                                # book solution
  inner_join(planes, by = "tailnum") %>%
  select(carrier, seats, distance) %>%
  mutate(ASM = seats * distance) %>%
  group_by(carrier) %>%
  summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
  arrange(desc(ASM))**strong text**

The output of my code is

    # A tibble: 16 x 2
   airline_name           seat_miles
   <chr>                       <dbl>
 1 United Air Lines Inc.     8.73e14
 2 Delta Air Lines Inc.      4.82e14
 3 JetBlue Airways           4.13e14
 4 ExpressJet Airlines I~    9.82e13
 5 US Airways Inc.           3.83e13
 6 American Airlines Inc.    3.38e13
 7 Southwest Airlines Co.    2.10e13
 8 Endeavor Air Inc.         1.28e13
 9 Virgin America            1.19e13
10 AirTran Airways Corpo~    6.68e11
11 Alaska Airlines Inc.      2.24e11
12 Hawaiian Airlines Inc.    2.20e11
13 Frontier Airlines Inc.    1.17e11
14 Mesa Airlines Inc.        1.17e10
15 Envoy Air                 7.10e 9
16 SkyWest Airlines Inc.     4.08e 7

The output of books code is

# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 UA      15516377526
 2 DL      10532885801
 3 B6       9618222135
 4 AA       3677292231
 5 US       2533505829
 6 VX       2296680778
 7 EV       1817236275
 8 WN       1718116857
 9 9E        776970310
10 HA        642478122
11 AS        314104736
12 FL        219628520
13 F9        184832280
14 YV         20163632
15 MQ          7162420
16 OO          1299835

Also, I know I have airline names instead of carrier, but thats actually what was asked.

Upvotes: 0

Views: 78

Answers (1)

Jon Spring
Jon Spring

Reputation: 66445

The code replaces the sum of the products with the product of the sums.

Compare these:

...
filter(!is.na(seats)) %>%
summarise(seat_miles_sums = sum(seats, na.rm =  T)*sum(distance,na.rm = T),
          seat_miles = sum(seats*distance))
... 

Graphically, the question is asking for something like the areas below left, but your code calculates the area below right.

XXX      YY            XXXYY
XXX   +  YY      <     XXXYY
         YY            XXXYY

Upvotes: 1

Related Questions