Reputation: 17
I understand the problem and showed all my work. I'm working through the modern dive data science book (https://moderndive.com/3-wrangling.html#joins book), and got stuck on (LC3.20) at the end of chapter 3.Using the nycflights13 package on R and dplyr, I'm supposed to generate a tibble that has only two columns, airline name and seat miles. Seat miles is just seats * miles. I understand the problem and I thought my code was going to output the correct result, however my seat miles are different for each airline carrier than in the solution. Can someone please help me to figure out why my code went wrong. Additionally, I do understand the books solution, I just don't know why my solution is wrong. I posted all my work.
#seat miles = miles*seats
View(flights) #distance and identifiers year and tail num and carrier
View(airlines) # names and indentifiers carrier
View(planes) #seats and identifiers year and tail num
#join names to flights
named_flights <- flights %>%
inner_join(airlines, by = 'carrier')
named_flights #same number of rows, all good
flights
#join seats to named_flights
named_seat_flights <- named_flights %>%
inner_join(planes, by = c('tailnum'))
named_seat_flights #noticed 52,596 rows are missing
#when joining tailnum to named_flights
table(is.na(select(named_flights, 'tailnum')))
#2512 rows that has NA values for tailnum in named_flights
table(is.na(select(planes, 'tailnum')))
#no tailnum data is missing from planes dataset
#and since a given plane (with a given tailnum)
#can take off multiple times per year
#we can conclude that the 52,596 missing rows
#are a result of the missing tailnum data in flights (also named_flights)
named_seat_miles_by_airline_name <- named_seat_flights %>%
group_by(name) %>%
summarise(seat_miles = sum(seats, na.rm = T)*sum(distance,na.rm = T)) %>%
rename(airline_name = name) %>%
arrange(desc(seat_miles))
named_seat_miles_by_airline_name #not correct
View(named_seat_miles_by_airline_name)
flights %>% # book solution
inner_join(planes, by = "tailnum") %>%
select(carrier, seats, distance) %>%
mutate(ASM = seats * distance) %>%
group_by(carrier) %>%
summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
arrange(desc(ASM))**strong text**
The output of my code is
# A tibble: 16 x 2
airline_name seat_miles
<chr> <dbl>
1 United Air Lines Inc. 8.73e14
2 Delta Air Lines Inc. 4.82e14
3 JetBlue Airways 4.13e14
4 ExpressJet Airlines I~ 9.82e13
5 US Airways Inc. 3.83e13
6 American Airlines Inc. 3.38e13
7 Southwest Airlines Co. 2.10e13
8 Endeavor Air Inc. 1.28e13
9 Virgin America 1.19e13
10 AirTran Airways Corpo~ 6.68e11
11 Alaska Airlines Inc. 2.24e11
12 Hawaiian Airlines Inc. 2.20e11
13 Frontier Airlines Inc. 1.17e11
14 Mesa Airlines Inc. 1.17e10
15 Envoy Air 7.10e 9
16 SkyWest Airlines Inc. 4.08e 7
The output of books code is
# A tibble: 16 x 2
carrier ASM
<chr> <dbl>
1 UA 15516377526
2 DL 10532885801
3 B6 9618222135
4 AA 3677292231
5 US 2533505829
6 VX 2296680778
7 EV 1817236275
8 WN 1718116857
9 9E 776970310
10 HA 642478122
11 AS 314104736
12 FL 219628520
13 F9 184832280
14 YV 20163632
15 MQ 7162420
16 OO 1299835
Also, I know I have airline names instead of carrier, but thats actually what was asked.
Upvotes: 0
Views: 78
Reputation: 66445
The code replaces the sum of the products with the product of the sums.
Compare these:
...
filter(!is.na(seats)) %>%
summarise(seat_miles_sums = sum(seats, na.rm = T)*sum(distance,na.rm = T),
seat_miles = sum(seats*distance))
...
Graphically, the question is asking for something like the areas below left, but your code calculates the area below right.
XXX YY XXXYY
XXX + YY < XXXYY
YY XXXYY
Upvotes: 1