C.G
C.G

Reputation: 87

Multiple Column + Conditional Joins

I have two data frames that I would like to join together (bike rides data frame and a bike stations data frame).

I have been working with pandas library but I can't seem to write the code to manipulate the join perfectly. Initially, I was just joining on the key "station_id" but I found a more recently updated stations data set that included more stations, the problem is there are some stations that do not have a station_id. For those stations, I wanted to join on matching the latitude and longitude coordinates.

Initial code for when I was just using station_id to join the data frames

rides_df = rides_df.rename(columns = {'start_station_id': 'station_id'})
rides_df = rides_df.merge(stations_df[['station_id','station_name']],
                         on = 'station_id', how = 'left')
rides_df = rides_df.rename(columns = {'station_id':'start_station_id',
                                     'station_name':'station_name_start'})

#merge ending station name
rides_df = rides_df.rename(columns = {'end_station_id': 'station_id'})
rides_df = rides_df.merge(stations_df[['station_id', 'station_name']],
                         on = 'station_id', how = 'left')
rides_df = rides_df.rename(columns = {'station_id':'end_station_id',
                                     'station_name': 'station_name_end'})

The rides data frame is structured as follows (sampled):

rides_df = pd.DataFrame([[1912818,'Round Trip',3014,34.0566101,-118.23721,3014,34.0566101,-118.23721],
                 [1933383,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
                  [1944197,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
                  [1940317,'Round Trip','NaN',34.03352,-118.24184,'NaN',34.03352,-118.24184],
                  [1944075,'One Way',3021,34.0456085,-118.23703,3016,34.0566101,-118.23721]]
                 , columns = ['trip_id','trip_route_category','start_station_id','start_lat',
                              'start_lon','end_station_id','end_lat','end_lon'])

enter image description here

The stations data frame is structured as follows (sampled):

stations_df = pd.DataFrame([['Union Station West Portal',34.05661,-118.23721,3014],
                            ['Los Angeles & Temple',34.0529,-118.24156,3016],
                            ['Grand & Olympic',34.04373,-118.26014,3018],
                            ['12th & Hill',34.03861,-118.26086,3019],
                            ['Hill & Washington',34.03105,-118.26709,3020],
                            ['Row DTLA',34.03352,-118.24184,'NaN']], 
                           columns = ['station_name', 'lat', 'lon','station_id'])

enter image description here What I want is to add the station name for the starting location and ending location on the rides data frame so I would have a column for "Start_Station_Name" and "End_Station_Name". I would want to join on "station_id" but if station_id is NaN then to match lat&lon for both start and end.

The data frame that I want as a result is structured as follows:
want_df = pd.DataFrame([[1912818,'Round Trip','Union Station West Portal',3014,34.0566101,-118.23721,'Union Station West Portal',3014,34.0566101,-118.23721],
                 [1933383,'Round Trip','Los Angeles & Temple',3016,34.0528984,-118.24156,'Los Angeles & Temple',3016,34.0528984,-118.24156],
                  [1944197,'Round Trip','Los Angeles & Temple',3016,34.0528984,-118.24156,'Los Angeles & Temple',3016,34.0528984,-118.24156],
                  [1940317,'Round Trip','Row DTLA','Nan',34.03352,-118.24184,'Row DTLA','Nan',34.03352,-118.24184],
                [1944075,'One Way','NaN',3021,34.0456085,-118.23703,'Los Angeles & Temple',3016,34.0566101,-118.23721]]
                 , columns = ['trip_id','trip_route_category','start_station_name','start_station_id','start_lat',
                              'start_lon','end_station_name','end_station_id','end_lat','end_lon'])

enter image description here

Upvotes: 1

Views: 75

Answers (1)

Akaisteph7
Akaisteph7

Reputation: 6496

Here is the updated verison of your code to achieve this:

# rides_df and station_df are slightly modified to make sure that the code works as intended
rides_df = pd.DataFrame([[1912818,'Round Trip',3014,34.0566101,-118.23721,3014,34.0566101,-118.23721],
                 [1933383,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
                  [1944197,'Round Trip',3016,34.0528984,-118.24156,3016,34.0528984,-118.24156],
                  [1940317,'Round Trip','NaN' ,34.03352,-118.24184,3018,34.03352,-118.24184],
                  [1944075,'One Way',3021,34.0456085,-118.23703,3016,34.0566101,-118.23721]]
                 , columns = ['trip_id','trip_route_category','start_station_id','start_lat',
                              'start_lon','end_station_id','end_lat','end_lon'])                          

stations_df = pd.DataFrame([['Union Station West Portal',34.05661,-118.23721,'NaN'],
                            ['Los Angeles & Temple',34.0529,-118.24156,3016],
                            ['Grand & Olympic',34.04373,-118.26014,3018],
                            ['12th & Hill',34.03861,-118.26086,3019],
                            ['Hill & Washington',34.03105,-118.26709,3020],
                            ['Row DTLA',34.03352,-118.24184,'NaN']], 
                           columns = ['station_name', 'lat', 'lon','station_id'])


# Convert to floats to match NaNs
rides_df[["start_station_id", "end_station_id"]] = rides_df[["start_station_id", "end_station_id"]].astype(float)
stations_df["station_id"] = stations_df["station_id"].astype(float)
# Convert the NaNs to another invalid id so they stop matching on merge
stations_df.loc[stations_df["station_id"].isnull(), "station_id"] = -1
# Round so numbers are an exact match
rides_df = rides_df.round(5)

# Merge beginning station name
rides_df = rides_df.rename(columns = {'start_station_id': 'station_id', 
                                      'start_lat': 'lat', 'start_lon': 'lon'})
rides_df = rides_df.merge(stations_df[['station_id','station_name']],
                         on = 'station_id', how = 'left')
# Merge again by looking at lat/lon values
rides_df = rides_df.merge(stations_df[['lat', 'lon','station_name']],
                         on = ['lat', 'lon'], how = 'left')
# Merge the two merge results
rides_df.loc[:, "station_name"] = rides_df["station_name_x"].combine(rides_df["station_name_y"], lambda x,y: x if not x!=x else y)
rides_df.drop(["station_name_x", "station_name_y"], axis=1, inplace=True)
rides_df = rides_df.rename(columns = {'station_id':'start_station_id',
                                     'station_name':'start_station_name', 
                                     'lat':'start_lat', 'lon':'start_lon'})

# Merge ending station name
rides_df = rides_df.rename(columns = {'end_station_id': 'station_id',
                                      'start_lat': 'lat', 'start_lon': 'lon'})
rides_df = rides_df.merge(stations_df[['station_id', 'station_name']],
                         on = 'station_id', how = 'left')
rides_df = rides_df.merge(stations_df[['lat', 'lon','station_name']],
                         on = ['lat', 'lon'], how = 'left')
rides_df.loc[:, "station_name"] = rides_df["station_name_x"].combine(rides_df["station_name_y"], lambda x,y: x if not x!=x else y)
rides_df.drop(["station_name_x", "station_name_y"], axis=1, inplace=True)
rides_df = rides_df.rename(columns = {'station_id':'end_station_id',
                                     'station_name': 'end_station_name',
                                     'lat':'start_lat', 'lon':'start_lon'})

print(rides_df)

Output:

   trip_id trip_route_category  start_station_id  start_lat  start_lon  end_station_id   end_lat    end_lon         start_station_name           end_station_name
0  1912818          Round Trip            3014.0   34.05661 -118.23721          3014.0  34.05661 -118.23721  Union Station West Portal  Union Station West Portal
1  1933383          Round Trip            3016.0   34.05290 -118.24156          3016.0  34.05290 -118.24156       Los Angeles & Temple       Los Angeles & Temple
2  1944197          Round Trip            3016.0   34.05290 -118.24156          3016.0  34.05290 -118.24156       Los Angeles & Temple       Los Angeles & Temple
3  1940317          Round Trip               NaN   34.03352 -118.24184          3018.0  34.03352 -118.24184                   Row DTLA            Grand & Olympic
4  1944075             One Way            3021.0   34.04561 -118.23703          3016.0  34.05661 -118.23721                        NaN       Los Angeles & Temple

Upvotes: 1

Related Questions