Nikos P
Nikos P

Reputation: 31

Pandas: Updating dataframe column from other dataframes

I am facing a problem updating the datetime values on a column combining data from other dataframes. The first df includes the car data of a car rental company:

df_cars = pd.DataFrame([['1708092730', 'MERC', 'GLK270', 'PLT01', '', ''],
                        ['T170900224', 'MERC', 'C200', 'PLT02', '', ''],
                        ['1708092733', 'BMW', 'Χ3', 'PLT03', '', ''],
                        ['T170902781', 'BMW', 'X6', 'PLT04', '', ''],
                        ['T171002176', 'VW', 'PASSAT', 'PLT05', '', ''],
                        ['1708090817', 'FORD', 'MUSTANG', 'PLT06', '', '']],
                columns=['RefNo', 'Brand', 'Model', 'Plate', 'Driver', 'Date'])

The second df consists of the garage outbound records:

df_out = pd.DataFrame([['1708092733', 'John', '2018-1-2', 'MC', 250],
                       ['T170900224', 'Phil', '2018-1-15', 'AMEX', 400],
                       ['T170902781', 'Ann', '2018-1-22', 'VI', 400],
                       ['1708092730', 'Ann', '2018-1-28', 'AMEX', 200],
                       ['1708092733', 'Jeff', '2018-2-1', 'VI', 200]],
               columns=['RefNo', 'Driver', 'Date', 'Payment', 'Deposit'])

and the third df includes the garage inbound records:

df_in = pd.DataFrame([['1708092733', '2018-1-18', 165],
                      ['T170902781', '2018-2-12', 600],
                      ['1708092730', '2018-2-15', 150]], 
              columns=['RefNo', 'Date', 'Ammount'])

The RefNo is a unique barcode on car keys. I am trying to update the df_cars columns:

The following update code is working fine with a small number of rows

df_OutIn = pd.concat([df_out, df_in])                           # concatenate Garage out & in entries
df_OutIn['Date'] = pd.to_datetime(df_OutIn.Date)                # OutIn df Date column to datetile format 
df_OutIn = df_OutIn.sort_values(by='Date')                      # sort OutIn df by Date
df_OutIn = df_OutIn.drop_duplicates(['RefNo'], keep='last')     # remove duplicates and keep last update per car

df_tmp = df_cars[['RefNo', 'Driver']].set_index('RefNo')        # Update 'Driver' in  df_cars
df_tmp.update(df_OutIn.set_index('RefNo'))
df_cars['Driver'] = df_tmp.values

df_tmp = df_cars[['RefNo', 'Date']].set_index('RefNo')         # Update 'Date' in  df_cars
df_tmp.update(df_OutIn.set_index('RefNo'))
df_cars['Date'] = df_tmp.values
df_cars['Date'] = pd.to_datetime(df_cars.Date)

The df_cars is expected to be updated to

        RefNo Brand    Model  Plate Driver        Date
0  1708092730  MERC   GLK270  PLT01         2018-02-15   
1  T170900224  MERC     C200  PLT02   Phil  2018-01-15    
2  1708092733   BMW       Χ3  PLT03   Jeff  2018-02-01   
3  T170902781   BMW       X6  PLT04         2018-02-12   
4  T171002176    VW   PASSAT  PLT05                NaT
5  1708090817  FORD  MUSTANG  PLT06                NaT

When the whole dataset (25-30k records) is processed i'm getting the warning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Are there any suggestions for improving this part of code? Thank you in advance!

Upvotes: 1

Views: 214

Answers (1)

Gabriel A
Gabriel A

Reputation: 1827

your code after drop duplicates can be replaced with.

df_cars = df_cars.merge(df_OutIn[["RefNo","Date","Driver"]],how="left",on="RefNo")

Because your df_cars has an empty column for Date and Driver the merge will add a suffix to the column names. To get around this, just remove those empty columns from df_cars.

Upvotes: 1

Related Questions