dwilson
dwilson

Reputation: 37

Merging Dataframe1 rows into Dataframe2 using existing column headers - Python Pandas

I am attempting to build a python dataframe to capture an initial dataset, and then update fields as csv files containing data become available in a specified folder location. I have built my initial table and saved it to a csv, but when I reopen the table and try to bring in the next text file data I am accidentally duplicating the headers for the new data everytime I run the code. What is the best way to read in the new text files and append the data to a row based on a common column name?

My current attempt is to build the new text files into their own database, match the column headers from the initial import, and merge on the common header with the starting dataset. However, like I said the columns just keep duplicating themselves and adding _x, _y etc. afterwards.

Sample of Initial Dataframe...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,

Sample of Data to be Merged...

,Line Name,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
0,A901,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
1,A902,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
2,A903,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

I've tried merging, appending and concatenating with varying combinations of rules applied, but without any luck. Some error, others duplicate the columns...

df_proj = pd.concat([df_in,df_acc], axis=1,sort=False)

or

df_proj = pd.merge(df_in, df_acc, on='Line Name', how = 'right')

I've tried to find examples of this type of problem without luck, I be going about this the wrong way. Maybe I shouldn't be treating the new text files as their own dataframe to begin with. But any help would be greatly appreciated. Thank you!

EDIT:

The expected result would look like this...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

Upvotes: 1

Views: 192

Answers (3)

jezrael
jezrael

Reputation: 863611

Solution if need replace missing values by another DataFrame:

df = (df_in.set_index('Line Name')
           .combine_first(df_acc.set_index('Line Name'))
           .reset_index()
           .reindex(columns=df_in.columns))
print (df)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51  15594.0    3.28     588766.68     4379653.81  588901.53   
1  17:40:02  25764.0    3.02     564600.29     4350120.19  568980.53   
2  01:03:26  30615.0    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

If need concat together by Line Name with remove only NaNs columns:

df_proj = (pd.concat([df_in.set_index('Line Name').dropna(how='all', axis=1),
                     df_acc.set_index('Line Name')], axis=1, sort=False)
             .reset_index())
print (df_proj)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51    15594    3.28     588766.68     4379653.81  588901.53   
1  17:40:02    25764    3.02     564600.29     4350120.19  568980.53   
2  01:03:26    30615    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

EDIT:

For check duplicates:

print (df_in[df_in['Line Name'].duplicated(keep=False)])
print (df_acc[df_acc['Line Name'].duplicated(keep=False)])

For remove duplicates by Line Name use:

df_in = df_in.drop_duplicates('Line Name')
df_acc = df_acc.drop_duplicates('Line Name')

Upvotes: 1

Rajith Thennakoon
Rajith Thennakoon

Reputation: 4130

lets say your starting dataframe as df_in

columns = df_in.columns
df_proj = pd.concat([df_in,df_acc[columns]], axis=0,sort=False)

Upvotes: 1

oreopot
oreopot

Reputation: 3450

You could do something like following:

#using append 
final_df = df_in.append(df_acc)

OR

use axis=0, meaning row wise

df_proj = pd.concat([df_in,df_acc], axis=0,sort=False)

Upvotes: 0

Related Questions