kshnkvn
kshnkvn

Reputation: 956

How to combine several csv in one with identical rows?

I have several csv files with approximately the following structure:

name,title,status,1,2,3
name,title,status,4,5,6
name,title,status,7,8,9

Most of the name columns is the same in all files, only the columns 1,2,3,4... are different. I need to take turns adding new columns to existing and new rows, as well as updating the remaining rows each time. For example, I have 2 tables:

name,title,status,1,2,3
Foo,Bla-bla-bla,10,45.6,12.3,45.2
Bar,Too-too,13,13.4,22.6,75.1
name,title,status,4,5,6
Foo,Bla-bla-bla,14,25.3,125.3,5.2
Fobo,Dom-dom,20,53.4,2.9,11.3

And at the output I expect a table:

name,title,status,1,2,3,4,5,6
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3

I did not find anything similar, who can tell how I can do this?

Upvotes: 1

Views: 45

Answers (2)

piRSquared
piRSquared

Reputation: 294258

It looks like you want to keep just one version of ['name', 'title', 'status'] and from your example, you prefer to keep the last 'status' encountered.

I'd use pd.concat and follow that up with a groupby to filter out duplicate status.

df = pd.concat([
    pd.read_csv(fp, index_col=['name', 'title', 'status'])
    for fp in ['data1.csv', 'data2.csv']
], axis=1).reset_index('status').groupby(level=['name', 'title']).last()

df

                  status     1     2     3     4      5     6
name title                                                   
Bar  Too-too          13  13.4  22.6  75.1   NaN    NaN   NaN
Fobo Dom-dom          20   NaN   NaN   NaN  53.4    2.9  11.3
Foo  Bla-bla-bla      14  45.6  12.3  45.2  25.3  125.3   5.2

Then df.to_csv() produces

name,title,status,1,2,3,4,5,6
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2

Upvotes: 2

Code Different
Code Different

Reputation: 93161

Keep merging them:

df = None
for path in ['data1.csv', 'data2.csv']:
    sub_df = pd.read_csv(path)

    if df is None:
        df = sub_df
    else:
        df = df.merge(sub_df, on=['name', 'title', 'status'], how='outer')

Upvotes: 0

Related Questions