Vincent
Vincent

Reputation: 1614

Python + Pandas : Update ONE column in csv based on another csv

I have two similar tables ("hist.csv") :

Historical :
id | url | url2 | url3 | Time
1    A      B      C      5
2    D      E      F      8

And ("new.csv") :

New :
id | url | url2 | url3 | Time
1    A      Z      K      9
2    G      H      I      11

I want to update the New.Time column with the Historical.Time value, if "url" columns match. i.e. desired output here with url "A" updated :

New2 :
id | url | url2 | url3 | Time
1    A      Z      K      5
2    G      H      I      11

I tried the following :

Historical = pd.DataFrame.from_csv("hist.csv", index_col='id', sep='\t', encoding='utf-8')
New = pd.DataFrame.from_csv("new.csv", index_col='id', sep='\t', encoding='utf-8')

for index, row in New.iterrows():
    New.loc[index,'Time']=Historical.loc[historical['url'] == row['url'],'Time']

New.to_csv("new2.csv", sep='\t', encoding='utf-8')

Raising :

 ValueError: Must have equal len keys and value when setting with an iterable

PS : I found this thread : Updating a DataFrame based on another DataFrame But it looks like the proposed solution with "merge" does not really fit my needs, as I have many columns ?

Upvotes: 2

Views: 3206

Answers (1)

Anand S Kumar
Anand S Kumar

Reputation: 90999

The basic issue is that Historical.loc[Historical['url'] == row['url'],'Time'] returns a Series (even if there is only one row or no row where the condition - Historical['url'] == row['url'] -matches). Example -

In [15]: df
Out[15]:
   A  B
0  1  2
1  2  3

In [16]: df.loc[df['A']==1,'B']
Out[16]:
0    2
Name: B, dtype: int64

And then you try to set this DataFrame into a single cell of your New dataframe, which is what is causing the issue.

Since in the comments you say -

I may have several rows with "url" in Historical, but they will have the same Time value. In that case, I should consider the first occurence/match.

A quick fix for your code would be to check whether row['url'] exists in the other DataFrame, and only if true, to get the value from it using -

for index, row in New.iterrows():
    if row['url'] in Historical['url'].values:
        row['Time']=Historical.loc[Historical['url'] == row['url'],'Time'].values[0]

Upvotes: 1

Related Questions