Reputation: 1511
I have two pandas dataframes in which each row represents a different author. Each has a column called 'publications' holding the list of publication_ids of that author; every list contains at least one element (min_len = 1).
df_1 = pd.DataFrame({'publications':[[34499803], [34499125], [34445802, 7092834]]}, index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]}, index=['2234', '543', '345'])
How can I combine them so that the results look like this?
df_sum = pd.DataFrame({'publications':[[65499803, 56899232, 34499803], [78999821, 34499125], [87499234, 34445802, 7092834]]}, index=['0', '4', '2423'])
The order of the elements does not matter. I tried using + but I get np.NaN. I also tried add, but it complains about the types (TypeError: unsupported operand type(s) for +: 'float' and 'list').
Note: I edited the question because I realized the minimal example I provided did not capture the problem, which comes from the indices. As I am combining the two tables, I only care about keeping the df_1 indices.
Upvotes: 1
Views: 864
Reputation: 862406
The two DataFrames have different index values, so their rows do not align. If both DataFrames have the same length, add reset_index(drop=True) to each one before adding:
df = df_1.reset_index(drop=True).add(df_2.reset_index(drop=True))
print (df)
publications
0 [34499803, 65499803, 56899232]
1 [34499125, 78999821]
2 [34445802, 7092834, 87499234]
If you need the same index as df_1, use:
df = df_1.add(df_2.set_index(df_1.index))
print (df)
publications
0 [34499803, 65499803, 56899232]
4 [34499125, 78999821]
2423 [34445802, 7092834, 87499234]
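To make the alignment issue concrete, here is a minimal sketch using the data from the question. It shows that the two indexes share no labels, so label-based addition has nothing to pair up, while relabelling df_2 with df_1's index pairs the rows by position. The names shared, aligned and positional are only illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({'publications': [[34499803], [34499125], [34445802, 7092834]]},
                    index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]},
                    index=['2234', '543', '345'])

# No label appears in both indexes
shared = df_1.index.intersection(df_2.index)
print(len(shared))

# + aligns on labels first: the result covers the union of both indexes,
# and each row has a value on one side only, so every result is missing
aligned = df_1 + df_2
print(len(aligned), aligned['publications'].isna().all())

# Give df_2 the labels of df_1 so rows are paired by position instead
positional = df_1.add(df_2.set_index(df_1.index))
print(positional.loc['2423', 'publications'])
```

This relies on both DataFrames having the same number of rows and on the row order being the pairing you want.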
Upvotes: 1
Reputation: 482
Well, I was guessing that the index values are important:
df_1 = pd.DataFrame({'publications':[[34499803], [34499125], [34445802, 7092834]]}, index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]}, index=['2234', '543', '345'])
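# keep df_1's index as a regular column named 'index'; discard df_2's index
df_1 = df_1.reset_index(drop=False)
df_2 = df_2.reset_index(drop=True)
# copy so that df_1 itself is not modified
df_sum = df_1.copy()
# both frames now share a 0..n-1 index, so the lists are concatenated row by row
df_sum['publications'] = df_1['publications'] + df_2['publications']
df_sum = df_sum.set_index('index')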
publications
index
0 [34499803, 65499803, 56899232]
4 [34499125, 78999821]
2423 [34445802, 7092834, 87499234]
This way you keep the index of df_1, but it also assumes that both DataFrames have the same length.
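Since the approach depends on equal lengths, it may be worth checking that explicitly. The following is only a sketch: concat_lists_by_position is a hypothetical helper name, not a pandas function, and it pairs rows purely by position.

```python
import pandas as pd

def concat_lists_by_position(left, right, column='publications'):
    """Concatenate the lists in `column` row by row, keeping left's index.

    Hypothetical helper: rows are paired by position, not by index label.
    """
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} != {len(right)}")
    # plain Python list concatenation for each pair of rows
    merged = [a + b for a, b in zip(left[column], right[column])]
    return pd.DataFrame({column: merged}, index=left.index)

df_1 = pd.DataFrame({'publications': [[34499803], [34499125], [34445802, 7092834]]},
                    index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]},
                    index=['2234', '543', '345'])

df_sum = concat_lists_by_position(df_1, df_2)
print(df_sum)
```

Unlike the snippet above, this leaves df_1 and df_2 unchanged and fails loudly if the lengths differ.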
Upvotes: 0
Reputation: 5599
I've managed to reproduce your problem by adding a single value as a float:
>>> df_1 = pd.DataFrame({'publications':[[34499803], float(34499125), [34445802, 7092834]]})
>>> df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]})
>>> df_1+df_2
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'float' and 'list'
If this is the case, it can be solved by transforming the single values into lists:
>>> df_1["publications"]=df_1["publications"].apply(lambda x: [x] if isinstance(x, float) else x)
>>> df_1+df_2
publications
0 [34499803, 65499803, 56899232]
1 [34499125.0, 78999821]
2 [34445802, 7092834, 87499234]
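Note that the id in row 1 stays a float (34499125.0). If that matters, a slightly broader normalizer could be used. This is a sketch under two assumptions that are mine, not the question's: ids are whole numbers, and a missing value (NaN) means "no publications". as_id_list is a hypothetical helper name.

```python
import math
import pandas as pd

def as_id_list(value):
    """Return `value` as a list of publication ids (hypothetical helper)."""
    if isinstance(value, list):
        return value
    if isinstance(value, float):
        if math.isnan(value):
            return []        # assumption: a missing entry means no publications
        return [int(value)]  # assumption: ids are whole numbers
    return [value]           # any other scalar is wrapped as-is

df_1 = pd.DataFrame({'publications': [[34499803], float(34499125), [34445802, 7092834]]})
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]})

df_1['publications'] = df_1['publications'].apply(as_id_list)
df_sum = df_1 + df_2
print(df_sum.loc[1, 'publications'])
```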
Upvotes: 0