G. Macia

Reputation: 1511

How to combine two pandas series that contain lists?

I have two pandas DataFrames in which each row represents a different author. Each has a column called 'publications' holding the list of publication_ids of that author, with at least one element per list.

df_1 = pd.DataFrame({'publications':[[34499803], [34499125], [34445802, 7092834]]}, index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]}, index=['2234', '543', '345'])

How can I combine them so that the results look like this?

df_sum = pd.DataFrame({'publications':[[65499803, 56899232, 34499803], [78999821, 34499125], [87499234, 34445802, 7092834]]}, index=['0', '4', '2423'])

The order of the elements does not matter. I tried using +, but I get np.NaN. I also tried add, but it complains about the types (TypeError: unsupported operand type(s) for +: 'float' and 'list').

Note: I edited the question because I realized the minimal example I originally provided did not capture the problem, which comes from the indices. When combining the two tables, I only care about keeping the indices of df_1.

Upvotes: 1

Views: 864

Answers (3)

jezrael

Reputation: 862406

The index values are different here, so if both DataFrames have the same length, add reset_index(drop=True):

df = df_1.reset_index(drop=True).add(df_2.reset_index(drop=True))

print (df)
                     publications
0  [34499803, 65499803, 56899232]
1            [34499125, 78999821]
2   [34445802, 7092834, 87499234]

If you need the same index as df_1, use:

df = df_1.add(df_2.set_index(df_1.index))

print (df)
                        publications
0     [34499803, 65499803, 56899232]
4               [34499125, 78999821]
2423   [34445802, 7092834, 87499234]
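To make the reason for the NaN values visible, here is a minimal sketch using the frames from the question. It only illustrates that pandas pairs rows by index label; the variable names `aligned`, `all_missing`, and `relabelled` are mine and not part of the original answer.

```python
import pandas as pd

df_1 = pd.DataFrame({'publications': [[34499803], [34499125], [34445802, 7092834]]},
                    index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]},
                    index=['2234', '543', '345'])

# None of df_1's labels exist in df_2, so aligning df_2 to df_1's index
# leaves nothing but missing values to add to.
aligned = df_2['publications'].reindex(df_1.index)
all_missing = bool(aligned.isna().all())

# Giving df_2 the labels of df_1 pairs the rows by position instead.
relabelled = df_2.set_index(df_1.index)
same_labels = relabelled.index.equals(df_1.index)
```

Once the labels match, `add` (or `+`) concatenates the lists row by row, as in the output above.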

Upvotes: 1

gal peled

Reputation: 482

Well, I was guessing that the index values are important:

import pandas as pd

df_1 = pd.DataFrame({'publications':[[34499803], [34499125], [34445802, 7092834]]}, index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]}, index=['2234', '543', '345'])
# Keep df_1's index as a regular column and discard df_2's index
df_1 = df_1.reset_index(drop=False)
df_2 = df_2.reset_index(drop=True)
df_sum = df_1.copy()  # copy so that df_1 itself is not modified
df_sum['publications'] = df_1['publications'] + df_2['publications']
df_sum = df_sum.set_index('index')

                         publications
index                                
0      [34499803, 65499803, 56899232]
4                [34499125, 78999821]
2423    [34445802, 7092834, 87499234]

This way you keep the index, but it also assumes that both DataFrames have the same length.
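If the same-length assumption should be checked explicitly, a rough sketch could look like the following. The helper `combine_by_position` is a hypothetical name of mine, not a pandas function; it pairs rows purely by position with `zip` and leaves both inputs untouched.

```python
import pandas as pd

def combine_by_position(left, right, column='publications'):
    """Concatenate the lists in `column` row by row, keeping the index of `left`."""
    if len(left) != len(right):
        raise ValueError('both frames must have the same number of rows')
    # zip pairs the rows by position, so the index labels of `right` are ignored.
    combined = [a + b for a, b in zip(left[column], right[column])]
    return pd.DataFrame({column: combined}, index=left.index)

df_1 = pd.DataFrame({'publications': [[34499803], [34499125], [34445802, 7092834]]},
                    index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]},
                    index=['2234', '543', '345'])

df_sum = combine_by_position(df_1, df_2)
```

Because `a + b` builds a new list for every row, the lists stored in df_1 and df_2 are not changed.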

Upvotes: 0

Theofilos Papapanagiotou

Reputation: 5599

I've managed to reproduce your problem by adding a single value as a float:

>>> df_1 = pd.DataFrame({'publications':[[34499803], float(34499125), [34445802, 7092834]]})
>>> df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]})
>>> df_1+df_2
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'float' and 'list'

If this is the case, it can be solved by wrapping the single values in lists:

>>> df_1["publications"]=df_1["publications"].apply(lambda x: [x] if isinstance(x, float) else x)
>>> df_1+df_2
                     publications
0  [34499803, 65499803, 56899232]
1          [34499125.0, 78999821]
2   [34445802, 7092834, 87499234]
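If the column may also contain missing values, the same idea can be extended a little. This is only a sketch: `as_list` is a hypothetical helper of mine, and treating NaN as an empty list is an assumption that may or may not suit your data.

```python
import pandas as pd

def as_list(value):
    """Return lists unchanged, turn NaN into [], and wrap any other scalar in a list."""
    if isinstance(value, list):
        return value
    if pd.isna(value):
        return []  # assumption: a missing cell means "no publications"
    return [value]

s_1 = pd.Series([[34499803], float('nan'), 34499125.0])
s_2 = pd.Series([[65499803, 56899232], [78999821], [87499234]])

# Normalise every cell to a list first, then concatenate row by row.
cleaned = s_1.apply(as_list)
combined = [a + b for a, b in zip(cleaned, s_2)]
```

Scalars keep their original type here, so a float id stays a float inside the resulting list, just as in the output above.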

Upvotes: 0
