orange

Reputation: 8090

Combining Series in Pandas

I need to combine multiple Pandas Series that contain string values. The Series are messages produced by several validation steps. I am trying to combine these messages into one Series so that I can attach it to the DataFrame. The problem is that the result contains only NaN values.

This is an example:

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index

series = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series += df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)

print(series)
# >>> series
# 0    NaN
# 1    NaN

Update

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index

series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1)

# series3 causes a ValueError: cannot reindex from a duplicate axis
series = pd.concat([series1, series2, series3])
df['series'] = series
print(df)

Update2

In this example the indices seem to get mixed up.

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'a'].index
index2 = df[df['a'] == 'b'].index
index3 = df[df['a'] == 'c'].index

series1 = df.iloc[index1].apply(lambda x: x['a'] + '-aaa', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-bbb', axis=1)
series3 = df.iloc[index3].apply(lambda x: x['a'] + '-ccc', axis=1)

print(series1)
print()
print(series2)
print()
print(series3)
print()

df['series'] = pd.concat([series1, series2, series3], ignore_index=True)
print(df)
print()

df['series'] = pd.concat([series2, series1, series3], ignore_index=True)
print(df)
print()

df['series'] = pd.concat([series3, series2, series1], ignore_index=True)
print(df)
print()

This results in this output:

0    a-aaa
dtype: object

1    b-bbb
dtype: object

2    c-ccc
dtype: object

   a   b series
0  a  aa  a-aaa
1  b  bb  b-bbb
2  c  cc  c-ccc
3  d  dd    NaN

   a   b series
0  a  aa  b-bbb
1  b  bb  a-aaa
2  c  cc  c-ccc
3  d  dd    NaN

   a   b series
0  a  aa  c-ccc
1  b  bb  b-bbb
2  c  cc  a-aaa
3  d  dd    NaN

I would expect only a's in row0, only b's in row1 and only c's in row2, but that's not the case...

Update 3

Here's a better example which should demonstrate the expected behaviour. As I said, the use case is that for a given DataFrame, a function evaluates each row and possibly returns an error message for some of the rows as a Series (some index labels are present, some are not; if there are no errors, the error Series is empty).

In [12]:

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[])
pd.concat([s1, s2, s3, s4]).sort_index()

# I'd like to get:
#
# 0    a
# 1    b b
# 2    c
# 3    d
# 4    e
Out[12]:
0    a
1    b
1    b
2    c
3    d
4    e
dtype: object

Upvotes: 2

Views: 6485

Answers (3)

orange

Reputation: 8090

I might have found a solution. I hope someone can comment on it...

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[])
pd.concat([s1, s2, s3, s4]).sort_index()


df1 = pd.DataFrame(s1)
df2 = pd.DataFrame(s2)
df3 = pd.DataFrame(s3)
df4 = pd.DataFrame(s4)

d = pd.DataFrame({0:[]})
d = pd.merge(df1, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])

d = pd.merge(df2, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])

d = pd.merge(df3, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])

d = pd.merge(df4, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])
print(d)

which returns

    0
0   a
1  bb
2   c
3   d
4   e
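
A more compact way to get the same kind of result might be to concatenate the Series and then group on the index, joining the strings that share a label. This is only a sketch: `combine_messages` and its `sep` parameter are hypothetical names, not part of pandas, and it assumes every value is a string. Empty Series are dropped before concatenating so they cannot affect the result's dtype.

```python
import pandas as pd


def combine_messages(parts, sep=" "):
    """Join string Series on their index labels (hypothetical helper)."""
    # Skip empty Series so they do not influence the dtype of the result.
    non_empty = [s for s in parts if not s.empty]
    if not non_empty:
        return pd.Series([], dtype=object)
    # Keep the original labels, so duplicates mark rows with several messages.
    stacked = pd.concat(non_empty)
    # Group on the index and join the strings that share a label.
    return stacked.groupby(level=0).agg(sep.join)


s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[], dtype=object)

combined = combine_messages([s1, s2, s3, s4])
print(combined)
```

Because the combined Series has one entry per label, it can be assigned to a DataFrame column directly; rows without a message end up as NaN. Passing `sep=""` gives `bb` for label 1, as in the output above.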

Upvotes: 0

EdChum

Reputation: 394169

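When concatenating, the default is to keep the existing indices. If the index labels collide, assigning the concatenated Series to a DataFrame column raises the ValueError you found, because the Series cannot be reindexed to the DataFrame's index. To avoid the error you can set ignore_index=True: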

In [33]:

series = pd.concat([series1, series2, series3], ignore_index=True)
df['series'] = series
print (df)
   a   b  series
0  a  aa  bb-bbb
1  b  bb   a-aaa
2  c  cc   a-ccc
3  d  dd     NaN
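
To make the cause of the error easier to see, here is a small sketch with hand-built stand-ins for series1, series2 and series3 (the literal values are taken from the output above). It assumes a reasonably recent pandas, where the exact wording of the error message may differ from the one quoted in the question.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

# Stand-ins for series1, series2 and series3: label 0 occurs twice.
s1 = pd.Series(['bb-bbb'], index=[1])
s2 = pd.Series(['a-aaa'], index=[0])
s3 = pd.Series(['a-ccc'], index=[0])

# Concatenating keeps the labels, including the duplicate 0.
stacked = pd.concat([s1, s2, s3])
print(stacked.index.is_unique)

# The assignment has to align the Series with df's index, which fails
# when a label is ambiguous.
try:
    df['series'] = stacked
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# ignore_index=True throws the labels away and renumbers from 0, so the
# values are matched to rows by position in the list, not by origin.
renumbered = pd.concat([s1, s2, s3], ignore_index=True)
print(list(renumbered.index))
```

The renumbering is also why the order of the list passed to concat decides which row each message lands in.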

EDIT

I think I know what you want now. You can achieve it by converting the series into a dataframe and then merging on the indices:

In [96]:

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index

series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1)
# we now don't ignore the index in order to preserve the identity of the row we want to merge back to later
series = pd.concat([series1, series2, series3])
# construct a dataframe from the series and give the column a name
df1 = pd.DataFrame({'series':series})
# perform an outer merge on both df's indices
df.merge(df1, left_index=True, right_index=True, how='outer')

Out[96]:
   a   b  series
0  a  aa   a-aaa
0  a  aa   a-ccc
1  b  bb  bb-bbb
2  c  cc     NaN
3  d  dd     NaN
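
As a possible shorthand for the index merge, a named Series can also be joined onto the DataFrame with `DataFrame.join`, which joins on the index and defaults to a left join. The sketch below uses a hand-built Series as a stand-in for the concatenated messages; it is an alternative spelling under that assumption, not a different technique.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

# Stand-in for the concatenated messages; label 0 appears twice.
# The Series needs a name, which becomes the new column name.
series = pd.Series(['bb-bbb', 'a-aaa', 'a-ccc'], index=[1, 0, 0], name='series')

# Left join on the index: every row of df is kept, and a row is repeated
# once for each message that shares its label.
joined = df.join(series)
print(joined)
```

Rows without a message get NaN in the new column, and the row with two messages shows up twice, as in the merge output.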

Upvotes: 2

Ankush Shah

Reputation: 958

How about concat?

s1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
s2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)


s = pd.concat([s1,s2])
print(s)

1    bb-bbb
0     a-aaa
dtype: object

Upvotes: 0
