Reputation: 8090
I need to combine multiple Pandas Series that contain string values. The series are messages that result from multiple validation steps. I am trying to combine these messages into one Series so that I can attach it to the DataFrame. The problem is that the result is empty (all NaN).
This is an example:
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})
index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index
series = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series += df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
print(series)
# >>> series
# 0 NaN
# 1 NaN
Update
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})
index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index
series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1)
# series3 causes a ValueError: cannot reindex from a duplicate axis
series = pd.concat([series1, series2, series3])
df['series'] = series
print(df)
Update 2
In this example the indices seem to get mixed up.
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})
index1 = df[df['a'] == 'a'].index
index2 = df[df['a'] == 'b'].index
index3 = df[df['a'] == 'c'].index
series1 = df.iloc[index1].apply(lambda x: x['a'] + '-aaa', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-bbb', axis=1)
series3 = df.iloc[index3].apply(lambda x: x['a'] + '-ccc', axis=1)
print(series1)
print()
print(series2)
print()
print(series3)
print()
df['series'] = pd.concat([series1, series2, series3], ignore_index=True)
print(df)
print()
df['series'] = pd.concat([series2, series1, series3], ignore_index=True)
print(df)
print()
df['series'] = pd.concat([series3, series2, series1], ignore_index=True)
print(df)
print()
This results in this output:
0    a-aaa
dtype: object

1    b-bbb
dtype: object

2    c-ccc
dtype: object

   a   b series
0  a  aa  a-aaa
1  b  bb  b-bbb
2  c  cc  c-ccc
3  d  dd    NaN

   a   b series
0  a  aa  b-bbb
1  b  bb  a-aaa
2  c  cc  c-ccc
3  d  dd    NaN

   a   b series
0  a  aa  c-ccc
1  b  bb  b-bbb
2  c  cc  a-aaa
3  d  dd    NaN
I would expect only a's in row 0, only b's in row 1 and only c's in row 2, but that's not the case.
Update 3
Here's a better example which should demonstrate the expected behaviour. As I said, the use case is that, for a given DataFrame, a function evaluates each row and possibly returns an error message for some of the rows as a Series (some indexes are contained, some are not; if there are no errors, the returned Series is empty).
In [12]:
s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[])
pd.concat([s1, s2, s3, s4]).sort_index()
# I'd like to get:
#
# 0 a
# 1 b b
# 2 c
# 3 d
# 4 e
Out[12]:
0 a
1 b
1 b
2 c
3 d
4 e
dtype: object
Upvotes: 2
Views: 6485
Reputation: 8090
I might have found a solution. I hope someone can comment on it...
s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[])
pd.concat([s1, s2, s3, s4]).sort_index()
df1 = pd.DataFrame(s1)
df2 = pd.DataFrame(s2)
df3 = pd.DataFrame(s3)
df4 = pd.DataFrame(s4)
d = pd.DataFrame({0:[]})
d = pd.merge(df1, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])
d = pd.merge(df2, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])
d = pd.merge(df3, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])
d = pd.merge(df4, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])
print(d)
which returns
0
0 a
1 bb
2 c
3 d
4 e
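The repeated merge / fillna / add steps above could probably be folded into a loop. This is only a sketch under a few assumptions: the names `parts`, `labels` and `total` are made up for illustration, the empty Series is given `dtype=object` explicitly, and messages are appended in list order (accumulated + new), which gives the same result for this data.

```python
import pandas as pd

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[], dtype=object)  # a validation step with no errors

parts = [s1, s2, s3, s4]  # hypothetical list of per-step message Series

# Collect every label that appears in any of the series.
labels = sorted(set().union(*(s.index for s in parts)))

# Start from empty strings and append each step's messages label by label.
total = pd.Series('', index=labels)
for s in parts:
    # reindex adds NaN for labels this step did not report; fillna turns them into ''
    total = total + s.reindex(labels).fillna('')

print(total)
```

Labels that appear in more than one series end up with their messages concatenated, as in the merge version.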
Upvotes: 0
Reputation: 394169
When concatenating, the default is to keep the existing indices. If they collide, the result has duplicate labels, and assigning it to the DataFrame raises the ValueError you've found, so you need to set ignore_index=True:
In [33]:
series = pd.concat([series1, series2, series3], ignore_index=True)
df['series'] = series
print(df)
   a   b  series
0  a  aa  bb-bbb
1  b  bb   a-aaa
2  c  cc   a-ccc
3  d  dd     NaN
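To make the difference concrete, here is a small sketch of what happens with and without ignore_index. The three series are written out by hand (an assumption, to keep it short) with the same values and labels as in the question; the helper names `kept`, `renumbered`, `has_duplicates` and `assignment_failed` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})
series1 = pd.Series(['bb-bbb'], index=[1])
series2 = pd.Series(['a-aaa'], index=[0])
series3 = pd.Series(['a-ccc'], index=[0])

# Plain concat keeps the original labels, so label 0 appears twice.
kept = pd.concat([series1, series2, series3])
has_duplicates = kept.index.has_duplicates

# Assigning a Series with duplicate labels to a column fails while aligning.
try:
    df['series'] = kept
    assignment_failed = False
except ValueError:
    assignment_failed = True

# ignore_index=True relabels the result 0, 1, 2 in concatenation order,
# so the values no longer carry the label of the row they came from.
renumbered = pd.concat([series1, series2, series3], ignore_index=True)
df['series'] = renumbered
print(df)
```

Note that after renumbering, the values land in rows 0, 1, 2 purely by their position in the list passed to concat.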
EDIT
I think I know what you want now. You can achieve it by converting the series into a dataframe and then merging on the indices:
In [96]:
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})
index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index
series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1)
# we now don't ignore the index in order to preserve the identity of the row we want to merge back to later
series = pd.concat([series1, series2, series3])
# construct a dataframe from the series and give the column a name
df1 = pd.DataFrame({'series':series})
# perform an outer merge on both df's indices
df.merge(df1, left_index=True, right_index=True, how='outer')
Out[96]:
   a   b  series
0  a  aa   a-aaa
0  a  aa   a-ccc
1  b  bb  bb-bbb
2  c  cc     NaN
3  d  dd     NaN
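If you would rather have one row per original index, with all messages for that row joined into a single string, one possible approach is to concatenate with the labels kept and then group on the index. This is a sketch rather than the only way to do it; it assumes a space as separator, skips empty series before concatenating, and uses the made-up names `parts`, `stacked` and `combined`.

```python
import pandas as pd

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[], dtype=object)  # no errors from this step

# Drop empty series so they do not influence the resulting index dtype.
parts = [s for s in (s1, s2, s3, s4) if not s.empty]

# Keep the original labels, then join all messages that share a label.
stacked = pd.concat(parts)
combined = stacked.groupby(level=0).agg(' '.join)

print(combined)
```

Because `combined` has unique labels, it can then be assigned to a column of the original DataFrame and will line up by index.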
Upvotes: 2
Reputation: 958
How about concat?
s1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
s2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
s = pd.concat([s1,s2])
print(s)
1    bb-bbb
0     a-aaa
dtype: object
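Since the labels are kept (1 and 0 here), assigning the result to a column should line the values up with the right rows by label rather than by position. A self-contained sketch, assuming the same df as in the question and using `.loc` with the index labels:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})
index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index
s1 = df.loc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
s2 = df.loc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)

s = pd.concat([s1, s2])  # labels stay 1 and 0, in that order
df['series'] = s         # assignment aligns on the label, not on the position
print(df)
```

This only works as long as each label appears at most once across the series being concatenated.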
Upvotes: 0