Reputation: 21
I'm trying to use pandas to chain together map and filter operations. I've come across several options, partly outlined here: Pandas How to filter a Series
To summarize,
from pandas import Series

s = Series(range(10))
s.where(s > 4).dropna()
s.where(lambda x: x > 4).dropna()
s.loc[s > 4]
s.loc[lambda x: x > 4]
s.to_frame(name='x').query("x > 4")
This is fine for numerical comparisons and equality checks, but it doesn't work for predicates involving other operations. For a simple example, consider matching against the first character of a string.
s = Series(['aa', 'ab', 'ba'])
s.loc[lambda x: x.startswith('a')] # fails
This fails with a message like "'Series' object has no attribute 'startswith'", since the argument x passed to the lambda expression in the second line is the Series itself, rather than the individual elements it contains.
Interestingly, map does allow element-wise access:
Series(list('abcd')).map(lambda x: x.upper())
# results in ['A', 'B', 'C', 'D'] even though Series has no upper method
While there are probably clever ways to handle the startswith example, I'm hoping to find a more general solution where a series can be filtered using a function that accepts individual values from the collection. And ideally it would allow chaining operations together, as in:
s = (Series(...)
.map(...)
.where(...)
.map(...))
Is that supported in pandas?
UPDATE:
Scott provided the answer for cases where the values are strings, which can be handled with the Series.str accessor as described in his answer.
But what about cases with a Series containing objects? Is there any way to access their attributes or apply functions to them?
I guess a standard way of managing that case would be to de-structure the relevant fields of the object into a data frame, where each attribute is a column. But there might be cases where someone wants to transform a collection of objects with map and filter (loc/where) without having to disassemble the complex type into a dataframe and then immediately convert back.
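To make the object case concrete, this is roughly the kind of chain I have in mind, though I don't know whether it's the idiomatic way (Person here is just a made-up example class):
from dataclasses import dataclass
import pandas as pd

@dataclass
class Person:  # made-up example type
    name: str
    age: int

people = pd.Series([Person('Ann', 30), Person('Bob', 17), Person('Cara', 45)])

# filter on an attribute by building a boolean mask element-wise,
# then keep chaining map on the filtered series
adults = (people
          .loc[people.map(lambda p: p.age >= 18)]
          .map(lambda p: p.name.upper()))
# adults: ['ANN', 'CARA']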
I'm partly trying to find an alternative to the standard map()/filter() functions in Python, where the operations have to be nested in reverse order, i.e.,
map(function3, filter(function2, map(function1, collection)))
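For instance, with some made-up steps, the built-in version reads inside-out (the last step written is the first one applied):
collection = [' aa ', ' ab', 'ba ']
result = list(map(str.upper,
                  filter(lambda x: x.startswith('a'),
                         map(str.strip, collection))))
# result: ['AA', 'AB']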
Upvotes: 2
Views: 5673
Reputation: 1320
s = Series(['aa', 'ab', 'ba'])
s.loc[lambda x: x.startswith('a')] # fails
This fails because .loc[] ultimately expects a Series/array of True/False values, which the supplied lambda fails to provide: when given a callable, .loc[] calls it with the whole Series, and Series has no startswith method. An easy solution that works in the general case is to first use .map() to apply the condition to each element, and then supply the resulting boolean array to .loc[], like this:
s = Series(['aa', 'ab', 'ba'])
s.loc[s.map(lambda x: x.startswith('a'))]
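The same trick chains, so a pipeline like the one in the question should be possible; here's a rough sketch with placeholder steps:
s = Series(['aa', 'ab', 'ba'])
result = (s
          .map(lambda x: x.upper())                           # map
          .loc[lambda x: x.map(lambda v: v.startswith('A'))]  # filter via a boolean mask
          .map(lambda x: x + '!'))                            # map again
# result: ['AA!', 'AB!']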
Upvotes: 0
Reputation: 153510
Use the .str string accessor, which is needed for string operations on a pandas Series.
s = Series(['aa', 'ab', 'ba'])
s.loc[lambda x: x.str.startswith('a')]
When you are using map, you are applying the string function to each element, so you don't need the string accessor.
And to @piRSquared's point in the comments, you don't need lambda at all; you can use boolean indexing.
import pandas as pd

s = pd.Series(['aa', 'ab', 'ba'])
s.loc[s.str.startswith('a')]
s.str.startswith returns a True/False boolean Series which, when placed in brackets for a Series, returns only those values that align with True.
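To make the intermediate step explicit:
mask = s.str.startswith('a')
# mask:
# 0     True
# 1     True
# 2    False
# dtype: bool
s[mask]  # keeps only the values aligned with True -> ['aa', 'ab']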
Upvotes: 2