Reputation: 12822
For object data I can map two columns into a third (object) column of tuples:
>>> import pandas as pd
>>> df = pd.DataFrame([["A","b"], ["A", "a"],["B","b"]])
>>> df
   0  1
0  A  b
1  A  a
2  B  b
>>> df.apply(lambda row: (row[0], row[1]), axis=1)
0    (A, b)
1    (A, a)
2    (B, b)
dtype: object
(see also Pandas: How to use apply function to multiple columns).
However, when I try to do the same thing with numerical columns:
>>> df2 = pd.DataFrame([[10,2], [10, 1],[20,2]])
>>> df2.apply(lambda row: (row[0], row[1]), axis=1)
    0  1
0  10  2
1  10  1
2  20  2
so instead of a series of pairs (i.e. [(10, 2), (10, 1), (20, 2)]) I get a DataFrame.
How can I force pandas to actually return a series of pairs? (Preferably in a nicer way than converting to strings and parsing them back.)
Upvotes: 2
Views: 646
Reputation: 375745
I don't recommend this, but you can force it:
In [11]: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)
Out[11]:
         0
0  (10, 2)
1  (10, 1)
2  (20, 2)
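If you really do need a Series of pairs, a cheaper construction (a sketch of mine, not from the original answer, assuming the same df2) is to zip the two columns once instead of building a throwaway Series per row:

import pandas as pd

df2 = pd.DataFrame([[10, 2], [10, 1], [20, 2]])

# zip walks both columns in a single pass; no per-row Series objects are created
pairs = pd.Series(list(zip(df2[0], df2[1])), index=df2.index)
print(pairs)
# 0    (10, 2)
# 1    (10, 1)
# 2    (20, 2)
# dtype: object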
That said, keeping the data as two columns will give you much better performance, flexibility, and ease of later analysis.
What the OP actually wanted was to count the occurrences of each [0, 1] pair. With a Series of tuples they could use the value_counts method (on the column from the result above). However, the same counts can be obtained with groupby, which turned out to be 300 times faster (for the OP):
df2.groupby([0, 1]).size()
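For concreteness, here is a minimal sketch (assuming the df2 above) of both counting routes; note that groupby returns a Series with a MultiIndex while value_counts indexes by tuples, but the counts themselves agree:

import pandas as pd

df2 = pd.DataFrame([[10, 2], [10, 1], [20, 2]])

# Route 1: build the tuple column, then count distinct tuples.
tuple_counts = df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)[0].value_counts()

# Route 2: group on both columns directly; no tuples are materialised.
group_counts = df2.groupby([0, 1]).size()
print(group_counts)
# 0   1
# 10  1    1
#     2    1
# 20  2    1
# dtype: int64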
It's worth emphasising (again) that the apply in In [11] has to create a Series object and a tuple instance for every row, which is a huge overhead compared to groupby.
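To see that overhead yourself, a rough timing sketch (sizes are illustrative; the exact speedup will vary with machine and pandas version):

import timeit
import numpy as np
import pandas as pd

# A larger frame so the per-row cost becomes visible.
df2 = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 2)))

apply_time = timeit.timeit(
    lambda: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)[0].value_counts(),
    number=1,
)
groupby_time = timeit.timeit(lambda: df2.groupby([0, 1]).size(), number=1)
print(f"apply + value_counts: {apply_time:.3f}s  groupby: {groupby_time:.3f}s")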
Upvotes: 4