Reputation: 12822
For object data I can map two columns into a third (object) column of tuples:
>>> import pandas as pd
>>> df = pd.DataFrame([["A","b"], ["A", "a"],["B","b"]])
>>> df
   0  1
0  A  b
1  A  a
2  B  b
>>> df.apply(lambda row: (row[0], row[1]), axis=1)
0    (A, b)
1    (A, a)
2    (B, b)
dtype: object
(see also Pandas: How to use apply function to multiple columns).
However, when I try to do the same thing with numerical columns:
>>> df2 = pd.DataFrame([[10,2], [10, 1],[20,2]])
>>> df2.apply(lambda row: (row[0], row[1]), axis=1)
    0  1
0  10  2
1  10  1
2  20  2
so instead of a series of pairs (i.e. [(10, 2), (10, 1), (20, 2)]) I get a DataFrame.
How can I force pandas to actually return a series of pairs? (Preferably in a nicer way than converting to strings and parsing them back.)
Upvotes: 2
Views: 646
Reputation: 375745
I don't recommend this, but you can force it:
In [11]: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)
Out[11]:
         0
0  (10, 2)
1  (10, 1)
2  (20, 2)
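If you really do need a Series of pairs, a cheaper construction (a sketch of mine, not from the original answer, assuming the same df2) is to zip the two columns once instead of building a throwaway Series per row:

import pandas as pd

df2 = pd.DataFrame([[10, 2], [10, 1], [20, 2]])

# zip walks both columns in a single pass; no per-row Series objects are created
pairs = pd.Series(list(zip(df2[0], df2[1])), index=df2.index)
print(pairs)
# 0    (10, 2)
# 1    (10, 1)
# 2    (20, 2)
# dtype: object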
That said, keeping the data as two columns will give you much better performance, flexibility, and ease of later analysis.
What the OP actually wanted was to count the occurrences of each [0, 1] pair. With a Series of tuples they could use the value_counts method (on the column from the result above). However, the same counts can be obtained with groupby, which turned out to be 300 times faster (for the OP):
df2.groupby([0, 1]).size()
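For concreteness, here is a minimal sketch (assuming the df2 above) of both counting routes; note that groupby returns a Series with a MultiIndex while value_counts indexes by tuples, but the counts themselves agree:

import pandas as pd

df2 = pd.DataFrame([[10, 2], [10, 1], [20, 2]])

# Route 1: build the tuple column, then count distinct tuples.
tuple_counts = df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)[0].value_counts()

# Route 2: group on both columns directly; no tuples are materialised.
group_counts = df2.groupby([0, 1]).size()
print(group_counts)
# 0   1
# 10  1    1
#     2    1
# 20  2    1
# dtype: int64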
It's worth emphasising (again) that the apply in In [11] has to create a Series object and a tuple instance for every row, which is a huge overhead compared to groupby.
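To see that overhead yourself, a rough timing sketch (sizes are illustrative; the exact speedup will vary with machine and pandas version):

import timeit
import numpy as np
import pandas as pd

# A larger frame so the per-row cost becomes visible.
df2 = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 2)))

apply_time = timeit.timeit(
    lambda: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)[0].value_counts(),
    number=1,
)
groupby_time = timeit.timeit(lambda: df2.groupby([0, 1]).size(), number=1)
print(f"apply + value_counts: {apply_time:.3f}s  groupby: {groupby_time:.3f}s")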
Upvotes: 4