Reputation: 452
I have a pandas DataFrame that contains some array columns. What is the recommended way to index some of these columns by different position indices? For example, from the array column named l
I need the second elements, from the array column named a
I need the first elements. The results should be a new DataFrame. Array column can either contain a Python list or a Numpy array, but this probably does not matter.
I have three solutions, but I don't really like any of them.
df= pd.DataFrame({'l': [[1, 2, 4], [3, 2, 0, 10]], \
'a':[np.array(["foo", "bar", "baz"]), np.array(["qux", "quux"])], \
'dontcare': [10, 20]})
l a dontcare
0 [1, 2, 4] [foo, bar, baz] 10
1 [3, 2, 0, 10] [qux, quux] 20
Solution 1, with str
and join
df['l'].str[1].to_frame('l').join(df['a'].str[0])
l a
0 2 foo
1 2 qux
Solution 2, with the function apply
and creating Series
df.apply(lambda row: pd.Series([row['l'][1], row['a'][0]], index=['l', 'a']), axis=1)
Solution 3, with apply
and broadcast
.
df[['l', 'a']].apply(lambda row: [row['l'][1], row['a'][0]], axis=1, result_type='broadcast')
We can assume that the output column names match the input column names and we dont need multiple elements of any array column.
Upvotes: 1
Views: 3042
Reputation: 862441
I think it depends.
First solution is most general, working always if indices not exist - then returned NaN
s. But it is also reason for slowiest solution if large DataFrame
.
print (df['l'].str[3].to_frame('l').join(df['a'].str[2]))
l a
0 NaN baz
1 10.0 NaN
Another solution with apply
should be faster, but fail, if value not exist.
print (df.apply(lambda row: pd.Series([row['l'][3], row['a'][2]], index=['l', 'a']), axis=1))
IndexError: ('list index out of range', 'occurred at index 0')
Another idea if always exist values in lists is use list comprehension (but fail similar like apply, if not exist) with *c
for tail:
df= pd.DataFrame({'l': [[1, 2, 4], [3, 2, 0, 10]], \
'a':[np.array(["foo", "bar", "baz"]), np.array(["qux", "quux"])], \
'dontcare': [10, 20],
's': [10, 20],
'b': [10, 20]})
print (df)
l a dontcare s b
0 [1, 2, 4] [foo, bar, baz] 10 10 10
1 [3, 2, 0, 10] [qux, quux] 20 20 20
df = pd.DataFrame([(a[1], b[0]) for a,b, *c in df.values], columns=['l', 'a'])
print (df)
l a
0 2 foo
1 2 qux
Or select column by list
for processing:
df = pd.DataFrame([(a[1], b[0]) for a,b in df[['l','a']].values], columns=['l', 'a'])
Upvotes: 1