Reputation: 64014
I have the following data frame:
import pandas as pd
import io
temp=u"""probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(temp))
df
Which looks like this:
Out[23]:
probe genes sample1
0 1415777_at Pnliprp1 +OX(M6) 20
1 1415777_at Pllk +C6 20
2 1415884_at Cela3b 47
What I want to do is in gene
column remove every characters after whitespace
in every row. So that it looks like:
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
How can it be achieved?
Upvotes: 2
Views: 1165
Reputation: 862701
The fastest solution is use list
comprehension with Series
constructor:
print pd.Series([ x.split()[0] for x in df['genes'].tolist() ])
0 Pnliprp1
1 Pllk
2 Cela3b
dtype: object
Timings len(df)=3k
:
df = pd.concat([df]*1000).reset_index(drop=True)
In [21]: %timeit pd.Series([ x.split()[0] for x in df['genes'].tolist() ])
1000 loops, best of 3: 946 µs per loop
In [22]: %timeit df['genes'].map(lambda x: x.split()[0])
1000 loops, best of 3: 1.27 ms per loop
In [23]: %timeit df['genes'].str.extract('(\w*)\s*', expand=False)
The slowest run took 4.31 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 5.08 ms per loop
In [24]: %timeit df["genes"].str.split().str[0]
100 loops, best of 3: 2.52 ms per loop
Explanation:
split()[0]
works faster, but if in column genes
are NaN
values, failed.
I think the safer
is DSM
solution, because works with NaN
very nice.
Upvotes: 2
Reputation: 31672
You could use str.extract
for that to capture first group before space:
In [26]: df['genes'].str.extract('(\w*)\s*', expand=False)
Out[26]:
0 Pnliprp1
1 Pllk
2 Cela3b
Name: genes, dtype: object
df['genes'] = df['genes'].str.extract('(\w*)\s*', expand=False)
In [29]: df
Out[29]:
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
Timing:
In [35]: %timeit df["genes"].str.split().str[0]
1000 loops, best of 3: 586 us per loop
In [36]: %timeit df['genes'].map(lambda x: x.split()[0])
10000 loops, best of 3: 153 us per loop
In [37]: %timeit df['genes'].str.extract('(\w*)\s*', expand=False)
1000 loops, best of 3: 173 us per loop
Upvotes: 1
Reputation: 353119
I'd use the vectorized str
functions:
>>> df["genes"] = df["genes"].str.split().str[0]
>>> df
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
Upvotes: 4
Reputation: 19831
Use split():
>>> df['genes'] = df['genes'].map(lambda x: x.split()[0])
>>> df
probe genes sample1
0 1415777_at Pnliprp1 20
1 1415777_at Pllk 20
2 1415884_at Cela3b 47
Upvotes: 3