neversaint
neversaint

Reputation: 64014

Remove characters after whitespace for every row in a Pandas column

I have the following data frame:

import pandas as pd
import io
temp=u"""probe,genes,sample1
1415777_at,Pnliprp1 +OX(M6),20
1415777_at,Pllk +C6,20
1415884_at,Cela3b,47"""
df = pd.read_csv(io.StringIO(temp))
df

Which looks like this:

Out[23]:
        probe             genes  sample1
0  1415777_at  Pnliprp1 +OX(M6)       20
1  1415777_at          Pllk +C6       20
2  1415884_at            Cela3b       47

What I want to do is in gene column remove every characters after whitespace in every row. So that it looks like:

        probe             genes  sample1
0  1415777_at           Pnliprp1      20
1  1415777_at               Pllk      20
2  1415884_at             Cela3b      47

How can it be achieved?

Upvotes: 2

Views: 1165

Answers (4)

jezrael
jezrael

Reputation: 862701

The fastest solution is use list comprehension with Series constructor:

print pd.Series([ x.split()[0] for x in df['genes'].tolist() ])
0    Pnliprp1
1        Pllk
2      Cela3b
dtype: object

Timings len(df)=3k:

df = pd.concat([df]*1000).reset_index(drop=True)

In [21]: %timeit pd.Series([ x.split()[0] for x in df['genes'].tolist() ])
1000 loops, best of 3: 946 µs per loop

In [22]: %timeit df['genes'].map(lambda x: x.split()[0])
1000 loops, best of 3: 1.27 ms per loop

In [23]: %timeit df['genes'].str.extract('(\w*)\s*', expand=False)
The slowest run took 4.31 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 5.08 ms per loop

In [24]: %timeit df["genes"].str.split().str[0]
100 loops, best of 3: 2.52 ms per loop

Explanation:

split()[0] works faster, but if in column genes are NaN values, failed.

I think the safer is DSM solution, because works with NaN very nice.

Upvotes: 2

Anton Protopopov
Anton Protopopov

Reputation: 31672

You could use str.extract for that to capture first group before space:

In [26]: df['genes'].str.extract('(\w*)\s*', expand=False)
Out[26]:
0    Pnliprp1
1        Pllk
2      Cela3b
Name: genes, dtype: object


df['genes'] = df['genes'].str.extract('(\w*)\s*', expand=False)
In [29]: df
Out[29]:
        probe     genes  sample1
0  1415777_at  Pnliprp1       20
1  1415777_at      Pllk       20
2  1415884_at    Cela3b       47

Timing:

In [35]: %timeit df["genes"].str.split().str[0]
1000 loops, best of 3: 586 us per loop

In [36]: %timeit df['genes'].map(lambda x: x.split()[0])
10000 loops, best of 3: 153 us per loop

In [37]: %timeit df['genes'].str.extract('(\w*)\s*', expand=False)
1000 loops, best of 3: 173 us per loop

Upvotes: 1

DSM
DSM

Reputation: 353119

I'd use the vectorized str functions:

>>> df["genes"] = df["genes"].str.split().str[0]
>>> df
        probe     genes  sample1
0  1415777_at  Pnliprp1       20
1  1415777_at      Pllk       20
2  1415884_at    Cela3b       47

Upvotes: 4

AKS
AKS

Reputation: 19831

Use split():

>>> df['genes'] = df['genes'].map(lambda x: x.split()[0])
>>> df
        probe     genes  sample1
0  1415777_at  Pnliprp1       20
1  1415777_at      Pllk       20
2  1415884_at    Cela3b       47

Upvotes: 3

Related Questions