Reputation: 67968
Consider this code:
from io import StringIO
import pandas as pd
txt = """a, RR
10, 1asas
20, 1asasas
30,
40, asas
50, ayty
60, 2asas
80, 3asas"""
frame = pd.read_csv(StringIO(txt), skipinitialspace=True)
print(frame, "\n\n\n")
l = []
# iterate over the RR values that do not start with "1" (NaN counts as "1" via na=True)
for i, j in frame[~frame['RR'].str.startswith("1", na=True)]['RR'].items():
    if j.startswith(('2', '3')):
        # drop if "1" + RR[1:] already exists among the "1..." rows
        if frame[frame['RR'].str.startswith("1", na=False)]['RR'].str.match("1" + j[1:]).any():
            l.append(i)
    else:
        # drop if "1" + RR already exists among the "1..." rows
        if frame[frame['RR'].str.startswith("1", na=False)]['RR'].str.match("1" + j).any():
            l.append(i)
frame = frame.drop(frame.index[l])
print(frame)
What I am doing here is:
- Loop through the dataframe and drop any RR for which "1" + RR already exists in the dataframe.
- If RR starts with 2 or 3, drop it if "1" + RR[1:] exists in the dataframe.
- If RR starts with 1 or is NaN, do not touch it.
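Concretely, on the sample data above I expect (writing the rules out by hand, not pasting program output):

40, asas   -> dropped, because 1asas exists
60, 2asas  -> dropped, because 1asas ("1" + "asas") exists
80, 3asas  -> dropped, because 1asas exists
50, ayty   -> kept, because there is no 1ayty
10, 20, 30 -> kept, because RR starts with 1 or is NaN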
The code works fine, but this dataframe will have up to 1 million entries and I don't think this code is optimised. As I have just started with pandas, my knowledge is limited. Is there any way to achieve this without iteration? Does pandas have any built-in utility for this?
Upvotes: 2
Views: 347
Reputation: 16241
First, keep all strings starting with 1, as well as NaN:
keep = frame['RR'].str.startswith("1", na=True)
keep1 = keep[keep] # will be used at the end
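For the sample frame in the question this should give (my expected output, not a pasted run):

keep1
# 0    True
# 1    True
# 2    True
# Name: RR, dtype: bool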
Second, among the strings starting with 2 or 3, keep those whose remainder (everything after the first character) does not occur among the remainders of the 1... strings rr1:
rr1 = frame.loc[frame['RR'].str.startswith("1", na=False), 'RR']
keep2 = ~frame.loc[
    (frame['RR'].str.startswith("2", na=False)) | (frame['RR'].str.startswith("3", na=False)), 'RR'
].str.slice(1).isin(rr1.str.slice(1))
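On the sample data both 2asas and 3asas share the tail asas with 1asas, so both should be marked False (expected output):

keep2
# 5    False
# 6    False
# Name: RR, dtype: bool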
Third, keep the remaining strings that are not in rr1 after adding a leading 1:
import numpy as np
keep3 = ~("1" + frame.loc[
~frame['RR'].str.slice(0,1).isin([np.nan, "1", "2", "3"]), 'RR'
]).isin(rr1)
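On the sample data this step only looks at asas and ayty; only ayty should survive, since 1ayty does not appear in rr1 (expected output):

keep3
# 3    False
# 4     True
# Name: RR, dtype: bool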
Finally, put everything together:
frame[pd.concat([keep1, keep2, keep3]).sort_index()]
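The concatenated mask has exactly one boolean per row of frame, so after sort_index it can index frame directly. On the sample data the result should match the loop version from the question (expected output, exact column widths may differ):

    a       RR
0  10    1asas
1  20  1asasas
2  30      NaN
4  50     ayty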
Upvotes: 1