Reputation: 95
I have an excel file with 2 columns. I want to remove some parts of the string in column 2 (C2). But the problem is, as the file is huge I don’t know the exact name that I want to remove; however I know which parts I want to keep. The other issue is the part I want to keep is only one letter which they will be in the part I want to remove too. The following is an example:
The original part is like:(C1 and C2 are columns name)
C1 C2
T1 L_1>K>J>P000RTK>P
T2 K>L>L>PY0BDJS
T3 P>P>P000FTKL>L
I need the results like: I only want to keep the part with one letter and I
want to remove the rest.
C1 C2
T1 L_1>K>J>P
T2 K>L>L
T3 P>P>L
Thanks
Upvotes: 1
Views: 794
Reputation: 7690
If you are using pandas DataFrame to read the file, you can use pd.replace() with regex on your DataFrame to remove unwanted cell values.
>> df.replace(">\w{2,}", "", regex=True)
C1 C2
0 T1 L_1>K>J>P
1 T2 K>L>L
2 T3 P>P>L
Disclaimer: There are cases the regex I've used may fail, such as P000RTK>L_1>K>J>P
(thanks @piRSquared for pointing it out). This was given as an example (using values from the question), you need to implement your own regex (the one that fits your needs) when using replace with regex.
Upvotes: 3
Reputation: 294258
According to your condition, you want to keep only those parts that contain one letter. That implies you want to remove things like
'P_K'
: non-contiguous multiple letters'PK_'
: contiguous multiple lettersMy strategy is to split strings by '>'
and filter out those elements whose letter counts exceed 1
f = lambda x: x.str.count('[A-Za-z]') < 2
s = df.C2.str.split('>', expand=True).stack()
df.assign(C2=s.compress(f).groupby(level=0).apply('>'.join))
C1 C2
0 T1 L_1>K>J>P
1 T2 K>L>L
2 T3 P>P>L
Upvotes: 2