Reputation: 605
Trying to split and parse characters from an column and submit the parsed data into different column .
I was trying the same by parsing with _ in the given column data, It was working good until the number of '_' present in the string was fixed to 2.
Input Data:
Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN
Expected Output:
col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93
Code i have been trying :
df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')
df['Col2'] = df.Col1.str.replace(
'.*_', '', regex=True
)
print(df['Col2'])
Upvotes: 1
Views: 494
Reputation: 56
Based on your question, the pandas
DataFrame apply
can be a good solution:
NaN
s with empty string ''
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])
df = df.fillna('')
Col1
0 U_a65839_Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf
2 V_C56478_mar89Apr89
3 Q_d15634_Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf
5 S_e15336_may91Apr93
6
regex
def fun(s):
import re
m = re.search(r'\w{3}\d{2}\w{3}\d{2}', s)
if m:
return m.group(0)
else:
return ''
df['Col2'] = df['Col1'].apply(fun)
Col1 Col2
0 U_a65839_Jan87Apr88 Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf Feb88Apr88
2 V_C56478_mar89Apr89 mar89Apr89
3 Q_d15634_Apr90Apr91 Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf Apr90Apr91
5 S_e15336_may91Apr93 may91Apr93
6
Hope the above helps.
Upvotes: 1
Reputation: 11171
I think you want this:
col2 = df.Col1.str.split("_", expand=True)[2]
output:
0 Jan87Apr88
1 Feb88Apr88
2 mar89Apr89
3 Apr90Apr91
4 Apr90Apr91
5 may91Apr93
6 NaN
(you can dropna
if you don't want the last row)
Upvotes: 3
Reputation: 521289
Use str.extract
here:
df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
Upvotes: 1