Manz
Manz

Reputation: 605

Parse only specific characters from the string using python

Trying to split and parse characters from an column and submit the parsed data into different column .

I was trying the same by parsing with _ in the given column data, It was working good until the number of '_' present in the string was fixed to 2.

Input Data:

        Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN

Expected Output:

  col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93

Code i have been trying :

df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')

df['Col2'] = df.Col1.str.replace(
    '.*_', '', regex=True
)
print(df['Col2'])

Upvotes: 1

Views: 494

Answers (3)

v2sciences
v2sciences

Reputation: 56

Based on your question, the pandas DataFrame apply can be a good solution:

  • First, clean the DataFrame by replacing NaNs with empty string ''
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])

df = df.fillna('')
                               Col1
0               U_a65839_Jan87Apr88
1   U_b98652_Feb88Apr88_(2).jpg.pdf
2               V_C56478_mar89Apr89
3               Q_d15634_Apr90Apr91
4  Q_d15634_Apr90Apr91_(3).jpeg.pdf
5               S_e15336_may91Apr93
6
  • Next, define a function to extract the required string with regex
def fun(s):
    import re
    m = re.search(r'\w{3}\d{2}\w{3}\d{2}', s)
    if m:
        return m.group(0)
    else:
        return ''
  • Then, easily apply the function to DataFrame:
df['Col2'] = df['Col1'].apply(fun)
                               Col1        Col2
0               U_a65839_Jan87Apr88  Jan87Apr88
1   U_b98652_Feb88Apr88_(2).jpg.pdf  Feb88Apr88
2               V_C56478_mar89Apr89  mar89Apr89
3               Q_d15634_Apr90Apr91  Apr90Apr91
4  Q_d15634_Apr90Apr91_(3).jpeg.pdf  Apr90Apr91
5               S_e15336_may91Apr93  may91Apr93
6

Hope the above helps.

Upvotes: 1

anon01
anon01

Reputation: 11171

I think you want this:

col2 = df.Col1.str.split("_", expand=True)[2]

output:

0    Jan87Apr88
1    Feb88Apr88
2    mar89Apr89
3    Apr90Apr91
4    Apr90Apr91
5    may91Apr93
6           NaN

(you can dropna if you don't want the last row)

Upvotes: 3

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521289

Use str.extract here:

df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)

Demo

Upvotes: 1

Related Questions