yangyang
yangyang

Reputation: 531

Pandas: How to extract a string from another string

I have a column that consist of 8000 rows, and I need to create a new column which the value is extracted from the existing column.

the string shows like this:

TP-ETU06-01-525-W-133

and I want to create two new columns from the string where the value of first new column is extracted from the second string which ETU06 and the second one is from the last string which is 133.

I have done this by using:

df["sys_no"] = df.apply(lambda x:x["test_no"].split("-")[1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)

df["package_no"] = df.apply(lambda x:x["test_no"].split("-")[-1] if (pd.notnull(x["test_no"]) and x["test_no"]!="" and len(x["test_no"].split("-"))>0) else None,axis=1)

It actually works fine, but the existing column has random string that doesn't follow the others. So I want to leave empty in the new columns if the random string appears.

How should I change my script?

Thankyou

Upvotes: 1

Views: 84

Answers (2)

s3dev
s3dev

Reputation: 9681

This approach uses regex and named capture groups to find and extract the strings of interest, in just two lines of code.

Benefit of regex over split:

It is true that regex is not required. However, from the standpoint of data validation, using regex helps to prevent 'stray' data from creeping in. Using a 'blind' split() function splits the data on (a character); but what if the source data has changed? The split function is blind to this. Whereas, using regex will help to highlight an issue as the pattern simply won't match. Yes, you may get an error message - but this is a good thing as you'll be alerted to a data format change, providing the opportunity to address the issue, or update the regex pattern.

Additionally, regex provides a robust solution as the pattern matches the entire string, and anything outside of this pattern is ignored - like the example mentioned in the question.

If you'd like some explanation on the regex pattern itself, just add a comment and I'll update the answer to explain.

Sample Data:

                 test_no
0  TP-ETU05-01-525-W-005
1  TP-ETU06-01-525-W-006
2  TP-ETU07-01-525-W-007
3  TP-ETU08-01-525-W-008
4  TP-ETU09-01-525-W-009
5                    NaN
6                    NaN
7             otherstuff

Code:

import re

exp = re.compile(r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-\d{2}-\d{3}-[A-Z]-(?P<package_no>\d{3})$')
df[['sys_no', 'package_no']] = df['test_no'].str.extract(exp, expand=True)

Output:

                 test_no  sys_no package_no
0  TP-ETU05-01-525-W-005   ETU05        005
1  TP-ETU06-01-525-W-006   ETU06        006
2  TP-ETU07-01-525-W-007   ETU07        007
3  TP-ETU08-01-525-W-008   ETU08        008
4  TP-ETU09-01-525-W-009   ETU09        009
5                    NaN     NaN        NaN
6                    NaN     NaN        NaN
7             otherstuff     NaN        NaN

Upvotes: 1

jezrael
jezrael

Reputation: 862541

Use Series.str.contains for mask, then split values by Series.str.split and select secnd and last value by indexing only filtered rows by mask:

print (df)
                 test_no
0              temp data
1                    NaN
2  TP-ETU06-01-525-W-133

mask = df["test_no"].str.contains('-', na=False)
splitted = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = splitted[mask].str[1]
df.loc[mask, "package_no"] = splitted[mask].str[-1]
print (df)
                 test_no sys_no package_no
0              temp data    NaN        NaN
1                    NaN    NaN        NaN
2  TP-ETU06-01-525-W-133  ETU06        133

    

Upvotes: 1

Related Questions