Reputation: 531
I have a column that consists of 8000 rows, and I need to create a new column whose value is extracted from the existing column.
The strings look like this:
TP-ETU06-01-525-W-133
and I want to create two new columns from this string: the first new column takes the second segment, which is ETU06, and the second takes the last segment, which is 133.
I have done this by using:
df["sys_no"] = df.apply(
    lambda x: x["test_no"].split("-")[1]
    if (pd.notnull(x["test_no"]) and x["test_no"] != "" and len(x["test_no"].split("-")) > 0)
    else None,
    axis=1,
)
df["package_no"] = df.apply(
    lambda x: x["test_no"].split("-")[-1]
    if (pd.notnull(x["test_no"]) and x["test_no"] != "" and len(x["test_no"].split("-")) > 0)
    else None,
    axis=1,
)
It actually works fine, but the existing column contains occasional random strings that don't follow the pattern of the others. I want the new columns to be left empty whenever such a random string appears.
How should I change my script?
Thank you
Upvotes: 1
Views: 84
Reputation: 9681
This approach uses regex and named capture groups to find and extract the strings of interest in just two lines of code.
A note on split(): it is true that regex is not required. However, from the standpoint of data validation, using regex helps to prevent 'stray' data from creeping in. A 'blind' split() simply divides the data on a character; but what if the source data changes? The split function is blind to this, whereas regex will highlight an issue because the pattern simply won't match. Yes, you may get unmatched rows or an error message, but this is a good thing: you'll be alerted to a data-format change, giving you the opportunity to address the issue or update the regex pattern.
Additionally, regex provides a robust solution as the pattern matches the entire string, and anything outside of this pattern is ignored - like the example mentioned in the question.
If you'd like some explanation on the regex pattern itself, just add a comment and I'll update the answer to explain.
test_no
0 TP-ETU05-01-525-W-005
1 TP-ETU06-01-525-W-006
2 TP-ETU07-01-525-W-007
3 TP-ETU08-01-525-W-008
4 TP-ETU09-01-525-W-009
5 NaN
6 NaN
7 otherstuff
import re

# Anchored pattern: matches the whole string, with named groups
# capturing the two segments of interest
exp = re.compile(r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-\d{2}-\d{3}-[A-Z]-(?P<package_no>\d{3})$')

# Non-matching rows (NaN or stray strings) come back as NaN
df[['sys_no', 'package_no']] = df['test_no'].str.extract(exp, expand=True)
test_no sys_no package_no
0 TP-ETU05-01-525-W-005 ETU05 005
1 TP-ETU06-01-525-W-006 ETU06 006
2 TP-ETU07-01-525-W-007 ETU07 007
3 TP-ETU08-01-525-W-008 ETU08 008
4 TP-ETU09-01-525-W-009 ETU09 009
5 NaN NaN NaN
6 NaN NaN NaN
7 otherstuff NaN NaN
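The validation argument above can be sketched concretely. The sample below (hypothetical data, not from the question) contrasts a blind split() with the anchored pattern: the split silently yields a wrong value for a malformed row, while str.extract flags it as NaN.

```python
import re

import pandas as pd

# Hypothetical sample: one valid code plus a stray value that contains a dash
df = pd.DataFrame({"test_no": ["TP-ETU06-01-525-W-133", "bad-data"]})

# A blind split silently produces a wrong "sys_no" for the stray row
naive = df["test_no"].str.split("-").str[1]
print(naive.tolist())  # ['ETU06', 'data'] -- 'data' slips through unnoticed

# The anchored pattern only matches fully conforming strings
exp = re.compile(
    r'^[A-Z]{2}-(?P<sys_no>[A-Z]{3}\d{2})-\d{2}-\d{3}-[A-Z]-(?P<package_no>\d{3})$'
)
strict = df["test_no"].str.extract(exp)
print(strict)  # the stray row comes back as NaN/NaN instead of a bogus value
```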
Upvotes: 1
Reputation: 862541
Use Series.str.contains
to build a mask, then split the values with Series.str.split
and select the second and last elements, assigning only the rows filtered by the mask:
print (df)
test_no
0 temp data
1 NaN
2 TP-ETU06-01-525-W-133
mask = df["test_no"].str.contains('-', na=False)
splitted = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = splitted[mask].str[1]
df.loc[mask, "package_no"] = splitted[mask].str[-1]
print (df)
test_no sys_no package_no
0 temp data NaN NaN
1 NaN NaN NaN
2 TP-ETU06-01-525-W-133 ETU06 133
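If the stray strings can themselves contain a dash (a hypothetical case, not present in the sample above), the contains('-') mask would let them through. One way to tighten it, as a sketch, is to require the exact number of separators with Series.str.count:

```python
import pandas as pd

# Hypothetical data: "temp-data" has a dash but is not a valid test number
df = pd.DataFrame(
    {"test_no": ["temp data", "temp-data", None, "TP-ETU06-01-525-W-133"]}
)

# Require exactly five dashes (six segments); NaN counts compare as False
mask = df["test_no"].str.count("-").eq(5)

parts = df["test_no"].str.split("-")
df.loc[mask, "sys_no"] = parts[mask].str[1]
df.loc[mask, "package_no"] = parts[mask].str[-1]
print(df)  # only the last row gets sys_no='ETU06', package_no='133'
```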
Upvotes: 1