Reputation: 3282
I have a regex that looks like this to extract order numbers from columns:
df["Orders"].str.extract('([0-9]{9,10}[/+ #_;.-]?)')
The orders column can look like this:
12
123456789
1234567890
123456789/1234567890
123456789/1/123456789
123456789+1234567890
The resulting new column in the dataframe after the regex should look like this:
NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789+1234567890
However, with my current regex I'm getting the following result:
NaN
123456789
1234567890
123456789/
123456789/
123456789+
How can I get the result that I'm looking for?
Upvotes: 2
Views: 576
Reputation: 317
You can adapt the next code to work with data frames,
RegExp: (?:^|([/+ #_;.-]))(?:\d{1,8})(?!\d)
(?:\d{1,8})(?!\d)
- find a number (<9 digits)([/+ #_;.-])
- preceded by one/none of the possible delimiters (group #1)Conditionally replace with NaN
or empty string - subst
uses match.group(1)
to differ between the two options:
12
/1
import re
regex = r"(?:^|([/+ #_;.-]))(?:\d{1,8})(?!\d)"
test_str = ("12\n"
"123456789\n"
"1234567890\n"
"123456789/1234567890\n"
"123456789/1/123456789\n"
"123456789+1234567890")
def subst(match):
m = match.group(1)
return "" if m else "NaN"
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
print(result)
Output:
NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789+1234567890
Upvotes: 0
Reputation: 626748
You can use
import pandas as pd
df = pd.DataFrame({'Orders':['12','123456789','1234567890','123456789/1234567890','123456789/1/123456789','123456789+1234567890', 'Order number: 6508955960_000010_1005500']})
df["Result"] = df["Orders"].str.findall(r'[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])').str.join('').str.lstrip('/+ #_;.-')
df.loc[df['Result'] == '', 'Result'] = np.nan
See the regex demo. Details
[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])
- matches an optional /
, +
, space, #
, _
, ;
, .
or -
char, and then none or ten digit number not enclosed with other digitsSeries.str.findall
extracts all occurrences.str.join('')
concatenates the matches into a single string.str.lstrip('/+ #_;.-')
- removes the special chars that were matched with the number at the beginning of the stringdf.loc[df['Result'] == '', 'Result'] = np.nan
- if needed - replaces empty strings with np.nan
values in the Result
column.Output:
>>> df
Orders Result
0 NaN NaN
1 123456789 123456789
2 1234567890 1234567890
3 123456789/1234567890 123456789/1234567890
4 123456789/1/123456789 123456789/123456789
5 123456789+1234567890 123456789+1234567890
>>>
Upvotes: 1