Reputation: 373
I have long strings such as
"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
and
"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
I want to split them based on the pattern "a number, a space, a dash, a space, some string until the next number, a space, a dash, a space or end of string". Notice that the string may contain commas, ampersands, '>' and other special characters, so splitting on them will not work. I think there is a way in Python to split based on regular expressions but I have trouble forming that.
I have a very introductory knowledge of regular expressions. I can form a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".
Update: Expected output:
first case:
["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]
second case:
["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]
Upvotes: 6
Views: 3708
Reputation: 106465
If numbers appear only at the beginning of each segment of strings, you can do:
import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))
This outputs:
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
This regex pattern uses \d+ to match the number first, then \D+ to match non-digits, and the lookahead (?=,\s*\d|$) to make sure that the run of non-digits stops where it is followed by either a comma, optional whitespace and another digit, or the end of the string, so that the resulting match does not include a trailing comma and space.
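Since the question asks about splitting, the same lookahead idea can also be sketched with re.split: split on a comma only when it is followed by a number, whitespace, a dash and whitespace. This is a minimal sketch rather than part of the answer above; the helper name split_categories is hypothetical, and it assumes items are always separated by a comma plus optional whitespace.

```python
import re

def split_categories(s):
    # Hypothetical helper: split on ", " only when the text that follows
    # looks like "<digits> - ". The lookahead keeps the number in the next item.
    return re.split(r',\s*(?=\d+\s+-\s)', s)

s1 = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
print(split_categories(s1))
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```

Commas inside an item, such as the one in "Apparel, Accessories", are left alone because they are not followed by a number and a dash.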
Upvotes: 3
Reputation: 5950
Here is the pattern: each item starts with a number, so we use [0-9]+, followed by letters, spaces and special characters such as &, -, > and the comma, so we can use [a-zA-Z \-&>,]+:
>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
'5678 - Apparel, Accessories & Luxury Goods, ',
'9876 - Leisure Products']
For the other string mentioned in the question:
>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ',
'4321 - Apparel & Accessories > Handbags, Wallets & Cases, ',
'187 - Apparel & Accessories > Shoes']
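Note that every match except the last keeps its trailing ", " separator, which the expected output in the question does not have. One possible way to remove it, sketched here under the assumption that no item legitimately ends in a comma or space, is to strip those characters from each match:

```python
import re

str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

# Same character-class pattern as above; the capturing group and the (?is)
# flags are dropped because they do not change what this pattern matches.
matches = re.findall(r"[0-9]+[a-zA-Z \-&>,]+", str_)

# rstrip(", ") removes any run of trailing commas and spaces from each match.
items = [m.rstrip(", ") for m in matches]
print(items)
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```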
Upvotes: 7
Reputation: 828
Surely it is as simple as starting a new match whenever you encounter a digit?
import re
s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
re.findall(r'\d+\D+', s)
['123 - Footwear, ',
'5678 - Apparel, Accessories & Luxury Goods, ',
'9876 - Leisure Products']
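This relies on digits appearing only at the start of each item, and the matches keep their trailing ", ". As an illustration of the first limitation, here is a small sketch with a made-up input (the string below is hypothetical, not from the question) in which a category name itself contains a number:

```python
import re

# Hypothetical input: the first category name contains the number 10.
s = "12 - Size 10 Shoes, 34 - Hats"

parts = re.findall(r'\d+\D+', s)
print(parts)
# ['12 - Size ', '10 Shoes, ', '34 - Hats']
```

The first item is cut in two because \D+ stops at the first digit it meets, whether or not that digit begins a new "number - name" entry.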
Upvotes: 2
Reputation: 626748
You may match substrings that start with one or more digits followed by 1+ whitespaces, a -, and 1+ whitespaces, and that run up to the next occurrence of the same pattern or the end of the string.
re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)
Note: If the leading number always has more than one digit, say at least two, you may replace \d+ with \d{2,}, etc. Adjust as you see fit.
Pattern details:
\d+ - 1+ digits
\s+-\s+ - a - enclosed with 1+ whitespaces
.*? - any 0+ chars, as few as possible, up to the location in the string that is followed with...
(?=\s*(?:,\s*)?\d+\s+-\s|\Z) - (a positive lookahead):
    \s*(?:,\s*)?\d+\s+-\s - 0+ whitespaces, an optional substring of a comma and 0+ whitespaces after it, 1+ digits, 1+ whitespaces, - and a whitespace
    | - or
    \Z - end of string

import re
rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
texts = ["123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"]
for s in texts:
    print("--- {} ---".format(s))
    print(re.findall(rx, s, re.S))
Output:
--- 123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products ---
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
--- 321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes ---
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
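If the number and the name are needed separately, the same pattern can be given two capturing groups so that re.findall returns (number, name) pairs. This is only a sketch built on the pattern above; the function name parse_categories and the choice of a dict keyed by integer are assumptions for illustration, and duplicate numbers would overwrite each other.

```python
import re

# Same pattern as above, with capturing groups around the number and the name.
PAIR_RX = re.compile(r"(\d+)\s+-\s+(.*?)(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", re.S)

def parse_categories(s):
    # Hypothetical helper: map each leading number (as int) to its name.
    return {int(num): name for num, name in PAIR_RX.findall(s)}

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
print(parse_categories(s))
# {123: 'Footwear', 5678: 'Apparel, Accessories & Luxury Goods', 9876: 'Leisure Products'}
```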
Upvotes: 2