Tapal Goosal

Reputation: 373

Splitting a string based on a pattern in Python

I have long strings such as

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

I want to split them based on the pattern "a number, a space, a dash, a space, then some text up to the next number, space, dash, space, or the end of the string". Notice that the text may contain commas, ampersands, '>' and other special characters, so splitting on those characters alone will not work. I believe Python can split on a regular expression, but I am having trouble writing the pattern.

I have only an introductory knowledge of regular expressions. I can write a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".


Update: Expected output:

first case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]

Upvotes: 6

Views: 3708

Answers (4)

blhsing

Reputation: 106465

If numbers appear only at the beginning of each segment of strings, you can do:

import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))

This outputs:

['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']

This regex pattern uses \d+ to match the number first, then \D+ to match the non-digit characters that follow, and the lookahead (?=,\s*\d|$) to make sure the run of non-digits stops where it is followed by either a comma, optional whitespace and another digit, or the end of the string. As a result, the match does not include the trailing comma and space.
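Since the question asks about splitting, the same lookahead idea can also be used with re.split. The following is a minimal sketch, not part of the original answer: the helper name split_segments is hypothetical, and it assumes every segment starts with "number, whitespace, dash, whitespace" as in the question's examples.

```python
import re

def split_segments(text):
    # Split on a comma plus optional whitespace, but only where a new
    # "number - " prefix follows. The lookahead does not consume that
    # prefix, so it stays at the start of the next item.
    return re.split(r",\s*(?=\d+\s+-\s)", text)

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
print(split_segments(s))
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```

Because the separator must be followed by a full "number - " prefix, commas inside a label (such as "Apparel, Accessories") are left alone.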

Upvotes: 3

akash karothiya

Reputation: 5950

Here is the pattern: each segment starts with a number, so we use [0-9]+, followed by text that may contain spaces, commas and special characters like & - >, so we can use [a-zA-Z \-&>,]+. Note that every match except the last keeps its trailing comma and space:

>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

The other string from the question:

>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ', 
 '4321 - Apparel & Accessories > Handbags, Wallets & Cases, ', 
 '187 - Apparel & Accessories > Shoes']
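One thing to keep in mind with an explicit character class is that any character not listed in it ends the match early. The sketch below illustrates this with a made-up input (the label "Men's Footwear" is hypothetical, not from the question); the pattern is the one above without the flags and the capture group, which do not change the result here.

```python
import re

pattern = r"[0-9]+[a-zA-Z \-&>,]+"

# Hypothetical input: the apostrophe is not in the character class.
sample = "123 - Men's Footwear, 5678 - Apparel"

matches = re.findall(pattern, sample)
print(matches)
# ['123 - Men', '5678 - Apparel']
```

The first label is silently cut at the apostrophe, so this approach works best when the set of characters in the labels is known in advance.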

Upvotes: 7

jhole89

Reputation: 828

Surely it is as simple as just splitting when you encounter a numeric?

import re
s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
re.findall(r'\d+\D+', s)

['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']
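The matches above still carry the ", " separator, which differs from the expected output in the question. A minimal sketch of one way to clean that up is to strip trailing commas and spaces from each match; this assumes no label legitimately ends with a comma.

```python
import re

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

# rstrip(", ") removes any run of trailing commas and spaces from each match.
parts = [m.rstrip(", ") for m in re.findall(r"\d+\D+", s)]
print(parts)
# ['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
```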

Upvotes: 2

Wiktor Stribiżew

Reputation: 626748

You may match substrings that start with one or more digits followed by 1+ whitespaces, -, and 1+ whitespaces, and that end right before the next occurrence of the same pattern or at the end of the string.

re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)

Note: If the leading number always has more than one digit, say it is at least a 2-digit number, you may replace \d+ with \d{2,}, etc. Adjust as you see fit.

Regex details:

  • \d+ - 1+ digits
  • \s+-\s+ - a - enclosed with 1+ whitespaces
  • .*? - any 0+ chars, as few as possible, up to the location in string that is followed with...
  • (?=\s*(?:,\s*)?\d+\s+-\s|\Z) - (a positive lookahead):
    • \s*(?:,\s*)?\d+\s+-\s - 0+ whitespaces, an optional substring of a comma and 0+ whitespaces after it, 1+ digits, 1+ whitespaces, - and a whitespace
    • | - or
    • \Z - end of string

Python demo:

import re

rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
texts = ["123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"]
for s in texts:
    print("--- {} ---".format(s))
    print(re.findall(rx, s, re.S))

Output:

--- 123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products ---
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
--- 321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes ---
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
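Requiring the full "number, whitespace, dash, whitespace" prefix matters when a label itself contains digits. The sketch below compares the pattern above with the simpler \d+\D+ on a made-up input (the label "Footwear 2 Go" is hypothetical, not from the question).

```python
import re

loose = r"\d+\D+"
strict = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"

# Hypothetical input: the first label contains a digit.
sample = "123 - Footwear 2 Go, 5678 - Apparel"

loose_result = re.findall(loose, sample)
strict_result = re.findall(strict, sample, re.S)

print(loose_result)
# ['123 - Footwear ', '2 Go, ', '5678 - Apparel']
print(strict_result)
# ['123 - Footwear 2 Go', '5678 - Apparel']
```

The looser pattern starts a new item at every digit, while the stricter one only treats a number as a boundary when it is followed by " - ".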

Upvotes: 2
