Reputation: 386
I would like to split the following
11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey
11/2/2019 Pending sale $959,000
into
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
['11/2/2019', 'Pending sale', '$959,000']
I've tried regex, but haven't had any luck working out a re.split() pattern
that splits on spaces except between words and after commas.
How can I accomplish this?
Upvotes: 3
Views: 331
Reputation: 25
Try this code:
import re
l = '11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey'
l = l.replace(" ", '&')  # mark every space with '&'; pick a character you are sure won't be in your string
l = l.replace(',&', ', ')  # restore the space after a comma so the comma-separated names stay together
# Restore the space wherever '&' sits between two words: the character before it is not a
# digit, comma, space or '%', and the character after it is not a digit, comma, space or '$'.
result = re.sub(r'([^0-9, %])&([^0-9, $])', r'\1 \2', l)
result = result.split('&')  # only the items that must be split still have '&' between them
print(result)
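As a rough sketch, the same steps could be wrapped in a function so that both sample lines go through one code path. The name split_with_placeholder and the default placeholder "\x00" are my own choices for illustration, not anything from the question; the only assumption is that the placeholder never appears in the input.

```python
import re

def split_with_placeholder(line, placeholder="\x00"):
    # Assumption: `placeholder` never occurs in `line`.
    marked = line.replace(" ", placeholder)
    # Put the space back after commas so comma-separated names stay together.
    marked = marked.replace("," + placeholder, ", ")
    p = re.escape(placeholder)
    # Put the space back between two words; every other placeholder is a field boundary.
    marked = re.sub(rf"([^0-9, %]){p}([^0-9, $])", r"\1 \2", marked)
    return marked.split(placeholder)

sold = split_with_placeholder(
    "11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey"
)
pending = split_with_placeholder("11/2/2019 Pending sale $959,000")
print(sold)
print(pending)
```

Because the substitution consumes the character after the placeholder, a run of one-letter words may not be fully rejoined; that does not come up in the sample data.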
Upvotes: 0
Reputation: 147156
You can use this regex, which looks for a space which is not preceded by a letter or comma, or is not followed by a letter:
(?<![a-z,]) | (?![a-z])
In Python:
import re
a = "11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey"
b = "11/2/2019 Pending sale $959,000"
print(re.split(r'(?<![a-z,]) | (?![a-z])', a, flags=re.IGNORECASE))
print(re.split(r'(?<![a-z,]) | (?![a-z])', b, flags=re.IGNORECASE))
Output:
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
['11/2/2019', 'Pending sale', '$959,000']
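If you have many lines to process, one possible way to package the same pattern is to compile it once and reuse it. This is only a sketch; the names FIELD_SEP and split_fields are hypothetical.

```python
import re

# Split on a space that is not preceded by a letter or comma,
# or that is not followed by a letter.
FIELD_SEP = re.compile(r'(?<![a-z,]) | (?![a-z])', flags=re.IGNORECASE)

def split_fields(line):
    # Returns the list of fields for one line of the listing history.
    return FIELD_SEP.split(line)

lines = [
    "11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey",
    "11/2/2019 Pending sale $959,000",
]
rows = [split_fields(line) for line in lines]
for row in rows:
    print(row)
```

Note that [a-z] only covers ASCII letters, so a name containing a letter such as é right before a space would still be split at that space.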
Upvotes: 3
Reputation: 58
Where are you getting your data from? Is it from a CSV? Can you change the separators into commas or something else?
Right now you can only use spaces for your separator.
E.g.:
>>> x = '11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey'
>>> x.split(" ")
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne', 'Freeze-Manning,', 'Kevin', 'Garvey']
Notice that it chops up the string 'Suzanne Freeze-Manning, Kevin Garvey'
If you had tabs as your separators, you could easily do something like this:
E.g.:
>>> x = '11/27/2019\tSold\t$900,000\t-6.2%\tSuzanne Freeze-Manning, Kevin Garvey'
>>> print(x)
11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey
>>> x.split("\t")
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
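If the data really could be exported with tab separators, the standard csv module would read it row by row. A minimal sketch, assuming a tab-separated export that I made up from the two rows in the question (the variable data is hypothetical, standing in for a real file):

```python
import csv
import io

# Hypothetical tab-separated export of the two rows from the question.
data = (
    "11/27/2019\tSold\t$900,000\t-6.2%\tSuzanne Freeze-Manning, Kevin Garvey\n"
    "11/2/2019\tPending sale\t$959,000\n"
)

# io.StringIO stands in for an open file; delimiter="\t" selects tab-separated input.
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))
for row in rows:
    print(row)
```

With a real file you would pass the file object opened with newline='' instead of the io.StringIO wrapper.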
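Optionally, if you will always have 5 columns of data, as in your first string, you could tell it to stop splitting after the fourth split by passing a maxsplit of 4.
E.g.:
>>> x = '11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey'
>>> x.split(" ", 4)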
['11/27/2019', 'Sold', '$900,000', '-6.2%', 'Suzanne Freeze-Manning, Kevin Garvey']
See https://docs.python.org/3.6/library/stdtypes.html#str.split for more details about the separators.
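One caveat with limiting the number of splits: it only helps when the multi-word field comes last. A quick sketch comparing both lines from the question (the variable names are mine):

```python
first = '11/27/2019 Sold $900,000 -6.2% Suzanne Freeze-Manning, Kevin Garvey'
second = '11/2/2019 Pending sale $959,000'

# At most four splits, so everything after the fourth space stays in one piece.
first_fields = first.split(" ", 4)
# The second line has only three spaces, so "Pending sale" is still split in two.
second_fields = second.split(" ", 4)

print(first_fields)
print(second_fields)
```

So for rows like the second one, where the multi-word value sits in the middle, a space-based split would need extra handling.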
Upvotes: 0