Reputation: 3336
I have a text-processing task and need to split a string into words properly. I am using Python 3.
This approach does not work for me:
re.sub(r"[^\w]", " ", hotelName.lower()).split()
because words in sentences like this:
"[{(St.Augst bridge), South-West]} . a - a Torreluca! B&B O'Dell! & Cabin& Wastlgasse MM-505?."
are split into this list:
['st', 'augst', 'bridge', 'south', 'west', 'torreluca', 'b', 'b', 'o',
'dell', 'cabin', 'wastlgasse', 'mm', '505']
but I need to split the terms this way (keeping each term whole):
["st.augst", "bridge", "South-West", "Torreluca", "B&B", "O'Dell",
"Cabin", "Wastlgasse", "MM-505"]
This means that I need to split the text by:
I would really appreciate it if somebody familiar with regular expressions could help me with this. Extracting terms from a document seems to be quite a common task.
Upvotes: 0
Views: 269
Reputation: 2120
Search for runs of non-whitespace characters between word boundaries (\b):
import re
hotel_name = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"
# \b anchors each end of a match at a word boundary; \S+ takes the non-whitespace run in between.
REGEX = r"\b\S+\b"
finder = re.compile(REGEX)
matches = finder.findall(hotel_name)
print(matches)
Output:
['St.Augst', 'bridge', 'South-West', 'Torreluca', 'B&B', "O'Dell", 'Cabin', 'Wastlgasse', 'MM-505']
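As a rough check, the same pattern can be run against the full string from the question. This is only a sketch; the names TERM_RE and extract_terms are made up for illustration. Each whitespace-separated chunk yields at most one match, running from its first word character to its last one, so leading and trailing punctuation is dropped while inner characters such as '.', '-', '&' and "'" are kept.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer above.
TERM_RE = re.compile(r"\b\S+\b")


def extract_terms(text):
    """Return the non-whitespace runs that start and end on a word character."""
    return TERM_RE.findall(text)


question_text = (
    "[{(St.Augst bridge), South-West]} . a - a Torreluca! "
    "B&B O'Dell! & Cabin& Wastlgasse MM-505?."
)
terms = extract_terms(question_text)
print(terms)
```

Note that the two stand-alone 'a' tokens are matched as well, because a single letter also sits between two word boundaries. The expected list in the question leaves them out, so an extra filter (for example on token length) may be needed, depending on the rule intended there.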
Upvotes: 2
Reputation: 43495
First, use str.translate to delete the characters you don't want, then split on whitespace.
In [26]: test = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"
In [27]: test.translate({ord(j): None for j in ',?!()'}).split()
Out[27]:
['St.Augst',
'bridge',
'South-West',
'Torreluca',
'B&B',
"O'Dell",
'Cabin',
'Wastlgasse',
'MM-505']
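One caveat: str.translate deletes the listed characters everywhere in the string, including inside a term, so adding '&' to the set would turn 'B&B' into 'BB'. A possible variant, sketched here under the assumption that punctuation should only be trimmed from the two ends of each whitespace-separated token, uses str.strip with string.punctuation. The function name split_terms is hypothetical.

```python
import string


def split_terms(text):
    """Split on whitespace, then trim ASCII punctuation from both ends of each token."""
    trimmed = (token.strip(string.punctuation) for token in text.split())
    # Tokens made only of punctuation (".", "-", "&") become empty and are dropped.
    return [token for token in trimmed if token]


test = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"
print(split_terms(test))
```

Because only the ends of each token are trimmed, there is no need to list every unwanted character by hand, and punctuation inside a term is left alone.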
Upvotes: 1
Reputation: 10403
Answer updated to work with Python 3.
There may be a better way, but the following works:
import re
string = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"
wordlist = re.split(r'[()!?,]|\.?\s+', string)
wordlist = list(filter(lambda a: a != '', wordlist))
print(wordlist)
Output:
['St.Augst', 'bridge', 'South-West', 'Torreluca', 'B&B', "O'Dell", 'Cabin', 'Wastlgasse', 'MM-505']
The regex pattern [()!?,]|\.?\s+
can be read as: any one of the characters '(', ')', '!', '?' or ',', OR a run of whitespace that may be preceded by a dot.
Because some separators sit right next to each other (for example ')' followed by ','), or at the very start or end of the string, re.split
returns a list containing empty strings; that is why I filter the output list at line 4.
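To see where the empty strings come from, here is a small sketch on a shortened input (the fragment "bridge), South" is chosen only for illustration). The ')' and ',' are adjacent separators, and so are ',' and the space, so re.split reports an empty piece between each pair.

```python
import re

PATTERN = r'[()!?,]|\.?\s+'

fragment = "bridge), South"
raw_pieces = re.split(PATTERN, fragment)
# Adjacent separators leave empty strings between them.
print(raw_pieces)

# A list comprehension is an equivalent way to drop the empty pieces.
words = [piece for piece in raw_pieces if piece]
print(words)
```

The list comprehension does the same job as the filter/lambda call above; which one to use is a matter of taste.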
Upvotes: 1