Sergey Luchko

Reputation: 3336

Python extract whitespace-separated words that may include specific punctuation symbols

I have a text-processing task in which I need to split a string into words properly. I am using Python 3.

This approach doesn't work for me:

re.sub(r"[^\w]", " ", hotelName.lower()).split()

because words in sentences like this:

"[{(St.Augst bridge), South-West]} . a - a Torreluca! B&B O'Dell! & Cabin& Wastlgasse MM-505?."

are split into this list:

 ['st', 'augst', 'bridge', 'south', 'west', 'torreluca', 'b', 'b', 'o',
 'dell', 'cabin', 'wastlgasse', 'mm', '505']

but I need to split the text this way, keeping whole terms intact:

 ["st.augst", "bridge", "South-West", "Torreluca", "B&B", "O'Dell", 
"Cabin", "Wastlgasse", "MM-505"]

It means that I need to split the text by:

I would really appreciate it if somebody familiar with regular expressions could help me with this task. Extracting terms from a document seems to be quite a common task.

Upvotes: 0

Views: 269

Answers (3)

Crispin

Reputation: 2120

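Search for runs of non-whitespace characters between word boundaries (\b). Because each match must start and end on a word character, punctuation at the edges of a token is dropped while punctuation inside it is kept: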

import re

hotel_name = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"

REGEX = r"\b\S+\b"
finder = re.compile(REGEX)

# Call findall on the compiled pattern directly.
matches = finder.findall(hotel_name)
print(matches)

Output:

['St.Augst', 'bridge', 'South-West', 'Torreluca', 'B&B', "O'Dell", 'Cabin', 'Wastlgasse', 'MM-505']
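
As a quick check, here is the same pattern run on the full string from the question. This is only a sketch, and question_text is a name made up for illustration. Note that it also returns the standalone 'a' tokens, which the expected list in the question leaves out, so those would need a separate filter if they are unwanted.

```python
import re

# Full sample string from the question (variable name is illustrative).
question_text = (
    "[{(St.Augst bridge), South-West]} . a - a Torreluca! "
    "B&B O'Dell! & Cabin& Wastlgasse MM-505?."
)

# \b forces every match to start and end on a word character, so
# punctuation at the edges is dropped while inner '.', '-', '&', "'" stay.
tokens = re.findall(r"\b\S+\b", question_text)
print(tokens)
```

Tokens made only of punctuation, such as the lone '.', '-' and '&', contain no word boundary and so produce no match at all.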

Upvotes: 2

Roland Smith

Reputation: 43495

First, remove the characters you don't want with str.translate, then split on whitespace.

In [26]: test = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"

In [27]: test.translate({ord(j): None for j in ',?!()'}).split()
Out[27]: 
['St.Augst',
 'bridge',
 'South-West',
 'Torreluca',
 'B&B',
 "O'Dell",
 'Cabin',
 'Wastlgasse',
 'MM-505']
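
A variation on the same idea, in case the set of unwanted characters is not known in advance: split on whitespace first, then strip punctuation from both ends of each piece. This is a different technique from translate, sketched under the assumption that any ASCII punctuation at the edges of a token is noise; extract_terms is a hypothetical helper name.

```python
import string

def extract_terms(text):
    """Split on whitespace, then trim ASCII punctuation from token edges."""
    terms = []
    for piece in text.split():
        # str.strip only removes characters at the two ends, so inner
        # punctuation such as the '.' in 'St.Augst' is left alone.
        term = piece.strip(string.punctuation)
        if term:  # pieces like '.' or '&' become empty and are skipped
            terms.append(term)
    return terms

print(extract_terms("[{(St.Augst bridge), South-West]} . a - a Torreluca! B&B O'Dell! & Cabin&"))
```

Unlike translate, this leaves a character such as '&' alone inside 'B&B' but removes it from the end of 'Cabin&'.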

Upvotes: 1

Arount

Reputation: 10403

Answer updated to work with Python 3.

There may be a better way, but the following works:

import re
string = "(St.Augst bridge), South-West Torreluca! B&B O'Dell Cabin Wastlgasse MM-505?"
wordlist = re.split(r'[()!?,]|\.?\s+', string)
wordlist = list(filter(lambda a: a != '', wordlist))
print(wordlist)

Output:

['St.Augst', 'bridge', 'South-West', 'Torreluca', 'B&B', "O'Dell", 'Cabin', 'Wastlgasse', 'MM-505']

The regex pattern [()!?,]|\.?\s+ can be read as: "any one of the characters '(', ')', '!', '?' or ',', OR one or more whitespace characters optionally preceded by a dot".

Because delimiters can sit next to each other or at the very start or end of the string (for example ')' followed by ',' and a space), re.split returns a list containing empty strings; that is why I filter the output list on line 4.
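
To make the empty strings visible, here is a small sketch on a shortened input (the sample string is just an illustration). The ')', the ',' and the space each count as a separate delimiter, and re.split puts an empty string between every adjacent pair.

```python
import re

pattern = r'[()!?,]|\.?\s+'
sample = "bridge), South-West"  # shortened input, for illustration only

raw = re.split(pattern, sample)
print(raw)  # ['bridge', '', '', 'South-West']

# Same effect as the filter(...) call above: keep non-empty pieces only.
words = [w for w in raw if w]
print(words)  # ['bridge', 'South-West']
```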

Upvotes: 1
