Reputation: 35

How to match multiple words as a single entry with Regex?

I have a list of items that also includes the type and weight/size of the item. I am trying to extract the item names. I tried several different approaches, but the closest I got was extracting every word as a single entry.

The regex pattern I used:

pattern_2=re.compile(r'[a-zA-Z]+\s')

I get this result:

list=['Milk ','Loaf ','of ','Fresh ','White ','Bread ','Rice ']

The result that I want is this:

list=['Milk','Loaf of Fresh White Bread']

I tried the pattern proposed in "Regular expression matching a multiline block of text", but it matches the entire list as a block.

Portion of my list:

list=['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']

The list itself is longer, so I am trying to find a pattern that can be used for the entire list. Is it possible to write a regex pattern that matches the list items as a whole?

Upvotes: 1

Answers (3)

alexdjulin

Reputation: 79

You can use this regex to match every word or group of words made up of letters and whitespace, which excludes the parenthesized parts. The matches come back with leading and trailing whitespace, which we can remove with the strip() method.

import re

pattern_2=re.compile(r'([a-zA-Z\s]+\s)')

lst = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
string = "Milk (regular) (1 gallon), Loaf of Fresh White Bread (1 lb), Rice (white) (1 lb), Eggs (regular) (12), Local Cheese (1 lb)"

# for a string
result_string = [s.strip() for s in pattern_2.findall(string)]
print(result_string)
# for a list
result_lst = [s.strip() for s in pattern_2.findall(', '.join(lst))]
print(result_lst)

''' Output
['Milk', 'Loaf of Fresh White Bread', 'Rice', 'Eggs', 'Local Cheese']
['Milk', 'Loaf of Fresh White Bread', 'Rice', 'Eggs', 'Local Cheese']
'''
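One caveat worth sketching: the pattern only asks for a run of letters and whitespace that ends in whitespace, so a parenthesized note containing two words can leak a fragment. The item 'Water (1 gallon jug)' below is hypothetical (it is not in the original list), and matching from the start of each item with re.match is just one possible way around it:

```python
import re

pattern_2 = re.compile(r'([a-zA-Z\s]+\s)')

# Hypothetical item, not from the original list: two words inside the parentheses.
item = 'Water (1 gallon jug)'

# findall scans the whole string, so ' gallon ' (letters followed by a space) also matches.
loose = [s.strip() for s in pattern_2.findall(item)]
print(loose)  # ['Water', 'gallon']

# re.match only tries at position 0, so only the leading name is returned.
m = re.match(r'[a-zA-Z\s]+', item)
anchored = m.group().strip() if m else None
print(anchored)  # Water
```

For the five items shown in the question both variants give the same names; the difference only appears with inputs like the hypothetical one above.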

Upvotes: 1

jeff pentagon

Reputation: 856

import re

s = re.findall(r'[^()]+', 'Loaf of Fresh White Bread (1 lb)')[0].rstrip()

To apply this to the whole list, use the following code (given_list -> result_list).

import re

given_list = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
result_list = [re.findall(r'[^()]+', x)[0].rstrip() for x in given_list]
print(result_list) 
# prints ['Milk', 'Loaf of Fresh White Bread', 'Rice', 'Eggs', 'Local Cheese']

Regex can be tricky.

I recommend taking a look at regular expressions and automata theory to become familiar with this tool.

Explanation of the code:

r'[^()]+' can be dissected into []+ and ^().

'[]' defines a set of characters: the characters listed inside the brackets.

'+' means one or more repetitions, so '[]+' means that characters from the set are repeated one or more times.

'^' at the start of the set means the complement, or in simple terms "the set of everything except something". The "something" here is '(' and ')', so the set is "everything but parentheses".

So in human language the pattern means

"a string of any characters except '(' or ')', of length 1 or more."

The findall method finds all non-overlapping substrings that satisfy this condition and returns them as a list.

[0] returns the first element of that list.

rstrip removes the trailing whitespace, since this regex leaves it in the match.
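To make the explanation above concrete, this short sketch prints everything findall returns for one item from the question, which shows why the answer takes [0] and then calls rstrip:

```python
import re

item = 'Milk (regular) (1 gallon)'

# Every maximal run of non-parenthesis characters is a separate match.
pieces = re.findall(r'[^()]+', item)
print(pieces)  # ['Milk ', 'regular', ' ', '1 gallon']

# The first piece is the item name, still carrying its trailing space.
name = pieces[0].rstrip()
print(repr(name))  # 'Milk'
```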

Since you only need the first match, re.search can do the job faster (it finds the first match and stops). Example:

import re

given_list = ['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
result_list = [re.search(r'[^()]+', x).group(0).rstrip() for x in given_list]
print(result_list) 
# prints ['Milk', 'Loaf of Fresh White Bread', 'Rice', 'Eggs', 'Local Cheese']
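Note that re.search returns None when nothing matches (an empty string, for example), and calling .group(0) on None raises AttributeError. A minimal sketch of a guarded version follows; the helper name extract_name and the inputs '' and '()' are assumptions for illustration, not part of the original code:

```python
import re

def extract_name(item):
    """Return the first run of non-parenthesis characters, right-stripped,
    or None if the string contains no such run."""
    m = re.search(r'[^()]+', item)
    return m.group(0).rstrip() if m else None

print(extract_name('Eggs (regular) (12)'))  # Eggs
print(extract_name(''))                     # None
print(extract_name('()'))                   # None
```

Whether to return None, skip the item, or raise is a design choice that depends on how clean the real list is.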

Upvotes: 1

Wiktor Stribiżew

Reputation: 626870

You can use

import re
l=['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
for s in l:
    m = re.search(r'^[a-z]+(?:\s+[a-z]+)*', s, re.I)
    if m:
        print(m.group())

Or, if you use Python 3.8+:

import re
l=['Milk (regular) (1 gallon)', 'Loaf of Fresh White Bread (1 lb)', 'Rice (white) (1 lb)', 'Eggs (regular) (12)', 'Local Cheese (1 lb)']
print( [m.group() for s in l if (m := re.search(r'^[a-z]+(?:\s+[a-z]+)*', s, re.I))] )

Output:

Milk
Loaf of Fresh White Bread
Rice
Eggs
Local Cheese

See the online Python demo.

The ^[a-z]+(?:\s+[a-z]+)* regex matches, at the start of the string, one or more letters followed by zero or more occurrences of one or more whitespace characters and one or more letters. The match is case insensitive due to the re.I option.
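A small sketch of two consequences of this pattern: the match always ends on a letter, so no strip() is needed, and an item that does not start with a letter gives no match at all. The name NAME_RE and the item '2% Milk (1 gallon)' are hypothetical additions for illustration:

```python
import re

# Same pattern as above, precompiled; NAME_RE is a hypothetical name.
NAME_RE = re.compile(r'^[a-z]+(?:\s+[a-z]+)*', re.I)

# The second item is hypothetical: it starts with a digit, not a letter.
items = ['Local Cheese (1 lb)', '2% Milk (1 gallon)']

names = []
for s in items:
    m = NAME_RE.search(s)
    # Record None instead of silently skipping, to make the non-match visible.
    names.append(m.group() if m else None)

print(names)  # ['Local Cheese', None]
```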

Upvotes: 1
