Extracting unigrams and bigrams in a list from text

Question

I have a list of fixed sizes:

sizes = ['extra small', 'small', 'medium', 'large', 'extra large']

I would like to extract any mention of these sizes from a text. However, 'extra small' contains 'small' and 'extra large' contains 'large', which complicates matching when I have text like this:

text1 = 'she wears a small size and he wears an extra large'

I've come up with the following code, which matches the longer strings before trying to match the shorter ones:

import re
sizes = ['extra small', 'small', 'medium', 'large', 'extra large']
text1 = 'she wears a small size and he wears an extra large size'
mentioned_sizes = []

sizes.sort(key=lambda x: len(x.split()), reverse=True)

for x in sizes:
    if len(x.split()) > 1:
        if re.findall(x, text1):
            mentioned_sizes.append(x)
    elif len(x.split()) == 1:
        if (x in text1) and (x not in [item for sublist in [x.split() for x in mentioned_sizes] for item in sublist]):
            mentioned_sizes.append(x)

This gives me ['extra large', 'small'] for the mentioned_sizes, which is what I wanted. However, I ran into a problem when the text becomes this:

text2 = 'she wears a large size and he wears an extra large size'

I'll now get just ['extra large'] for mentioned_sizes, instead of ['extra large', 'large']. How can I extract the sizes that are mentioned in the text?

ScottC · Accepted Answer

If you re-order your sizes so that the two-word sizes come first, you can locate these sizes and then remove them from the text, so that they are not found again when searching for the single-word sizes. Also, by adding to a set, you avoid having to worry about duplicate sizes in mentioned_sizes.

Here is an example:

Code:

sizes = ['extra small', 'extra large', 'small', 'medium', 'large']

text_list = ['she wears a small size and he wears an extra large size',
             'she wears a large size and he wears an extra large size']

for text in text_list:
    mentioned_sizes = set()
    original_text = text
    for size in sizes:
        if size in text:
            mentioned_sizes.add(size)
            text = text.replace(size, "")
    print(f"Text: {original_text}\nMentioned Sizes: {mentioned_sizes}\n")

Output:

Text: she wears a small size and he wears an extra large size
Mentioned Sizes: {'small', 'extra large'}

Text: she wears a large size and he wears an extra large size
Mentioned Sizes: {'large', 'extra large'}

Note:

If you want to use regex, you could do something like this to produce the same output:

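import re

for text in text_list:
    # Alternation is tried left to right, so the two-word sizes must come first in `sizes`
    mentioned_sizes = set(re.findall('|'.join(sizes), text))
    print(f"Text: {text}\nMentioned Sizes: {mentioned_sizes}\n")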

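The regex version above relies on the order of the sizes list and will also match inside longer words (for example 'small' inside 'smaller'). Here is a minimal sketch of a more defensive variant, assuming you want whole-word matches only. The helper name find_sizes is hypothetical and not part of the original code; it sorts the sizes by length so the list order no longer matters, escapes each size with re.escape, and wraps the alternation in \b word boundaries.

```python
import re

def find_sizes(text, sizes):
    # Hypothetical helper: longest sizes first, so 'extra large' is tried before 'large'
    ordered = sorted(sizes, key=len, reverse=True)
    # Escape each size and require word boundaries on both ends of the match
    pattern = r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b"
    return set(re.findall(pattern, text))

sizes = ['extra small', 'small', 'medium', 'large', 'extra large']
text = 'she wears a large size and he wears an extra large size'
print(find_sizes(text, sizes))
```

Because re.findall returns non-overlapping matches, the 'large' inside 'extra large' is consumed by the longer match and is only reported separately when it also appears on its own.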