sanna

Reputation: 1538

Retrieve a specific substring from each element in a list

I have been stuck on this for a few hours: I have a Series called size_col with 887 elements, and I want to extract the size (S, M, L or XL) from each one. I have tried two different approaches, a list comprehension and a simple if/else loop, but neither works.

sizes = ['S', 'M', 'L', 'XL']

tshirt_sizes = []
[tshirt_sizes.append(i) for i in size_col if i in sizes]

Second attempt:

sizes = []
for i in size_col:
    if len(i) < 15:
        sizes.append(i.split(" / ", 1)[-1])
    else:
        sizes.append(i.split(" - ", 1)[-1])

I created two conditions because in some cases the size follows a ' - ' and in others it follows a '/'. I honestly don't know how to deal with that.

Example of the list:

T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "Honey" - L
T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "I do very bad things" - M
T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "Stai nel tuo (mind your business)" - White / S
T-Shirt Donna "Stay Stronz" - White / L
T-Shirt Donna "Stay Stronz" - White / M
T-Shirt Donna "Si dai. Ciao." - S
T-Shirt Donna "Je suis esaurit" - Black / S
T-Shirt Donna "Si dai. Ciao." - S
T-Shirt Donna "Teamo - Tequila" - S / T-Shirt

Upvotes: 0

Views: 70

Answers (4)

codewhat

Reputation: 11

Here's a modified version of your second attempt using regex:

import re
    
sizes = []
for i in size_col:
    # Match a whole size token; a character class like [SMLXL]+ would also accept "LL" or "SM"
    size_pattern = re.search(r'(?i)\b(?:XL|S|M|L)\b', i)
    if size_pattern:
        sizes.append(size_pattern.group().upper())

Upvotes: 0

cs95

Reputation: 402483

You'll need regular expressions here. Precompile a regex pattern and then use pattern.search inside a list comprehension.

import re

sizes = ['S', 'M', 'L', 'XL']
p = re.compile(r'\b({})\b'.format('|'.join(sizes)))

tshirt_sizes = [p.search(i).group(0) for i in size_col]

print(tshirt_sizes)
['M', 'L', 'M', 'M', 'M', 'S', 'L', 'M', 'S', 'S', 'S', 'S']

For added safety, you may want a loop instead, since a list comprehension cannot catch the error raised when a row has no match:

tshirt_sizes = []
for i in size_col:
    try:
        tshirt_sizes.append(p.search(i).group(0))
    except AttributeError:
        tshirt_sizes.append(None)

Really, the only reason to use regex here is to handle the last row in your data appropriately. In general, you should prefer string operations (namely, str.split) unless regex is unavoidable; they're much faster and more readable than regular-expression-based pattern matching and extraction.
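As a rough sketch of the string-operations route, assuming every row looks like the samples in the question (the size sits after the last " - ", optionally next to a colour or product word separated by " / "). The names rows, known_sizes and size_from_split are made up for this illustration, and rows stands in for size_col:

```python
# Sample rows copied from the question; `rows` is a stand-in for size_col.
rows = [
    'T-Shirt Donna "Si dai. Ciao." - M',
    'T-Shirt Donna "Stay Stronz" - White / L',
    'T-Shirt Donna "Teamo - Tequila" - S / T-Shirt',
]
known_sizes = {"S", "M", "L", "XL"}


def size_from_split(title):
    """Return the size found after the last " - ", or None if there is none."""
    # Keep only the text after the last " - " (e.g. "M", "White / L", "S / T-Shirt").
    tail = title.rsplit(" - ", 1)[-1]
    # That tail may hold several " / "-separated parts; return the one that is a size.
    for part in tail.split(" / "):
        if part.strip() in known_sizes:
            return part.strip()
    return None


split_sizes = [size_from_split(row) for row in rows]
print(split_sizes)  # ['M', 'L', 'S']
```

Rows that do not follow this layout yield None rather than raising an error, so they can be inspected afterwards.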

Upvotes: 3

EdR

Reputation: 533

There are two aspects to this question: 1) the best method of looping over the elements and 2) the correct way to split the string.

In the general case, a list comprehension is probably the right approach for this type of problem, but you have correctly identified that splitting the string correctly is the tricky part.

For this type of problem regular expressions are very powerful and (at the risk of complicating this compared to the previous answers) you could use something like:

import re
pattern = re.compile(r'[-/] ([A-Z]+)$') # select any uppercase letters after either - or / and a space and before the end of the line (marked by $)

sizes = [pattern.search(item).group(1) for item in size_col] # group 1 selects the set of characters in the first set of parentheses (the letters)

Edited: I just saw the edit to the post stating that the size is not always at the end, so this pattern will not match those rows, and COLDSPEED's answer duplicates this one...

Upvotes: 0

Gianluca Micchi

Reputation: 1653

You can do something like this:

available_sizes = ["S", "M", "L", "XL"]
sizes = []

for i in size_col:
    for w in i.split():
        if w in available_sizes:
            sizes.append(w)

This wouldn't work if the text contains more than one of the words in available_sizes, for example T-Shirt Donna "La S è la più bella consonante" - M, since it would add both S and M to the list.
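One possible way around that, sketched under the assumption that the size always comes after the closing double quote of the title (true for the sample rows, not verified for all 887). The helper name size_after_quote is made up for this illustration:

```python
available_sizes = ["S", "M", "L", "XL"]


def size_after_quote(title):
    """Return the first size word after the last double quote, or None."""
    # rpartition returns the text after the last '"'; if there is no quote at all,
    # that third element is the whole title.
    tail = title.rpartition('"')[2]
    for w in tail.split():
        if w in available_sizes:
            return w
    return None


print(size_after_quote('T-Shirt Donna "La S è la più bella consonante" - M'))  # M
print(size_after_quote('T-Shirt Donna "Teamo - Tequila" - S / T-Shirt'))  # S
```

Because it returns a single value (or None) per row, the result stays aligned with size_col, which the nested loop above does not guarantee.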


Original answer, before OP specified that the size is not always the last word.

Almost. Just split the string into words and take the last one.

sizes = []
for i in size_col:
    sizes.append(i.split()[-1])

Upvotes: 0
