Prakhar Jhudele

Reputation: 965

Get substring in Python inside list elements - Web Scraping

Please pardon me, I am new to Python and Selenium. I am scraping a supermarket website. The item names I get include the quantity, as shown below. I want to separate the quantity from the name for the different cases and items below:

Cases

Fresh Value Colocasia 250g

Fresh Value Banana Robusta 1kg

Fresh Value Raw Papaya 1 U (units) (300g-400g)

Fresh Value Premium Pomegranate Kabul (500g - 700g)

Output Needed:

Name = Fresh Value Colocasia, Quantity = 250g

Name = Fresh Value Banana Robusta, Quantity = 1kg

Name = Fresh Value Raw Papaya, Quantity = 1 U (units) (300g-400g)

There are hundreds of items like this. I have tried using

str.split()

but did not get the output I need. I have also tried regex, but I am not sure how it works. Is there a way to split the string at the first number found in it? Any suggestions would help.

Upvotes: 1

Views: 364

Answers (3)

Yicen Tian

Reputation: 31

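import re

def substring(string):
    # Split once and reuse the list instead of calling split() repeatedly.
    words = string.split()
    # Default to an empty quantity so an item without any number does not raise an error.
    name, quantity = string.strip(), ""
    for i in range(1, len(words)):
        # The quantity starts at the first word (after the first) that contains a digit.
        if re.search(r'\d', words[i]):
            name = " ".join(words[:i])
            quantity = " ".join(words[i:])
            break
    return {"Name": name, "Quantity": quantity}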

Then put strings into this function like:

substring("Fresh Value Raw Papaya 1 U (units) (300g-400g)")

And you will get:

{'Name': 'Fresh Value Raw Papaya', 'Quantity': '1 U (units) (300g-400g)'}

Upvotes: 1

Joe

Reputation: 12417

One option (according to the data samples that you provided) can be this:

import re
strings = ['Fresh Value Colocasia 250g', 'Fresh Value Banana Robusta 1kg', 'Fresh Value Raw Papaya 1 U (units) (300g-400g)','Fresh Value Premium Pomegranate Kabul (500g - 700g)']
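for i in strings:
    # The first digit or opening parenthesis marks the start of the quantity.
    start = re.findall(r'\d|\(', i)[0]
    # maxsplit=1: a later repeat of the same character must not truncate the quantity.
    name, rest = i.split(start, 1)
    name = name.strip()
    quantity = start + rest
    print('Name = ' + name + ', Quantity = ', quantity)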

Output:

Name = Fresh Value Colocasia, Quantity =  250g
Name = Fresh Value Banana Robusta, Quantity =  1kg
Name = Fresh Value Raw Papaya, Quantity =  1 U (units) (300g-400g)
Name = Fresh Value Premium Pomegranate Kabul, Quantity =  (500g - 700g)

Of course, this only works if digits and parentheses appear only in the quantity and not in the name. If the quantity can start with other symbols, add them to the pattern passed to findall.
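A rough sketch of the same idea using re.search and string slicing instead of split, in case it helps. The names QUANTITY_START and split_item are made up for illustration, and it assumes (as above) that the quantity begins at the first digit or opening parenthesis. An item with no such character gets an empty quantity instead of raising an IndexError.

```python
import re

# Assumption: the quantity starts at the first digit or "(".
# Add other starting symbols inside the brackets if your data needs them.
QUANTITY_START = re.compile(r"[\d(]")

def split_item(item):
    """Return (name, quantity); quantity is '' if no start character is found."""
    match = QUANTITY_START.search(item)
    if match is None:
        return item.strip(), ""
    # Slice at the match position instead of splitting on the matched character.
    return item[:match.start()].strip(), item[match.start():].strip()

items = [
    "Fresh Value Colocasia 250g",
    "Fresh Value Premium Pomegranate Kabul (500g - 700g)",
]
for item in items:
    name, quantity = split_item(item)
    print("Name = {}, Quantity = {}".format(name, quantity))
```

Slicing at match.start() avoids the problem of the start character appearing again later in the string.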

Upvotes: 1

user2906838

Reputation: 1178

You can also try this:

import re

def split_unit(stri):
    # Find the first run of digits in the string.
    to_split = re.findall(r"\d+", stri)[0]
    # Take everything after its first occurrence and put the digits back in front.
    splitted = to_split + stri.split(to_split, 1)[1]
    print(splitted)

split_unit("Fresh Value Colocasia 250g")  # outputs: 250g
split_unit("Fresh Value Banana Robusta 1kg")  # outputs: 1kg
split_unit("Fresh Value Raw Papaya 1 U (units) (300g-400g)")  # outputs: 1 U (units) (300g-400g)

And so on. What I've done is first find the first occurrence of an integer in the string using a regex (the first line inside the function). Then I use str.split() with a maximum of one split to take everything after that first integer, and concatenate it with to_split, which is the first integer itself.
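If you need the name as well as the quantity, here is a possible variation using re.split with a lookahead. This is only a sketch: the function name split_name_quantity is hypothetical, and treating "(" as a possible start of the quantity is my assumption to cover the "(500g - 700g)" case from the question.

```python
import re

def split_name_quantity(item):
    # Split once at the whitespace that comes right before the first digit or "(".
    # Assumption: "(" may also start a quantity, e.g. "(500g - 700g)".
    parts = re.split(r"\s+(?=[\d(])", item.strip(), maxsplit=1)
    if len(parts) == 1:
        # No split point found: treat the whole string as the name.
        return parts[0], ""
    return parts[0], parts[1]

name, quantity = split_name_quantity("Fresh Value Raw Papaya 1 U (units) (300g-400g)")
print("Name = " + name + ", Quantity = " + quantity)
```

Because the pattern requires whitespace before the digit, a digit at the very beginning of the string is not treated as a split point.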

Upvotes: 0
