turtle02

Reputation: 613

Python text parsing to get filtered output

My goal is to search file.txt for an identifying string and then output the words that follow it between the quotation marks.

The identifier would be data-default-alt= and the name of the item is "Ford Truck" in quotes. I would like to output the name of the item and the price so that I can open the result in Excel.

data-default-alt="Ford Truck">       </h3>     </a>           </div>     <div class="tileInfo">                <div class="swatchesBox--empty"></div>                                                     <div class="promo-msg-text">           <span class="calloutMsg-promo-msg-text"></span>         </div>                              <div class="pricecontainer" data-pricetype="Stand Alone">               <p id="price_206019013" class="price price-label ">                  $1,000.00               </p> 

Desired Output would be

Ford Truck 1000.00

I am not sure how to go about this task.

Upvotes: 0

Views: 337

Answers (2)

Yavar

Reputation: 11933

You will want more robust regular expressions for matching the cost and the brand, but here is some code to get you started.

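import re

# Sample text from the question, with the extra whitespace removed.
# Named "text" rather than "str" so the built-in str type is not shadowed.
text = 'data-default-alt="Ford Truck"></h3></a></div><div class="tileInfo"><div class="swatchesBox--empty"></div><div class="promo-msg-text"> <span class="calloutMsg-promo-msg-text"></span> </div><div class="pricecontainer" data-pricetype="Stand Alone"><p id="price_206019013" class="price price-label ">$1,000.00</p>'

# Raw strings keep the backslashes in the patterns intact.
brand = re.search(r'data-default-alt="(.*?)"', text)
# \s* allows for whitespace between the price and the closing tag.
cost = re.search(r'\$(\d+,?\d*\.\d+)\s*</p>', text)
if brand:
    print(brand.group(1))
if cost:
    print(cost.group(1))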
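
If file.txt contains several items, the same idea can be extended to collect every name/price pair and write them out as CSV, which Excel can open. The following is only a sketch: it assumes each item's name appears before its price, as in the sample from the question, and the names ITEM_RE, extract_items and write_csv are made up for illustration.

```python
import csv
import io
import re

# Assumption: every data-default-alt="..." is followed by that item's price.
ITEM_RE = re.compile(
    r'data-default-alt="(.*?)"'   # item name inside the quotes
    r'.*?'                        # markup between the name and the price
    r'\$\s*([\d,]+\.\d{2})',      # price such as 1,000.00
    re.DOTALL,                    # let "." span line breaks in the file
)

def extract_items(html):
    """Return a list of (name, price) tuples with commas removed from the price."""
    return [(name, price.replace(",", "")) for name, price in ITEM_RE.findall(html)]

def write_csv(items, out):
    """Write a header row and one row per item to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(["name", "price"])
    writer.writerows(items)

# Shortened version of the sample from the question.
sample = (
    'data-default-alt="Ford Truck">  </h3> <div class="pricecontainer">'
    '<p id="price_206019013" class="price price-label ">   $1,000.00   </p>'
)
items = extract_items(sample)
buffer = io.StringIO()
write_csv(items, buffer)
print(buffer.getvalue())
```

To produce a real file, the StringIO buffer could be swapped for something like open("items.csv", "w", newline="") (the file name is hypothetical). Note that if an item has no price, this pattern would pair its name with the next price it finds, so a proper HTML parser is likely the safer choice for messy pages.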

Upvotes: 1

illright

Reputation: 4043

Use the built-in string methods to find the index of a substring. For example, "abcdef".find("bc") returns 1, which is the index of the first letter of the substring (find returns -1 when the substring is not present). To parse your string, you could look for the tags and then extract the needed text using string slicing.
Here is an example of solving your problem, assuming the text to parse is stored in a variable named st:

with open("file.txt") as f:
    st = f.read() # that's to get the file contents
name_start = st.find('data-default-alt="') + len('data-default-alt="') # found the first letter's index and added the substring's length to it to skip to the part of the actual data
name_end = st[name_start:].find('"') # found the closing quote
name = st[name_start:name_start + name_end] # sliced the string to get what we wanted

price_start = st.find('class="price price-label ">') + len('class="price price-label ">')
price_end = st[price_start:].find('</p>')
price = st[price_start:price_start + price_end].strip() # strip() trims whitespace from both ends

The results are in the name and price variables. If you want to work with the price as a number and don't want the dollar sign, add it to the strip arguments, e.g. .strip("$ \n\t") (once you pass an argument, only the listed characters are stripped; read more on that method in the Python docs). You can remove the comma by calling replace(",", "") on the price string and then convert the result to a number with float(price).
Note: it may just be the way you pasted the string, but I've added strip() to get rid of the whitespace on each end of the price string.
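
The steps above (find the marker, slice, clean up the price, convert it to a float) could be wrapped in a function and repeated for every item in the text. This is a rough sketch rather than a definitive implementation: parse_item and parse_all are hypothetical helper names, the second item in the sample is made up, and the markers are assumed to look exactly like those in the question.

```python
NAME_MARKER = 'data-default-alt="'
PRICE_MARKER = 'class="price price-label ">'

def parse_item(st, start=0):
    """Return (name, price, next_pos) for the first item at or after start, else None."""
    name_start = st.find(NAME_MARKER, start)
    if name_start == -1:
        return None
    name_start += len(NAME_MARKER)        # skip past the marker itself
    name_end = st.find('"', name_start)   # closing quote of the name

    price_start = st.find(PRICE_MARKER, name_end)
    if name_end == -1 or price_start == -1:
        return None
    price_start += len(PRICE_MARKER)
    price_end = st.find('</p>', price_start)
    if price_end == -1:
        return None

    # Trim whitespace, drop the dollar sign and the thousands separator.
    raw_price = st[price_start:price_end].strip().lstrip("$").replace(",", "")
    return st[name_start:name_end], float(raw_price), price_end

def parse_all(st):
    """Collect (name, price) for every item found in st."""
    items, pos = [], 0
    while True:
        found = parse_item(st, pos)
        if found is None:
            return items
        name, price, pos = found
        items.append((name, price))

# First item is from the question; the second is invented for illustration.
sample = (
    'data-default-alt="Ford Truck"> <p class="price price-label ">  $1,000.00  </p>'
    'data-default-alt="Chevy Van"> <p class="price price-label ">  $2,500.50  </p>'
)
for name, price in parse_all(sample):
    print("%s %.2f" % (name, price))
```

Passing a start position to find avoids re-slicing the string on every search, and checking for -1 keeps the loop from running forever once no more items are found.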

Upvotes: 0
