Reputation: 613
My goal is to search file.txt for an identifying string and then output the words that follow it between the quotation marks.
The identifier would be data-default-alt= and the name of the item is "Ford Truck" in quotes. I would like to output the name of the item and the price so that I can open the result in Excel.
data-default-alt="Ford Truck"> </h3> </a> </div> <div class="tileInfo"> <div class="swatchesBox--empty"></div> <div class="promo-msg-text"> <span class="calloutMsg-promo-msg-text"></span> </div> <div class="pricecontainer" data-pricetype="Stand Alone"> <p id="price_206019013" class="price price-label "> $1,000.00 </p>
The desired output would be:
Ford Truck 1000.00
I am not sure how to go about this task.
Upvotes: 0
Views: 337
Reputation: 11933
You will probably want more robust regular expressions for matching the cost and the brand, but here is some code to get you started.
import re

html = '<data-default-alt="Ford Truck"></h3></a></div><div class="tileInfo"><div class="swatchesBox--empty"></div><div class="promo-msg-text"> <span class="calloutMsg-promo-msg-text"></span> </div><div class="pricecontainer" data-pricetype="Stand Alone"><p id="price_206019013" class="price price-label ">$1,000.00</p>'

# Capture the text between the quotes that follow data-default-alt=
brand = re.search(r'data-default-alt="(.*?)"', html)
# Capture the amount after the dollar sign, allowing commas and surrounding whitespace
cost = re.search(r'\$\s*([\d,]+\.\d+)\s*</p>', html)
if brand:
    print(brand.group(1))
if cost:
    print(cost.group(1).replace(',', ''))  # drop the thousands separator
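If file.txt contains several items, the same idea can be extended to collect every name/price pair and write them to a CSV file that Excel can open. The following is only a sketch under some assumptions: the pattern ITEM_RE, the helpers extract_items and write_csv, and the column names are made up for illustration, and the pattern assumes each data-default-alt attribute is followed by its own price paragraph (an item without a price would be paired with the next item's price).

```python
import csv
import re

# Hypothetical pattern: an item name, then the nearest following price paragraph.
ITEM_RE = re.compile(
    r'data-default-alt="(?P<name>[^"]*)"'    # item name between the quotes
    r'.*?'                                    # lazily skip the markup in between
    r'class="price[^"]*">\s*\$(?P<price>[\d,]+(?:\.\d+)?)\s*</p>',
    re.DOTALL,                                # let .*? span line breaks
)


def extract_items(html):
    """Return (name, price) tuples; commas are removed from the price."""
    return [
        (m.group("name"), m.group("price").replace(",", ""))
        for m in ITEM_RE.finditer(html)
    ]


def write_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        writer.writerows(rows)


sample = (
    'data-default-alt="Ford Truck"> </h3> </a> </div> <div class="tileInfo"> '
    '<div class="pricecontainer" data-pricetype="Stand Alone"> '
    '<p id="price_206019013" class="price price-label "> $1,000.00 </p>'
)
print(extract_items(sample))  # [('Ford Truck', '1000.00')]
```

With the real file, something like write_csv(extract_items(open("file.txt").read()), "items.csv") would produce the spreadsheet input. For anything beyond a quick one-off, an HTML parser is likely to be more reliable than a regular expression.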
Upvotes: 1
Reputation: 4043
Use the built-in string methods to find the substring's index. For example, "abcdef".find("bc") returns 1, which is the index of the first character of the substring. To parse your string, you could look for the surrounding tags and then extract the needed text using string slicing.
Here is an example of solving your problem, assuming the string to parse is stored in a variable named st:
with open("file.txt") as f:
    st = f.read()  # get the file contents

# Index of the first character of the name: position of the marker plus its length
name_start = st.find('data-default-alt="') + len('data-default-alt="')
# Offset of the closing quote, relative to name_start
name_end = st[name_start:].find('"')
# Slice the string to get the name
name = st[name_start:name_start + name_end]

price_start = st.find('class="price price-label ">') + len('class="price price-label ">')
price_end = st[price_start:].find('</p>')
price = st[price_start:price_start + price_end].strip().rstrip()
The results are in the name and price variables. If you want to work with the price as a number and don't want the dollar sign, add it to the strip arguments (.strip("$ "); see the Python docs for more on that method). You can remove the comma by calling replace(",", "") on the price string, and finally convert the string to a number using float(price).
Note: it may just be the way you pasted the string, but I've added the strip() and rstrip() calls to get rid of whitespace on each end of the price string. strip() alone already trims both ends, so the rstrip() is redundant but harmless.
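The slicing and clean-up steps described in this answer can be wrapped in two small helpers. This is a minimal sketch, not a drop-in solution: extract_between and parse_price are hypothetical names, and the sample string is a shortened version of the snippet from the question. Unlike the code above, it passes a start index to find() and returns None when a marker is missing, since find() returns -1 in that case and the slices would otherwise silently pick up the wrong text.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker,
    or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)           # skip past the marker itself
    end = text.find(end_marker, start)   # search only after the marker
    if end == -1:
        return None
    return text[start:end]


def parse_price(raw):
    """Convert a string such as ' $1,000.00 ' to a float."""
    cleaned = raw.strip().lstrip("$").strip().replace(",", "")
    return float(cleaned)


sample = (
    'data-default-alt="Ford Truck"> </h3> <div class="pricecontainer"> '
    '<p id="price_206019013" class="price price-label "> $1,000.00 </p>'
)
name = extract_between(sample, 'data-default-alt="', '"')
raw_price = extract_between(sample, 'class="price price-label ">', '</p>')
print(name, format(parse_price(raw_price), ".2f"))  # Ford Truck 1000.00
```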
Upvotes: 0