Reputation: 564
I used a web crawler to get some data and stored it in a variable price. The type of price is:
<class 'bs4.element.NavigableString'>
The type of each element of price is:
<type 'unicode'>
Basically, price contains some white space and line feeds followed by $520. I want to strip all the extra symbols and keep only the number 520. I already wrote a naive solution:
def reducePrice(price):
    key=0
    string=""
    for i in price:
        if (key==1):
            string=string+i
        if (i== '$'):
            key=1
    key=0
    return string
But I want to implement a more elegant solution: converting price to str and then using str methods to manipulate it. I have already searched the web and other posts on the forum. The best I could find was that using:
p = "".join(price)
I can generate one big unicode variable. If you can give me a hint I would be grateful (I'm using Python 2.7 on Ubuntu).
Edit: I am adding my spider in case you need it:
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://www.lider.cl/walmart/catalog/product/productDetails.jsp?cId=CF_Nivel2_000021&productId=PROD_5913&skuId=5913&pId=CF_Nivel1_000004&navAction=jump&navCount=12"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        title = ""
        price = ""
        for link in soup.findAll('span', {'itemprop': 'name'}):
            title = link.string
        for link in soup.find('em', {'class': 'oferLowPrice fixPriceOferUp '}):
            price = link.string
        print(title + '='+ str(reducePrice(price)))
        page += 1

spider(1)
Edit 2: Thanks to Martin and mASOUD, I was able to write a solution using str methods:
def reducePrice(price):
    return int((("".join(("".join(price)).split())).replace("$","")).encode())
This method returns an int. This was not my original question, but it was the next step in my project. I added it because I was not able to cast the unicode value to int directly in my case, whereas using encode() to generate a str first worked.
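For comparison, the same cleanup can probably be written with fewer steps. This is only a minimal sketch: the name reduce_price_simple and the sample input are made up for illustration, and it assumes the value has the shape described above (whitespace and line feeds around a dollar sign and plain digits). It relies on strip() and replace() being available on unicode objects, and on int() accepting a string of digits directly.

```python
def reduce_price_simple(price):
    # Hypothetical helper, not part of the original post.
    # strip() removes the surrounding whitespace and line feeds,
    # replace() drops the dollar sign, leaving only the digits.
    cleaned = price.strip().replace("$", "")
    return int(cleaned)

# Sample input shaped like the scraped value described above.
print(reduce_price_simple(u" \n  $520 \n"))
```

If the price ever contains other characters (for example a thousands separator), int() will raise a ValueError, so this sketch only covers the simple case.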
Upvotes: 1
Views: 471
Reputation: 59671
Use a RegEx to extract the price from your Unicode string:
import re

def reducePrice(price):
    # Find the first run of digits in the scraped value, e.g. u' $520 '.
    match = re.search(r'\d+', price)
    price = match.group() # returns e.g. u"520"
    price = str(price) # convert the unicode digits to a byte string (str in Python 2)
    return price
Even though this function converts Unicode to a "regular" string as you asked, is there any reason you want this? Unicode strings can be worked with the same way as regular strings. That is, u"500" behaves almost the same as "500".
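To illustrate that last point, here is a small sketch that skips the str() conversion and passes the matched Unicode digits straight to int(). The name extract_price and the sample inputs are hypothetical, and returning None when no digits are found is a choice made for this example only.

```python
import re

def extract_price(text):
    # Hypothetical helper: return the first run of digits as an int,
    # or None when the text contains no digits at all.
    match = re.search(r'\d+', text)
    if match is None:
        return None
    # int() accepts the matched unicode string directly.
    return int(match.group())

print(extract_price(u" \n $520 \n"))
print(extract_price(u"no price here"))
```

Checking for None avoids the AttributeError that match.group() would raise if the scraped text had no digits.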
Upvotes: 2