Reputation: 3
Took the advice and I was able to get past the original error, thank you all so much :) I'm almost where I want to be. It seems I still have a massive knowledge gap when it comes to indenting. You guys are truly a gem to the coding community.
Here is the current code. It gets past those errors and is down to a warning, but it is not extracting anything.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://dc.urbanturf.com/pipeline'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
pipeline_items = soup.find_all('div', attrs={'class': 'pipeline-item'})
rows = []
columns = ['Listing Title', 'Listing url', 'listing image url', 'location', 'Project type', 'Status', 'Size']
for item in pipeline_items:
    # title, image url, listing url
    listing_title = item.a['title']
    listing_url = item.a['href']
    listing_image_url = item.a.img['src']
    for p_tag in item.find_all('p'):
        if not p_tag.h2:
            if p_tag.text == 'Location:':
                p_tag.span.extract()
                property_location = p_tag.text.strip()
            elif p_tag.span.text == 'Project type:':
                p_tag.span.extract()
                property_type = p_tag.text.strip()
            elif p_tag.span.text == 'Status:':
                p_tag.span.extract()
                property_status = p_tag.text.strip()
            elif p_tag.span.text == 'Size:':
                p_tag.span.extract()
                property_size = p_tag.text.strip()
    row = [listing_title, listing_url, listing_image_url, property_location, property_type, property_status, property_size]
    rows.append(row)
df = pd.Dataframe(rows, columns=columns)
df.to_excel('DC Pipeline Properties.xlsx', index=False)
print('File Saved')
The error I get is the following. I'm using PyCharm 2020.2, maybe it's a bad choice?
row = [listing_title, listing_url, listing_image_url, property_location, property_type, property_status, property_size]
NameError: name 'property_location' is not defined
Upvotes: 0
Views: 1717
Reputation: 3
Mission accomplished thanks to everyone here, cheers! A few things I was missing:
1. Indenting, for sure.
2. I was missing a span on the first subsection: if p_tag.span.text == 'Location:':
3. I was missing the openpyxl package, which is called at the bottom to write to Excel.
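On the third point, a quick way to check for the package before running the whole scrape could look like the sketch below. The helper name has_module is made up for illustration; importlib.util.find_spec is from the standard library. It assumes openpyxl is the engine pandas will use for the .xlsx file, as was the case here.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical pre-flight check before calling df.to_excel(...)
if not has_module("openpyxl"):
    print("openpyxl is missing; install it with: pip install openpyxl")
```

Running this first avoids scraping the whole page only to fail on the very last step.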
100% working code below, and my promise to get better and help out when I can :)
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://dc.urbanturf.com/pipeline'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
pipeline_items = soup.find_all('div', attrs={'class': 'pipeline-item'})
rows = []
columns = ['listing title', 'listing url', 'listing image url', 'location', 'Project type', 'Status', 'Size']
for item in pipeline_items:
    # title, image url, listing url
    listing_title = item.a['title']
    listing_url = item.a['href']
    listing_image_url = item.a.img['src']
    for p_tag in item.find_all('p'):
        if not p_tag.h2:
            if p_tag.span.text == 'Location:':
                p_tag.span.extract()
                property_location = p_tag.text.strip()
            elif p_tag.span.text == 'Project type:':
                p_tag.span.extract()
                property_type = p_tag.text.strip()
            elif p_tag.span.text == 'Status:':
                p_tag.span.extract()
                property_status = p_tag.text.strip()
            elif p_tag.span.text == 'Size:':
                p_tag.span.extract()
                property_size = p_tag.text.strip()
    row = [listing_title, listing_url, listing_image_url, property_location, property_type, property_status, property_size]
    rows.append(row)
df = pd.DataFrame(rows, columns=columns)
df.to_excel('DC Pipeline Properties.xlsx', index=False)
print('File Saved')
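One thing the code above still relies on is every listing having all four labels. If a listing lacks one, the property_* variable either does not exist yet (the NameError from before) or still holds the previous listing's value. A possible way around that, sketched here on plain (label, value) pairs instead of real tags, is to start each item from fresh defaults. The names FIELDS and build_details and the sample values are made up for illustration; they stand in for the span text and the remaining p text.

```python
# Hypothetical mapping from the label in the <span> to the output column.
FIELDS = {
    'Location:': 'location',
    'Project type:': 'Project type',
    'Status:': 'Status',
    'Size:': 'Size',
}


def build_details(pairs):
    """Map (label, value) pairs onto the known columns, defaulting to None."""
    # A fresh dict per item, so nothing leaks over from the previous listing.
    details = {column: None for column in FIELDS.values()}
    for label, value in pairs:
        column = FIELDS.get(label.strip())
        if column is not None:
            details[column] = value.strip()
    return details


# Made-up sample data: the second listing has no location at all.
first = build_details([('Location:', ' 123 Example St NW '), ('Status:', 'Planned')])
second = build_details([('Size:', '200 units')])
print(first)
print(second)
```

With this shape, a missing label shows up as an empty cell in the spreadsheet rather than as a crash or a silently repeated value.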
Upvotes: 0
Reputation: 36
Line 17 and below need to be inside the for loop for 'item' to be seen.
for item in pipeline_items:
    # title, image url, listing url
    listing_title = item.a['title']
    listing_url = item.a['href']
    listing_image_url = item.a.img['src']
for p_tag in item.find_all('p'):  # <------------ Indent this for loop to be inside the previous for loop.
    if not p_tag.h2:
        if p_tag.text == 'Location:':
Upvotes: 0
Reputation: 3648
The problem is that
pipeline_items = soup.find_all('div', attrs={'class': 'pipline-item'})
returns an empty list. As a result, the body of:
for item in pipeline_items:
never actually runs, so item is never defined.
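To see the effect in isolation, here is a minimal sketch with a hypothetical helper last_seen (not part of the question's code). It only shows that a loop variable does not exist after looping over an empty list.

```python
def last_seen(items):
    """Return the loop variable after iterating; fails if items is empty."""
    for item in items:
        pass
    return item  # 'item' was never bound if the loop body never ran


print(last_seen(['a', 'b']))  # the loop ran, so 'item' holds the last element

try:
    last_seen([])
except NameError as exc:
    # Inside a function this is UnboundLocalError, a subclass of NameError;
    # at module level, as in the question, it is a plain NameError.
    print(type(exc).__name__, exc)
```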
I'm not sure exactly what you're trying to do, but I see two solutions:
1. Indent
for p_tag in item.find_all('p'):
so that you execute it for every item. This way, if there are no items, it's not called (I think this is what you intended to do originally?)
2. Check that item exists, and skip the loop if it doesn't. This most closely copies what your code is currently doing, but I don't think that's what you want it to do.
Upvotes: 1
Reputation: 164
It seems to me that your second for loop, for p_tag in item.find_all('p'):, is outside the body of the first for loop that iterates over the items. Add to that the fact that the first loop might have 0 items, in which case item is never assigned at all.
Just put that for loop and its contents inside the for loop that iterates over the items in pipeline_items.
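A small sketch of the difference, using a made-up dict in place of the scraped items (the names listings, inner_loop_outside and inner_loop_nested are for illustration only):

```python
# Hypothetical stand-in: each listing maps to the text of its <p> tags.
listings = {'listing A': ['p1', 'p2'], 'listing B': ['p3']}


def inner_loop_outside(items):
    """Second loop placed after the first one, as in the question."""
    seen = []
    for name in items:
        pass  # the outer loop finishes before the inner one starts
    for p in items[name]:  # only the last 'name' is still around here
        seen.append((name, p))
    return seen


def inner_loop_nested(items):
    """Second loop indented so that it runs once per item."""
    seen = []
    for name in items:
        for p in items[name]:
            seen.append((name, p))
    return seen


print(inner_loop_outside(listings))
print(inner_loop_nested(listings))
```

The first version only ever looks at the last listing (and fails outright if there are none), while the nested version visits every paragraph of every listing.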
Upvotes: 1