Reputation: 491
I saved the HTML of this page manually (CTRL+S) and parsed the saved file with the following code:
from bs4 import BeautifulSoup
with open('/content/drive/My Drive/Colab Notebooks/Projects/20200710_StreetEasy_WebScraping/a.mhtml') as f:
    contents = f.read()
# parser
soup = BeautifulSoup(contents, 'html')  # 'lxml-xml', 'lxml', 'html5lib', 'html'
print(soup)
The output is in a single line:
<!-- saved from url=(0143)https://streeteasy.com/for-sale/nyc/area:112,115,110,103,117,104,158,113,116,108,109,162,107,106,105,157,121,120,123,122,124,143,141,137?page=2 --><html><head><meta content="text/html; ch
Finding all the a tags works:
a=soup.find_all('a')
a
[<a class='3D"html-attribute-value' href='=3D"https://cdn-assets-s3.streeteasy.com/assets/manifest-c93475b02bd2409b4a=' html-resource-link="" noop='ener"' rel='3D"noreferrer' target='3D"_blank"'>//cdn-assets-s3.streeteasy.com/assets/manifest-c93475b02bd2409b4a52e2=
1af023e5d5f489f19500d234a3660fe4d35069bbac.json</a>,
<a class='3D"html-attrib=' href='3D"https://browser.sen=' html-resource-link="" noopener="" rel='3D"noreferrer' target='3D"_blank"' try-cdn.com="" ute-value="">https://brows=
er.sentry-cdn.com/5.19.0/bundle.min.js</a>,
...
But searching for div, script, meta, and so on returns nothing:
div=soup.find_all('div')
div
[]
Is this a parsing problem?
Upvotes: 0
Views: 409
Reputation: 17368
I opened the page from the question in a browser, opened View Source, and copied the HTML into a file (html.html).
Link - view-source:https://streeteasy.com/for-sale/nyc/area:112,115,110,103,117,104,158,113,116,108,109,162,107,106,105,157,121,120,123,122,124,143,141,137?page=2
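Copying from View Source also sidesteps what looks like the cause of the empty results in the question. The file opened there is an .mhtml archive, and the `=3D` sequences and trailing `=` signs in its output are typical of quoted-printable encoding, so the parser is probably being fed still-encoded MIME text rather than plain HTML. Below is a minimal sketch of decoding such a file with Python's standard `email` package. The helper name `html_from_mhtml` and the sample bytes are made up for illustration, and a real browser-saved file may contain several parts.

```python
import email
from email import policy


def html_from_mhtml(raw_bytes):
    """Return the first text/html part of an MHTML document as a decoded string."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (here quoted-printable)
            # and decodes the bytes with the charset declared on the part.
            return part.get_content()
    raise ValueError("no text/html part found")


# Tiny hand-written sample standing in for a browser-saved .mhtml file
# (hypothetical content, not taken from the real page).
sample = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<html><body><div class=3D"listing">15 East 30th St=\r\n'
    b"reet</div></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = html_from_mhtml(sample)
print(html)
```

For the real file, one would presumably read it in binary mode (`open(path, "rb")`) and pass the returned string to BeautifulSoup as before.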
The listing information is embedded in the page as JSON, inside a script tag of type application/ld+json, so I extracted it from there:
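from bs4 import BeautifulSoup
import json

# Read the HTML copied from View Source
with open("html.html") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")
# The listings live in a JSON-LD script tag
json_text = soup.find("script", {"type":"application/ld+json", "async":"async"}).text.strip()
# Start one character before the first "{" and drop the last 6 characters of the script text
json_obj = json.loads(json_text[json_text.index("{")-1:-6])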
Output:
[{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$3,475,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '15 East 30th Street',
'postalCode': '10016',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/2/381345902.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$849,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '463 West 57th Street',
'postalCode': '10019',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/55/394819655.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$1,475,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '160 West 66th Street',
'postalCode': '10023',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/7/396195007.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$2,799,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '470 West 24th Street',
'postalCode': '10011',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/25/396194325.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$795,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '420 East 55th Street',
'postalCode': '10022',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/29/396194129.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$816,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '258 West 93rd Street',
'postalCode': '10025',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/34/396194034.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$849,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '464 West 44th Street',
'postalCode': '10036',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/96/396192696.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$1,495,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '310 West 52nd Street',
'postalCode': '10019',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/45/396191645.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$2,725,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '50 Riverside Boulevard',
'postalCode': '10069',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/48/396190448.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$1,298,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '325 Fifth Avenue',
'postalCode': '10016',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/31/396187231.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$670,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '303 East 57th Street',
'postalCode': '10022',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/7/396187207.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$629,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '520 East 76th Street',
'postalCode': '10021',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/50/396186150.jpg'}},
{'@context': 'http://schema.org',
'@type': 'ApartmentComplex',
'additionalProperty': {'@type': 'PropertyValue', 'value': '$20,500,000'},
'address': {'@type': 'PostalAddress',
'addressRegion': 'NY',
'addressLocality': 'Manhattan',
'streetAddress': '435 Broome Street',
'postalCode': '10013',
'addressCountry': {'@type': 'Country', 'name': 'USA'}},
'photo': {'@type': 'CreativeWork',
'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/98/396186098.jpg'}}]
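Once the data is a list of dictionaries like the one above, pulling out the fields of interest is plain dictionary access. The following is a minimal sketch using two records copied from that output. The helper names `parse_price` and `to_row` are made up for illustration, and treating `additionalProperty.value` as the asking price is an assumption based on how the values look.

```python
# Two records copied from the output above; in practice this would be json_obj.
listings = [
    {
        "@context": "http://schema.org",
        "@type": "ApartmentComplex",
        "additionalProperty": {"@type": "PropertyValue", "value": "$3,475,000"},
        "address": {
            "@type": "PostalAddress",
            "addressRegion": "NY",
            "addressLocality": "Manhattan",
            "streetAddress": "15 East 30th Street",
            "postalCode": "10016",
            "addressCountry": {"@type": "Country", "name": "USA"},
        },
        "photo": {
            "@type": "CreativeWork",
            "image": "https://cdn-img-feed.streeteasy.com/nyc/image/2/381345902.jpg",
        },
    },
    {
        "@context": "http://schema.org",
        "@type": "ApartmentComplex",
        "additionalProperty": {"@type": "PropertyValue", "value": "$849,000"},
        "address": {
            "@type": "PostalAddress",
            "addressRegion": "NY",
            "addressLocality": "Manhattan",
            "streetAddress": "463 West 57th Street",
            "postalCode": "10019",
            "addressCountry": {"@type": "Country", "name": "USA"},
        },
        "photo": {
            "@type": "CreativeWork",
            "image": "https://cdn-img-feed.streeteasy.com/nyc/image/55/394819655.jpg",
        },
    },
]


def parse_price(text):
    """Turn a price string such as '$3,475,000' into an integer number of dollars."""
    return int(text.replace("$", "").replace(",", ""))


def to_row(listing):
    """Flatten one nested listing record into a single-level dictionary."""
    address = listing["address"]
    return {
        "street": address["streetAddress"],
        "locality": address["addressLocality"],
        "postal_code": address["postalCode"],
        # Assumption: additionalProperty.value holds the asking price.
        "price": parse_price(listing["additionalProperty"]["value"]),
        "image": listing["photo"]["image"],
    }


rows = [to_row(item) for item in listings]
for row in rows:
    print(row["street"], row["postal_code"], row["price"])
```

Flat rows like these can be written out with the `csv` module or loaded into a DataFrame without further reshaping.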
Upvotes: 1