Reputation: 1817
I am trying to scrape web pages about flats in Prague and build a dataframe with one row per flat, showing the number of rooms, price, coordinates, etc.
I am able to perform basic scraping, but I end up with a list that I cannot filter properly.
Is my approach a good one? Any advice is welcome.
import requests
import pandas as pd
a = []
numberOfPages = 3
for page in range(numberOfPages + 1):
    url = "https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=10&page="+str(page)+"&per_page=1&tms=1583500044717"
    print(url)
    resp = requests.get(url)
    a.append(resp.json())
a[0]['_embedded']["estates"]
From the list a I would like to create a data frame, but simply calling pd.DataFrame(a) returns a data frame that still has lists nested inside it.
Is there a better way to perform the scraping and then create a dataframe with characteristics such as number of rooms, price, coordinates, etc.?
Upvotes: 1
Views: 86
Reputation: 10184
You're on the right track. You can extend your code with this to get a dataframe:
# for older versions of pandas import json_normalize like so:
# from pandas.io.json import json_normalize
# use this for pandas version 1.x
from pandas import json_normalize
frames = []
for idx in range(len(a)):
    for estate in a[idx]["_embedded"]["estates"]:
        frames.append(json_normalize(estate))
df_estates = pd.concat(frames)
df_estates.info()
Output:
Int64Index: 20 entries, 0 to 0
Data columns (total 96 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   labelsReleased                           20 non-null     object
 1   has_panorama                             20 non-null     int64
 2   labels                                   20 non-null     object
 3   is_auction                               20 non-null     bool
 4   labelsAll                                20 non-null     object
 5   category                                 20 non-null     int64
 6   has_floor_plan                           20 non-null     int64
 7   paid_logo                                20 non-null     int64
 8   locality                                 20 non-null     object
 9   has_video                                20 non-null     bool
 10  new                                      20 non-null     bool
 11  auctionPrice                             20 non-null     float64
 12  type                                     20 non-null     int64
 13  hash_id                                  20 non-null     int64
 14  attractive_offer                         20 non-null     int64
 15  price                                    20 non-null     int64
 16  rus                                      20 non-null     bool
 17  name                                     20 non-null     object
 18  region_tip                               20 non-null     int64
 19  has_matterport_url                       20 non-null     bool
 20  seo.category_main_cb                     20 non-null     int64
 21  seo.category_sub_cb                      20 non-null     int64
 22  seo.category_type_cb                     20 non-null     int64
 23  seo.locality                             20 non-null     object
 24  _embedded.favourite.is_favourite         20 non-null     bool
 25  _embedded.favourite._links.self.profile  20 non-null     object
 26  _embedded.favourite._links.self.href     20 non-null     object
 27  _embedded.favourite._links.self.title    20 non-null     object
 28  _embedded.note.note                      20 non-null     object
 29  _embedded.note._links.self.profile       20 non-null     object
 30  _embedded.note._links.self.href          20 non-null     object
 31  _embedded.note._links.self.title         20 non-null     object
...
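Note that the index in the output runs "0 to 0": every single-row frame keeps its own index 0, so the concatenated frame has repeated index labels. A possible variation, sketched below, is to gather all estate records into one flat list first and call json_normalize once, which gives a clean 0..n-1 index, and then keep only the columns you care about. The list a here is a made-up stand-in that only mimics the nesting of the real responses (the names, prices and localities are invented), and pd.json_normalize assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical stand-in for the list `a` of parsed API responses.
# The values are invented; only the nesting mirrors the real data.
a = [
    {"_embedded": {"estates": [
        {"name": "Flat 2+kk", "price": 5000000, "locality": "Praha 5",
         "seo": {"locality": "praha-smichov"}},
    ]}},
    {"_embedded": {"estates": [
        {"name": "Flat 3+1", "price": 7500000, "locality": "Praha 2",
         "seo": {"locality": "praha-vinohrady"}},
    ]}},
]

# Collect every estate record from every page into one flat list.
estates = [estate for page in a for estate in page["_embedded"]["estates"]]

# A single json_normalize call flattens nested dicts into dotted
# column names and assigns a fresh 0..n-1 index.
df_estates = pd.json_normalize(estates)

# Keep only the columns of interest (all four appear in the info() output above).
df_small = df_estates[["name", "price", "locality", "seo.locality"]]
print(df_small)
```

If you prefer to keep the loop with pd.concat, passing ignore_index=True to pd.concat should have the same effect on the index.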
Upvotes: 1