Reputation: 1
I am working on a web crawler project using Selenium with Python 3.6 in a Jupyter Notebook.
The goal is to grab the reviews of an app along with their corresponding dates and ratings.
The target webpage is the Google Play page given as url in the code below.
I can get the reviews, but I failed to grab their dates and ratings.
The code I used to grab the reviews is shown below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = 'https://play.google.com/store/apps/details?id=io.silvrr.installment&hl=en_US&gl=US&showAllReviews=true'
path = r"D:\Term 1\Cloud Computing\Session 4-5\CC_ETL\chromedriver"
browser = webdriver.Chrome(path)  # Open a Chrome browser
browser.get(url)
content = browser.find_elements_by_class_name('UD7Dzf')
User_Review = []  # Create an empty list to store the user reviews
for i in range(0, 40):
    first = {'Review -{}'.format(i+1): content[i].text}
    User_Review.append(first)
User_Review
I then tried to grab the dates with the class name "p2TkOb":
date = browser.find_elements_by_class_name('p2TkOb')
but it failed to return the specific dates. Instead, it returned some web elements shown below:
[<selenium.webdriver.remote.webelement.WebElement (session="b0e4c4f85a982c03a64302931a3474d1", element="dd95dfb2-c8da-47fc-8d5e-16a0ce010db7")>,
<selenium.webdriver.remote.webelement.WebElement (session="b0e4c4f85a982c03a64302931a3474d1", element="7fd1ab9a-b654-4ec0-b6c7-e8bd5260d0db")>,
Also, there are two kinds of dates, one for the users and another for the developer, and they share the same class name. However, I only want to grab the dates of the user reviews.
I also had trouble locating the rating element, for example, div aria-label="Rated 1 stars out of five stars".
Can anybody please help me? Thanks a lot!
Upvotes: 0
Views: 345
Reputation: 142631
It is not a good idea to search for content and date separately.
You should get the div that holds both content and date, and then use a relative find_element_by_... call to get the content and date inside this div. This way you get the date that belongs to this content, and you can control it.
I use 'd15Mdf bAhLNe' to get the divs that hold both content and date (every div groups everything in one review). Then I search for content and date inside every div separately, using item.find... instead of browser.find..., so I get a single content and a single date, and I'm sure that this date is for this content.
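The idea can be tried without a browser. Below is a minimal sketch that uses the standard library's `xml.etree.ElementTree` on a small made-up snippet. The markup, the review texts, and the developer-reply block are assumptions for illustration; only the class names come from the question. It also assumes the user's date appears before the developer's date inside a card, so taking the first match per card gives the user's date.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the class names come from the real page.
PAGE = """
<div>
  <div class="d15Mdf bAhLNe">
    <span class="p2TkOb">January 9, 2021</span>
    <span class="UD7Dzf">First review</span>
    <div class="reply"><span class="p2TkOb">January 10, 2021</span></div>
  </div>
  <div class="d15Mdf bAhLNe">
    <span class="p2TkOb">February 3, 2021</span>
    <span class="UD7Dzf">Second review</span>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# Global search: user dates and developer dates come back mixed together.
all_dates = [e.text for e in root.findall(".//span[@class='p2TkOb']")]

# Per-card search: find() returns the first match inside this card only.
reviews = []
for card in root.findall(".//div[@class='d15Mdf bAhLNe']"):
    reviews.append({
        "date": card.find(".//span[@class='p2TkOb']").text,
        "review": card.find(".//span[@class='UD7Dzf']").text,
    })

print(all_dates)
print(reviews)
```

The global list holds three dates for two reviews, so it cannot be matched to the reviews by position, while the per-card list holds exactly one date per review.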
Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = 'https://play.google.com/store/apps/details?id=io.silvrr.installment&hl=en_US&gl=US&showAllReviews=true'
path = r"D:\Term 1\Cloud Computing\Session 4-5\CC_ETL\chromedriver"
#browser = webdriver.Chrome(path)  # Open a Chrome browser
browser = webdriver.Firefox()
browser.get(url)
cards = browser.find_elements_by_class_name('d15Mdf.bAhLNe') # it has to be `dot` in place of `space`
user_reviews = [] # Create an empty list to store the user reviews
for number, item in enumerate(cards, 1):
    review = item.find_element_by_class_name('UD7Dzf').text
    date = item.find_element_by_class_name('p2TkOb').text
    user = item.find_element_by_class_name('X43Kjb').text
    rated = item.find_element_by_xpath('.//div[@class="pf5lIe"]/div').get_attribute('aria-label').split(' ')[1]
    user_reviews.append({
        'number': number,
        'review': review,
        'date': date,
        'user': user,
        'rated': rated,
    })
for item in user_reviews:
    print('---', item['number'], '---')
    print('date:', item['date'])
    print('user:', item['user'])
    print('rated:', item['rated'])
    print('review:', item['review'][:50], '...')
Result:
--- 1 ---
date: January 9, 2021
user: NICASIO NICOSIA
rated: 1
review: I used to love this app because of the payment ter ...
--- 2 ---
date: February 3, 2021
user: Shitta Soewarno
rated: 1
review: After I updated the application I could no longer ...
--- 3 ---
date: January 25, 2021
user: Jennifer Jones
rated: 2
review: Used to love this alot and i always pay back on ti ...
--- 4 ---
date: February 5, 2021
user: Noy D Junior
rated: 3
review: Applications that helped out so much. But now I'm ...
--- 5 ---
date: January 31, 2021
user: Syuhada Husni
rated: 1
review: it suddenly logged me out and now i cannot log in ...
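The rated value in the code above comes from `split(' ')[1]`, which works for the label shown in the question but depends on the number being the second word and returns a string. A slightly more defensive sketch, assuming the label always has the digits right after "Rated" (the helper name `parse_rating` is made up for illustration), uses a regular expression and returns an integer:

```python
import re

def parse_rating(aria_label):
    """Return the star rating as an int, or None if no number is found."""
    match = re.search(r"Rated\s+(\d+)", aria_label)
    return int(match.group(1)) if match else None

print(parse_rating("Rated 1 stars out of five stars"))
print(parse_rating("no rating here"))
```

Returning None instead of raising makes it easier to skip a card whose label has an unexpected form.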
BTW: Selenium converts find_element_by_class_name('name') to a CSS selector with a dot at the beginning, .name, but it has a problem with multiple names, as in find_element_by_class_name('name1 name2'). It should put a dot before every name and create .name1.name2, but it adds a dot only at the beginning, .name1 name2, so I manually add a dot between the names in
find_elements_by_class_name('d15Mdf.bAhLNe')
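A small helper can build the selector instead of typing the dots by hand. This is only a sketch; `classes_to_css` is a hypothetical name and not part of Selenium:

```python
def classes_to_css(class_attr):
    """Turn a class attribute such as 'name1 name2' into '.name1.name2'."""
    return "".join("." + name for name in class_attr.split())

print(classes_to_css("d15Mdf bAhLNe"))  # .d15Mdf.bAhLNe
```

The result is already a full CSS selector, so it belongs in a CSS-based lookup such as `find_elements_by_css_selector`. Passing it to `find_elements_by_class_name` would put a second dot in front.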
BTW: It is not a good idea to create unique keys such as Review-1, Review-2, because later it is a problem to get a review: you have to know what number to use in Review-{}. It is better to use the same key, review, in all items.
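A short sketch of the difference, using made-up review texts:

```python
# Unique keys: the reader must rebuild each key to get the text back.
unique = [{"Review -1": "good"}, {"Review -2": "bad"}]
texts_unique = [item["Review -{}".format(i)] for i, item in enumerate(unique, 1)]

# Same key everywhere: one plain lookup works for every item.
same = [{"review": "good"}, {"review": "bad"}]
texts_same = [item["review"] for item in same]

print(texts_unique == texts_same)
```

Both lists hold the same texts, but the second form does not need the position of the item to read it.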
BTW: The xpath starts with a dot (.//div) to create a relative xpath that searches only inside item.
I put it on GitHub: furas/python-examples
Upvotes: 1