pandalai

Reputation: 426

Python3 Selenium Issue

I want to scrape some comments from a web page. When I try to click the goto button (to change to the next page) via Selenium, the page always shows a pop-up window. I have tried to close the pop-up window using Selenium, but it still doesn't work. Could someone help me fix this issue and complete the next_page() function below? Many thanks!

I have already completed the function scrape_comments(). What remains is the function next_page().

Here is my code.

import random

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

# url
url = "https://hotels.ctrip.com/hotel/347422.html?isFull=F#ctm_ref=hod_sr_lst_dl_n_1_8"

# User Agent
User_Agent_List = ["Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2",
                   "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
                   "Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)",
                   "Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11",
                   "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2",
                   "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1",
                   "Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25"]

# Define the related lists
Score = []
Travel_Types = []
Room_Types = []
Travel_Dates = []
Comments = []

DEFINE_PAGE = 10

def next_page():
    """
    It is a function to execute Next Page function
    """
    current_page = int(browser.find_element_by_css_selector('a.current').text)

    # First, clear the input box
    browser.find_element_by_id("cPageNum").clear()
    print('Clear the input page')

    # Second, input the next page
    nextPage = current_page + 1
    print('Next page ',nextPage)
    browser.find_element_by_id("cPageNum").send_keys(nextPage)
    
    # Third, press the goto button
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="cPageBtn"]')))
    browser.find_element_by_xpath('//*[@id="cPageBtn"]').click()


def scrape_comments():
    """
    It is a function to scrape User comments, Score, Room types, Dates.
    """
    html = browser.page_source
    soup = BeautifulSoup(html, "lxml")
    scores_total = soup.find_all('span', attrs={"class":"n"})
    # We only want [0], [2], [4], ...
    travel_types = soup.find_all('span', attrs={"class":"type"})
    room_types = soup.find_all('a', attrs={"class":"room J_baseroom_link room_link"})
    travel_dates = soup.find_all('span', attrs={"class":"date"})
    comments = soup.find_all('div', attrs={"class":"J_commentDetail"})
    # Save score in the Score list
    for i in range(2,len(scores_total),2):
        Score.append(scores_total[i].string)
    # extend() adds each item; append() would store a single generator object
    Travel_Types.extend(item.text for item in travel_types)
    Room_Types.extend(item.text for item in room_types)
    Travel_Dates.extend(item.text for item in travel_dates)
    Comments.extend(item.text.replace('\n','') for item in comments)

if __name__ == '__main__':

    # Random choose a user-agent
    user_agent = random.choice(User_Agent_List)
    print('User-Agent: ', user_agent)

    # Browser options setting
    options = Options()
    # Firefox takes the user agent as a preference, not a command-line argument
    options.set_preference("general.useragent.override", user_agent)

    # Open a Firefox browser
    browser = webdriver.Firefox(options=options)
    browser.get(url)

    #### My ISSUE #####
    browser.find_element_by_xpath('//*[@id="appd_wrap_close"]').click()

    page = 1
    while page <= DEFINE_PAGE:
        scrape_comments()
        next_page()
        # Advance the counter so the loop stops after DEFINE_PAGE pages
        page += 1
    
    browser.close()

Thanks in advance!

Upvotes: 2

Views: 159

Answers (2)

pandalai

Reputation: 426

Thanks to Peck's guidance, I was able to complete the next_page() function. However, the pop-up window appears to be part of a browser fingerprinting technique used to track users on the web, and we don't yet know how to bypass it. Below is the next_page() I completed.

import random
import time

def next_page(page):
    """
    Change to the page after the given one.
    param: page. # Integer, the page you see right now.
    return: True if the page changed, False otherwise.
    """
    retryNum = 5
    # page is the page you see right now; what you want is the next page.
    # Compute the target once, outside the loop, so retries don't skip pages.
    page = page + 1

    while retryNum >= 0:
        try:
            # Clear
            browser.find_element_by_id("cPageNum").clear()
            # Send keys
            browser.find_element_by_id("cPageNum").send_keys(str(page))
            # Click goto button
            browser.find_element_by_id("cPageBtn").click()
            # Sleep for random seconds as waiting for loading
            time.sleep(random.randint(15, 25))
            # Check current page
            currentPage = int(browser.find_element_by_css_selector('a.current').text)

            if currentPage != page:
                retryNum -= 1
                print('Retry!')
                continue
            return True
        except Exception as e:
            # A bare string after assert is always truthy, so report instead
            print('Failed to change to next page:', e)
            return False
    return False
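
As a variation on the function above, here is a minimal sketch that polls the a.current element instead of sleeping for a fixed 15 to 25 seconds. The name go_to_page and its retries, timeout and poll parameters are my own assumptions, not part of the original code. It takes the browser as an argument and uses the same find_element_by_* calls as above.

```python
import time


def go_to_page(browser, target_page, retries=5, timeout=25.0, poll=0.5):
    """Try to jump to target_page; return True once 'a.current' shows it."""
    for attempt in range(retries):
        # Type the target page number into the input box and press goto
        box = browser.find_element_by_id("cPageNum")
        box.clear()
        box.send_keys(str(target_page))
        browser.find_element_by_id("cPageBtn").click()

        # Poll the highlighted page number until it matches or time runs out
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            current = browser.find_element_by_css_selector("a.current").text
            if current.strip() == str(target_page):
                return True
            time.sleep(poll)
        print("Retry %d for page %d" % (attempt + 1, target_page))
    return False
```

Called as go_to_page(browser, page + 1), it returns as soon as the page number updates, so a fast page load does not cost the full sleep.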

Upvotes: 0

C. Peck

Reputation: 3711

OK, so it really seemed like there must be some window you have to switch into to get Selenium to click on the '//*[@id="appd_wrap_close"]' element, and I tried for a while to find one. Eventually I think I stumbled upon what's preventing you from clicking that element: they have test tracking software in place. Here's how I found that out. First I did the obvious and inspected the 'x' element you were trying to click. I looked for anything unusual about that element, and after a bit I noticed there was an Event attached to it. I clicked on the Event in Firefox's inspector and saw the following:

'//*[@id="appd_wrap_close"]'

Hmm... I expected it to just close the box it is in, but it has the following JavaScript:

function() {
  c.setCookie({
    manualclose: "1"
  }, "", 1), u.collapse(), window.__bfi.push(["_tracklog", "pcfloatClose", location.href + "&urlPageId=" + e + "&htmlType=" + d])
}

Well, there's u.collapse, which I'd guess is all the code needed to collapse the panel. But why all this other stuff? A couple of things seemed odd to me: why does it set a cookie every time you click that button? And why is it called **manual**close? Then I looked a little closer and saw the text following "click": _esUnionOnline/R3/float/floating_normal.min.js?20190316:2. Hm. So they are calling a JavaScript file, and that looks to be a URL. Why are they going to all this trouble for a mouse click event on that little 'x'?

I mouse over it, and sure enough, it shows me https://webresource.c-ctrip.com/ResUnionOnline/R3/float/floating_normal.min.js?20190306:2.

I navigate to that URL and find a large file containing minified JavaScript. I put it through an un-minifier (I used https://unminify.com/). Right at the top of the document I see

document.getElementById("ab_testing_tracker") && "abTestValue_Value" != h ? 
document.getElementById("ab_testing_tracker").value

ab_testing_tracker... that doesn't sound good. So I do a search on that and find a bunch of hidden inputs with the id ab_testing_tracker. At this point I'm pretty convinced they are detecting Selenium and not letting you click that. After a bit of googling on common test tracking methods, I found that, among other things, checking the userAgent is common. I had read that Selenium's default userAgent is just webdriver (as far as I know, WebDriver-controlled browsers actually expose a navigator.webdriver flag rather than changing the userAgent string), so I did a search for that. Sure enough, there are 20 results all in the form of navigator.userAgent, and some that look like

i.test(navigator.userAgent)
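
To see what a page script like that can read from a Selenium-driven browser, something along these lines could be run against the open session. This is only a sketch: the helper name browser_fingerprint is hypothetical, it relies on WebDriver's execute_script, and navigator.webdriver is the flag that WebDriver-controlled browsers are expected to expose. Whether this site actually checks either value is not confirmed.

```python
def browser_fingerprint(browser):
    """Return the values a page script could read to spot automation."""
    return {
        # The string that tests such as i.test(navigator.userAgent) inspect
        "user_agent": browser.execute_script("return navigator.userAgent"),
        # Expected to be true when the browser is driven through WebDriver
        "webdriver": browser.execute_script("return navigator.webdriver"),
    }
```

Printing browser_fingerprint(browser) right after browser.get(url) shows whether the user agent override took effect and whether the automation flag is visible to the page.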

Then I noticed you are using a random, legitimate userAgent so they must have some other way of detecting selenium. I did notice this function

function n() {
    var t, n;
    switch (n = e.ResponseStatus.Errors[0].ErrorCode ? e.ResponseStatus.Errors[0].ErrorCode : "") {
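        case "104":
            t = "验证码输入超时"; // "Verification code entry timed out"
            break;
        case "105":
            t = "验证码输入错误"; // "Verification code entered incorrectly"
            break;
        case "106":
            t = "手机号码不正确"; // "Phone number is incorrect"
            break;
        case "107":
            t = "客户端IP不能为空"; // "Client IP cannot be empty"
            break;
        case "108":
            t = "短信内容不能为空"; // "SMS content cannot be empty"
            break;
        case "109":
            t = "同一号码,两分钟内最多发一次"; // "Same number: at most one send within two minutes"
            break;
        case "110":
            t = "一天内同一手机最多发两次"; // "Same phone: at most two sends per day"
            break;
        case "111":
            t = "一天内同一IP最多发五次"; // "Same IP: at most five sends per day"
            break;
        default:
            t = "短信发送失败,请重新发送" // "SMS failed to send, please resend"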
    }
    return t
}

in their JavaScript and, with the help of Google Translate, found out that the cases are error messages for SMS verification. The last few limit how many messages can be sent per phone number or per IP in a given period, so they are counting repeated requests. Unfortunately I couldn't come up with a real way around this... in Firefox at least.

If you are willing to test in Chrome, that box starts minimized by default (for whatever reason) so you don't have to worry about getting rid of it.

So, long story short, if you can test in Chrome, you can simply delete the line that clicks the appd_wrap_close element and not worry about the stupid box. The test tracker might still be active; I don't know how it works, and I suspect they wrote it themselves, as I can't find any tool that uses these "ab_testing_tracker" nodes. In fact, a Google search on "ab_testing_tracker" yields few results, and most of them were this very website.
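
If the same script has to run in both browsers, one option is to make the close step optional rather than deleting it. A minimal sketch, assuming the close element keeps the id appd_wrap_close; the helper name close_popup_if_present is my own, and catching a broad Exception is a simplification.

```python
def close_popup_if_present(browser, xpath='//*[@id="appd_wrap_close"]'):
    """Click the pop-up's close element if it exists and is visible.

    Returns True if a click was made, False otherwise.
    """
    # find_elements_* returns an empty list instead of raising on no match
    candidates = browser.find_elements_by_xpath(xpath)
    for element in candidates:
        try:
            if element.is_displayed():
                element.click()
                return True
        except Exception as exc:  # broad on purpose in this sketch
            print("Could not close the pop-up:", exc)
    return False
```

In Chrome, where the box starts minimized, this should simply return False and let the script carry on.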

Let me know what your options are (do you need to use Firefox for some reason?) and if you are able to run the test in Chrome, let me know if it works!

Edit regarding the pagination button: I found that the same thing is true of the button you are trying to click to navigate to the next page. It has an onClick event that also links to a huge minified file with test tracking, so I'm thinking that's why you can't click your button and Selenium never gets past the first page.

But the "Next" button does not have a script that it invokes on click. You should be able to click that button with

browser.find_element_by_xpath('//*[@id="divCtripComment"]/div[4]/div/a[2]').click()
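
Put into a loop, that could look roughly like the sketch below. The function name scrape_pages and its pause parameter are my own assumptions; scrape stands for your scrape_comments function, and the XPath is the one above, which may break if the page layout changes. In a real run the pause (or an explicit wait) needs to be long enough for the next page of comments to load.

```python
import time

NEXT_BUTTON_XPATH = '//*[@id="divCtripComment"]/div[4]/div/a[2]'


def scrape_pages(browser, scrape, max_pages, pause=0.0):
    """Call scrape() on up to max_pages pages, clicking Next in between.

    Returns the number of pages scraped.
    """
    scraped = 0
    while scraped < max_pages:
        scrape()
        scraped += 1
        if scraped == max_pages:
            break
        try:
            browser.find_element_by_xpath(NEXT_BUTTON_XPATH).click()
        except Exception as exc:  # e.g. the button is missing on the last page
            print("Stopping, could not click Next:", exc)
            break
        # Give the next page of comments time to load
        time.sleep(pause)
    return scraped
```

With your code this would be called as scrape_pages(browser, scrape_comments, DEFINE_PAGE, pause=5).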

Let me know if that works for you.

Upvotes: 1
