anansi_mph

Reputation: 1

Log In and Web Scrape with Python 3 when the form uses action='#' and possibly JavaScript

I am trying to use Python 3 to scrape my own data from Ancestry.com using BeautifulSoup and MechanicalSoup, but I am running into a few issues when trying to log in. Here is the form's HTML on Ancestry:

<form action="#" id="signInForm" method="post" class="form formLarge" onsubmit="return false" novalidate="novalidate" data-ui-id="ui1591467547206308">
            <div class="ancGrid">
                <div class="ancCol ancColRow w100">
                    <label id="usernameLabel" for="username" data-error-0="Required" data-error-1="Please enter a minimum of 5 characters for the username/email" data-error-2="Username/email contains invalid characters">
                        Email or Username
                    </label>
                    <input tabindex="1" aria-required="true" class="success required" id="username" maxlength="64" name="username" placeholder="Email Address or Username" type="text" value="" autocorrect="off" autocapitalize="off">
                </div>
                <div class="ancCol ancColRow w100">
                    <label id="passwordLabel" for="password" data-error-0="Required" data-error-1="Please enter a minimum of 5 characters for the password" data-error-2="Password contains invalid characters">
                        Password
                    </label> [event]
  1. The HTML form for the site uses action='#', which I've found means that the inputs are submitted to the current page. Additionally, the browser inspector shows an [event] badge, which it describes as an 'event listener', and I think this implies JavaScript? If so, do I need a separate tool to log in?
  2. BeautifulSoup cannot find the first form (of two forms). The second form, which has action="", does appear.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # specify the URL
    quote_page = 'https://www.ancestry.com/account/signin?'
    # query the website and return the HTML to the variable `page`
    page = urlopen(quote_page)
    
    # parse the HTML using Beautiful Soup and store it in the variable `soup`
    soup = BeautifulSoup(page, 'html.parser')
    len(soup.find_all('form'))  # Out: 1
    
  3. How can I interact with form 1? When I use browser.select_form('form[action="#"]') I get the error LinkNotFoundError. My code:

#pip install mechanicalsoup
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open('https://www.ancestry.com/account/signin?')
print(browser.get_url())

# Attempts to select the sign-in form (action="#", id="signInForm"):
#browser.select_form('form[action="#" id="signInForm"]')
#browser.select_form('form[action="#"]')   # raises LinkNotFoundError
# Selecting the second form (action="") works:
browser.select_form('form[action=""]')


browser['username']='USERNAME'
browser['password']='PASSWORD'

browser.submit_selected()
print(browser.get_url())
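
One way to see why select_form('form[action="#"]') fails is to list the forms present in the HTML the server actually sends, before any JavaScript runs. The sketch below uses only the standard library. `FormLister`, `list_forms`, and `sample_html` are hypothetical names and data made up for illustration, not Ancestry's real page. If the sign-in form is missing from that list, it is probably added to the page by a script, which MechanicalSoup and BeautifulSoup do not execute.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the attributes of every <form> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "form":
            self.forms.append(dict(attrs))


def list_forms(html_text):
    """Return one dict of attributes per <form> found in html_text."""
    parser = FormLister()
    parser.feed(html_text)
    return parser.forms


# Hypothetical stand-in for the raw HTML returned by the server.
sample_html = """
<html><body>
  <form action="" method="post"><input name="q"></form>
  <script>/* a script could insert the sign-in form at run time */</script>
</body></html>
"""

forms = list_forms(sample_html)
print(len(forms))                    # number of forms in the static HTML
print([f.get("id") for f in forms])  # ids of those forms, None if absent
```

To try this on a real page, the text passed to `list_forms` could be `browser.page.decode()` from the MechanicalSoup session above, assuming the page loaded successfully.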

I see a lot of support for mechanize, but that does not work for Python 3. I do not know how to check whether Ancestry.com is using JavaScript, because I can't engage the first form. I am a beginner, so please assume I know nothing, and I won't be offended. (I haven't found a tutorial covering action='#' because that query returns few results.)

(This person used a different strategy to log into Ancestry, but the site has been updated since this code was posted: https://github.com/freeseek/getmydnamatches/blob/master/getmyancestrydna.py. His code is a little too advanced for me at my level.)

Upvotes: 0

Views: 545

Answers (1)

Geraldo Castro

Reputation: 199

Please consider taking a look at requests-html: https://requests.readthedocs.io/projects/requests-html/en/latest/

It's very friendly and has JavaScript support, so it can render pages whose forms are built by scripts.
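
As a rough sketch of that idea, the code below fetches a page with requests-html, lets it run the page's JavaScript, and then lists the forms that exist afterwards. `fetch_rendered_forms` and `has_form_with_id` are hypothetical helper names, and whether the sign-in form on Ancestry shows up this way is an assumption that has not been verified here. Note that `render()` downloads a headless Chromium the first time it is called.

```python
def fetch_rendered_forms(url):
    """Return the attributes of each <form> after JavaScript has run."""
    # Imported here so the helpers below work even without requests-html.
    from requests_html import HTMLSession  # pip install requests-html

    session = HTMLSession()
    response = session.get(url)
    # Execute the page's scripts in headless Chromium and re-parse the result.
    response.html.render()
    return [dict(form.attrs) for form in response.html.find("form")]


def has_form_with_id(forms, form_id):
    """Check whether any form in the list carries the given id attribute."""
    return any(form.get("id") == form_id for form in forms)
```

A possible use would be `forms = fetch_rendered_forms('https://www.ancestry.com/account/signin?')` followed by `has_form_with_id(forms, 'signInForm')`. If that returns True after rendering but the form is absent from the raw HTML, the form is being created by JavaScript.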

Upvotes: 0
