Kev

Reputation: 23

Scraping a website using inputs with Selenium and BeautifulSoup?

I am trying to scrape the Western Union send-money website to get the current "euro-blue" exchange rate for the Argentine peso. Western Union is the only company that gives you the true exchange rate, the one that is also traded on the streets. Look up "Dollar Blue" if you are interested in how a second market for trading currencies developed in Argentina.

My goal is to get the current exchange rate from the euro to the Argentine peso. On the website, you first have to click the Accept button, then type in the name of the country you want to send money to; only after that step can you see the exchange rate.

I first tried it with requests; since that doesn't handle JavaScript, I switched to Selenium and am pretty close now.

My code looks as follows:

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

WesternUnion = 'https://www.westernunion.com/de/en/web/send-money'

# create a new Chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(WesternUnion)

python_button = driver.find_element_by_id('button-fraud-warning-accept')
python_button.click()

time.sleep(0.25)
python_button = driver.find_element_by_id('country')
python_button.click()  # focus the country input
time.sleep(0.15)
text_area = driver.find_element_by_id('country')
text_area.send_keys("Argentina")

soup = BeautifulSoup(driver.page_source, 'lxml')

div = soup.find('div', id="OptimusApp")
div2 = soup.find('div', class_="text-center")

The problem is that the exchange rate is not shown when I drive the page with Python (screenshot: navigated automatically with Python), whereas it is shown when I do exactly the same thing by hand (screenshot: navigated by hand).

I am very new to scraping and Python. Does anyone have a simple solution for this problem?

Upvotes: 2

Views: 767

Answers (3)

dantebarba

Reputation: 1674

Just to add a little bit of information about this: I managed to make it work for a little while by using the westernunion.ru node, which apparently wasn't protected (I could get this information without all these headers). Unfortunately, the westernunion.ru endpoint has been taken down, or at least is not working anymore. So a solution could be to find an API endpoint that is not protected yet.

Upvotes: 0

drzraf

Reputation: 511

The exchange rate comes from https://www.westernunion.com/wuconnect/prices/catalog via a POST request. E.g.:

  • Assuming a $payload variable containing:
{
  "header_request": {
    "version": "0.5",
    "request_type": "PRICECATALOG",
    "correlation_id": "web-x",
    "transaction_id": "web-x"
  },
  "sender": {
    "client": "WUCOM",
    "channel": "WWEB",
    "cty_iso2_ext": "DE",
    "curr_iso3": "EUR",
    "funds_in": "*",
    "send_amount": 300,
    "air_requested": "Y",
    "efl_type": "STATE",
    "efl_value": "CA"
  },
  "receiver": {
    "curr_iso3": "ARS",
    "cty_iso2_ext": "AR",
    "cty_iso2": "AR"
  }
}
  • And assuming an innocent user-agent
  • Then curl -s 'https://www.westernunion.com/wuconnect/prices/catalog' --data-raw "$payload" | jq '.services_groups[0].pay_groups[0] | .fx_rate' would get it.
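A rough Python equivalent of the curl call above, as a sketch only. The function names are hypothetical, the `Content-Type: application/json` header is an assumption (the curl command does not set it), and given the header protection described in this answer the live call in `fetch_fx_rate` is not expected to succeed as-is:

```python
import json
import urllib.request

CATALOG_URL = "https://www.westernunion.com/wuconnect/prices/catalog"


def build_payload(send_amount=300):
    """Return the request body quoted above (EUR sent from DE, ARS received in AR)."""
    return {
        "header_request": {
            "version": "0.5",
            "request_type": "PRICECATALOG",
            "correlation_id": "web-x",
            "transaction_id": "web-x",
        },
        "sender": {
            "client": "WUCOM",
            "channel": "WWEB",
            "cty_iso2_ext": "DE",
            "curr_iso3": "EUR",
            "funds_in": "*",
            "send_amount": send_amount,
            "air_requested": "Y",
            "efl_type": "STATE",
            "efl_value": "CA",
        },
        "receiver": {"curr_iso3": "ARS", "cty_iso2_ext": "AR", "cty_iso2": "AR"},
    }


def extract_fx_rate(catalog):
    """Mirror the jq filter: .services_groups[0].pay_groups[0] | .fx_rate"""
    return catalog["services_groups"][0]["pay_groups"][0]["fx_rate"]


def fetch_fx_rate(user_agent, timeout=10):
    """POST the payload and return the rate (assumes the endpoint answers with JSON)."""
    request = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(build_payload()).encode("utf-8"),
        # Content-Type is an assumption; the User-Agent is whatever "innocent" value you pick.
        headers={"User-Agent": user_agent, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_fx_rate(json.load(response))
```

Keeping `extract_fx_rate` separate from the network call makes it possible to check the parsing against a saved response without hitting the site.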

This used to work (until a couple of weeks ago).

But the endpoint is now protected: it expects a custom set of crypto headers computed in the browser, relying heavily on obfuscated and convoluted JavaScript. Here is what they look like:

X-NYUPe9Cs-a: IExHQTfwEnWwuyWbWjmR2fyBEQW9X9nnqFqIio78zzCKFA78iBDudN=NnOpQd=725d_urqfAN2sKK7UOdTnkCpUqFvQ9TF2nK=M1jDmrMBYy-4iq5kUqSdEN1PjBjEC=Nx742P1np7qAKK8q8qWd5UQIQ8Wqnqx51np7kIavPFenB9dSvnKou0A2nfv7qE-q7k_2EdNyuKffAYxcqbnjnCYIDfe=IKCc8JdPzpDecynafP1fVKq=z2SJCKiaMXu-Dxp2z5CpfznOPcs4WFH2D4C5JTTnDDUQ7vOPFVKnKCdcamPqOnK8wOQb9FYoxWs=Pksn4vmeC5Ia9EoVReH8uj0q_PRu2q522kk-9jnRTYJIP9VWP_50hhxPMds9eX_kAC2DbBnKzy24sICkO7bkkyAT82s5YuKECP=fnzXixxC8=81WX4jqnNBJ_qxbbqV=InUWmKYWimbUaB5qwOCA2iqSXNDw25PmHq8_2XEAx7nTnjkwYS2qvNBa8sAjxxHU8ibNFr_iiZH=4JuS2Q=RJrnTDonA1vFxKe812s-CMJ8HFay0VqrC2kQZVzCV2w0bqZyEuJksehxE22W8-Smd5V5XnvENHFcn72wkeN=boc=PIbv=XYNqEknrCyEX2r8BJvYCipnKdnkohrIvPovqfJMB7emybSTy2Eeu9h9VBrqYMW2NrXb2wc1kxC5WJAFv_cXE_vqsvRqeS-wYJ9vD1Y-1Cvo8RRqkFWAXuq1CBYXndSQ_A1e0aqO7sTB=nyKFd1=rJ4=z15z-qFMEQfy_x=qedJTzvWf8SE9yMqVCYUuSrhMnpEFdeJYiEdX-KS2In0-uZ0zzrn2qn27zY-jo7qkrvrq8V8v2aACd7PFEnMbCyUUUI-MdTcD8nCDiC2yuPOpbUcwID7Y1d=2aIubdAhErSn82C9FnSm9IVj8Z_WHwBvBPCI_o=_2pdRVk0jS5qYb_OjyVrrxqXnZOp9TVnAVnWZOWn798a8qhX-hYuFjJ-z84rzQRo2M70vHAMuNSMT_8yqkrujEr7JcyU2CmY1NKpev0w8R19227=qVqdemsq00nx-UAYz0=UYA2hT2IaqoqRie7Jbzjikb2snnnQynoHUpnYxRVs9ORc7I2MVhqqCVonnVk5Pi1xns2--iqqSKH8Rhium-nRcWurBu=TFiZ-5Qq-_WDiMQ5n7BqmAZkjWZM97MNkqakw8nq9CXav2fq4OqUok997VTOFkP7DEm-W5ckkwInQNMBNqTrK25DnSHRiyP5m5zqh1RjWp48f_9QCO2HiPS9A8j58zoF_8abn0H1qUERd_Cq8-7zqOnkEeAAWCywi18wUD5qfbQd22BJDNq90sMSbNVsJy0P2CBf-hq9fjSCB=uA5y8xT2-CJunFwUCx85ujxiq-bu5BAbSpqUCAXDP8iq02ET5-xRq7CD22n=E4keqVnKpzq2=RUKWP_jDnsiKRn4xxsRM0QYnbCC=m2KjCE9BjJ1nrn8EDvUS52bmaixqosRq5SNOPEHKyrQy8nqI9E9OAMYm5=TpVNvn-oqeDF_-jkcqIdyHqn1QYxaZbn4xVFqIOzQ9eV7A9QbC5zPcPeD=qqpqqK=YxNzKwTSCOnA70SrhiB2r1VkqKuuBJQYZoIC_87Mmuo8znpQnH29fI7Oh99sKO5aoEQIMOrIDwQDZvWqwwH=ZKnnn8T=5o9MTdDkpr472DPdqOEq8Ffii0q00r8OwkZX_oXY2UEKdCaX88zZamSqaY8iZzqiIYdeMjqMFKqVAv-82PxBWQv1Kr1OibYSh0QTp14BqBhEf-WKrVECI_y7517nZa8ndFpjznkfcnY2KufY0iFwnx2zx99iuUbF84nerZH88Rxx=pKBbsjeqJZ-0xZScnrn9hReJ--oh40mcxMXn1V0PzwcMaEACo0dWouDZeZYHViqd9RQAnso2DIF-wI-Pe_q5srKK8nmCZNI2hZqwjzOM7bwF4_4-S=9BzYF
DaYw0SknMJTq9VReaM297ir-CYsdM9VN29TpDRnC=8aQ5o9yXZpEDyfqmJuwzs7N7he8FPrfIdDVK5iaW8Jm8YcHnqnno7EHSqKeTRNuzkeHqcn0u87OX=ByhQMQJ4QacaxqqFVmPqQEHSVbx1PsQDq780PWDKbvK5PBMnZksBZm0VIOHxu_q2xnfPWsixuqaIm2sXn2Jz2yByvdNeT5r2F14zEaiiEFfNqICZ_DHCXpr2K4HURNd5n_vyJTe2UVakZE_9T01W9cFUxBOur0xfN0=h4vmOoUAnwISSDxc5EmAefWviW2PvqevpnnS7YuMPMY5aHi2c2RrP=i-mfPpKzRSHpAn82sJ9izMdWcWq=qI5O_UBm==vFHrFOzHQK8AH9qcRM8=KHpwyoV-b0WzuErxZhZmMV_iKors2JCAeWn-jn-q_Mrqau1Xz88nTBQFO=vnKPfFoqY9Z81KUqyAn2N5dwbnKWHUZh4Ke4OnyOr=22=rKZneB9PmQDUDq=97vOSqqNq=bHNriSf=xT48cXy7AqWOnncwEqwbVcA25ds8O8S0WI9=ipEfIyiiJ7qSMoHY=kn7rwiE94jsVx5n7Syj=m58Fqvi=HCFI0Bwf8byFhWbeJsAK5UaDqchCY5qC9n-OUqmeJHay8OAqm-HQPnP9qBfyd08nini0FsrdvHmru4qA=sK4OKmzcY_wSj8D8D2jBQWHF2avq4UP8-D2Ysh4C_bXXhqmqK9RPyuXRoeC5Oad-FmUXy_5F_r0OKEnrAMC
X-NYUPe9Cs-f: A_v7kP18AQAAbfq9_kCtmTqfX2Eq0otHnwqUQCck5dPjX88Nxz2rTVnAnVxYAcmzs1ScuAA7wH8AADQwAAAAAA==
X-NYUPe9Cs-b: -8qa21q
X-NYUPe9Cs-c: AOBWjv18AQAAqntYtdrBc9F0C0KawiRISfcOH_ruhEoV4NNn-IemnXnq5vi1
X-NYUPe9Cs-d: AAaixIihDKqOocqASZAQjICihCKHpi15Rub4tUEPqzn1Pxi1AAd7zRXqBBDKOTmM_r5nbhq
X-NYUPe9Cs-z: q

This set of headers is only valid for a limited time (no more than 24h AFAICT).

I'm curious whether anyone can further pinpoint where the logic lives (some crypto initialization vector may be provided by the cookie conveyed during the initial page load). If so, Node.js could compute that set of headers.

Upvotes: 1

undetected Selenium

Reputation: 193108

I modified your code a bit, adding a couple of optional arguments, and on execution I got the following result:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.westernunion.com/de/en/web/send-money')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#button-fraud-warning-accept"))).click()
    python_button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#country")))
    python_button.click()
    python_button.send_keys("Argentina")
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#smoExchangeRate"))).text)
    
  • Console Output:

    1.00 EUR = Argentine Peso (ARS)
    
  • Observation: My observation was similar to yours: the exchange rate wasn't shown.
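To detect this situation in a script rather than by eye, one option is to check whether the scraped text carries a number after the equals sign. This is a minimal sketch: `parse_rate` is a hypothetical helper, and the "populated" format used in the usage note is an assumption about what the page shows when the rate does load, with a made-up value.

```python
import re

# A number (with optional decimal part) directly after the "=" sign.
RATE_PATTERN = re.compile(r"=\s*(\d+(?:[.,]\d+)?)")


def parse_rate(text):
    """Return the rate as a float, or None when no number follows the '='."""
    match = RATE_PATTERN.search(text)
    if match is None:
        return None
    # Accept a decimal comma as well as a decimal point.
    return float(match.group(1).replace(",", "."))
```

With the console output above, `parse_rate("1.00 EUR = Argentine Peso (ARS)")` returns `None`, while a string such as `"1.00 EUR = 95.1234 Argentine Peso (ARS)"` (illustrative only) would yield a float.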


Deep Dive

While inspecting the DOM tree of the webpage, you will find that some of the <script> and <link> tags refer to resources whose paths contain the keyword dist. As an example:

  • <script src="/content/wucom/dist/2.7.1.8f57d9b1/js/smo-configs/smo-config.de.js"></script>
  • <link rel="stylesheet" type="text/css" href="/content/wucom/dist/2.7.1.8f57d9b1/css/responsive_css.min.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/liveform-web-vendor-f84dfc85d6.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/kampyle/liveform-web-style-a4ce961d15.css">
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-vendor-919a2c71c3.js"></script>
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-app-2c4e3adeb6.js"></script>

This is a clear indication that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver gets detected and subsequently blocked.
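The inspection step can be roughly automated with the standard library's `html.parser`. This is a sketch under assumptions: `DistAssetFinder` and `find_dist_assets` are hypothetical names, and the code only applies the substring heuristic described here (a `/dist/` path segment); it does not prove which service sits behind the page.

```python
from html.parser import HTMLParser


class DistAssetFinder(HTMLParser):
    """Collect <script src> and <link href> values containing a /dist/ segment."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "link"):
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name in ("src", "href") and value and "/dist/" in value:
                self.assets.append((tag, value))


def find_dist_assets(html):
    """Return (tag, url) pairs for every matching script or stylesheet."""
    finder = DistAssetFinder()
    finder.feed(html)
    return finder.assets
```

In a Selenium session, it could be fed the rendered markup, e.g. `find_dist_assets(driver.page_source)`.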


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".



Upvotes: 2
