MITHU

Reputation: 164

Can't grab a phone number from a webpage

I've written a Python script to fetch a phone number from a webpage, but I can't work out how to grab it because the number is rendered as an image.

Website link

This is how that number is displayed on that page:

[Screenshot: the phone number as it appears on the page, rendered as an image]

Here is what I've written so far:

import requests
from bs4 import BeautifulSoup

url = "use_above_link"

def get_phone_number(link):
    resp = requests.get(link)
    soup = BeautifulSoup(resp.text,"lxml")
    phone = soup.select_one("img.phone-num-img")['src']
    print(phone)

if __name__ == '__main__':
    get_phone_number(url)

How can I scrape this very phone number from that webpage?

Upvotes: 1

Views: 455

Answers (2)

QHarr

Reputation: 84475

Here you go.

The first clue is the following HTML, which suggests that the phone number is base64 encoded:

[Screenshot: the HTML in question]

The base64-encoded value of that phone number is MDA5NzE1MjE3NjQ4MDY=. This value is not present on the initial page, but it is present at one of the other URLs you can extract from the initial page's HTML.
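As a quick sanity check of the base64 claim, the encoded string quoted above can be decoded on its own. This is only a minimal sketch; the helper name `decode_tel` is made up for illustration and is not part of the site or of any library.

```python
import base64


def decode_tel(encoded):
    """Decode a base64 string (such as a data-tel value) into plain text."""
    # b64decode returns bytes, so decode them to get a printable string.
    return base64.b64decode(encoded).decode("ascii")


print(decode_tel("MDA5NzE1MjE3NjQ4MDY="))  # prints 00971521764806
```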

Issue a second request to that URL, target the [data-tel] attribute, which is where the encoded string is stored, then extract the base64-encoded string and decode it.

import requests
from bs4 import BeautifulSoup as bs
import base64

with requests.Session() as s:
    r = s.get('https://dubai.dubizzle.com/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber')
    soup = bs(r.content, 'lxml')
    link = 'https://dubai.dubizzle.com' + soup.select_one('[media][href$=shownumber]')['href']
    r = s.get(link)
    soup = bs(r.content, 'lxml')
    encoded = soup.select_one('[data-tel]')['data-tel']
    tel = base64.b64decode(encoded).decode('utf-8')  # b64decode returns bytes
    print(tel)

Notes:

It looks like the rel="alternate" link (the second URL) is simply the mobile version of the page, so you can issue just one request by inserting /m/ into the original URL, i.e.

https://dubai.dubizzle.com/m/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber#

Code then simplifies to:

import requests
from bs4 import BeautifulSoup as bs
import base64

r = requests.get('https://dubai.dubizzle.com/m/motors/used-cars/hyundai/accent/2018/6/8/hyundai-accent-excellent-condition-still-u-2/?back=L21vdG9ycy91c2VkLWNhcnMvP3BhZ2U9MzUmcHJpY2VfX2d0ZT0mcHJpY2VfX2x0ZT0meWVhcl9fZ3RlPSZ5ZWFyX19sdGU9JmtpbG9tZXRlcnNfX2d0ZT0ma2lsb21ldGVyc19fbHRlPSZzZWxsZXJfdHlwZT1PVyZrZXl3b3Jkcz0maXNfYmFzaWNfc2VhcmNoX3dpZGdldD0wJmlzX3NlYXJjaD0xJnBsYWNlc19faWRfX2luPSZwbGFjZXNfX2lkX19pbj01OSUyQzkwJTJDMTMzJTJDMTA2JTJDMTg4JTJDJmFkZGVkX19ndGU9JmF1dG9fYWdlbnQ9&shownumber')
soup = bs(r.content, 'lxml')
encoded = soup.select_one('[data-tel]')['data-tel']
tel = base64.b64decode(encoded).decode('utf-8')  # b64decode returns bytes
print(tel)
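If you have many listing URLs, the /m/ substitution from the notes could be wrapped in a small helper rather than edited by hand. This is a rough sketch only: `to_mobile_url` is a hypothetical name, and it assumes the mobile page always lives at the same path prefixed with /m, which is what the single example above suggests but has not been checked across the site.

```python
from urllib.parse import urlsplit, urlunsplit


def to_mobile_url(url):
    """Return url with /m inserted at the start of its path (assumed mobile form)."""
    parts = urlsplit(url)
    path = parts.path
    # Leave the URL alone if it already points at the mobile path.
    if not path.startswith("/m/"):
        path = "/m" + path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

The query string (including the trailing shownumber flag) is carried over unchanged, so the result can be passed straight to requests.get.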

Upvotes: 1

nosh

Reputation: 600

1. Use a paid OCR service

The quickest way to solve this problem is to use an OCR service. The downside is that these services are not free.

For example, set up a Google Cloud project and enable the Vision API. Instructions here. Then pass the URL of the image you acquired to the API and get the digits back.

import requests
from bs4 import BeautifulSoup
from google.cloud import vision

url = "use_above_link"
client = vision.ImageAnnotatorClient()

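def get_phone_number(link):
  resp = requests.get(link)
  soup = BeautifulSoup(resp.text,"lxml")
  phone_src_url = soup.select_one("img.phone-num-img")['src']
  print(phone_src_url)
  # Written against the pre-2.0 google-cloud-vision client; newer releases
  # are believed to drop vision.enums, so adjust this call if it fails.
  response = client.annotate_image({
    'image': {'source': {'image_uri': phone_src_url }},
    'features': [{'type': vision.enums.Feature.Type.TEXT_DETECTION}],
  })
  # The first text annotation, if any, holds the full detected text.
  if response.text_annotations:
    print(response.text_annotations[0].description.strip())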

if __name__ == '__main__':
  get_phone_number(url)

2. Use OpenCV

This method involves writing a lot of code yourself. The main assumption is that you will only be parsing dubizzle links, in which case the font used for the phone numbers is consistent. You will have to reduce the image of each digit from 0 to 9 to recognisable curves, and then detect those curves in each phone-number image. Detailed instructions here.

Find and cut out 10 images, one for each digit. This will be your master set. Then match the master images against each phone-number image by following the tutorial linked above. Finally, order the matched digits from left to right according to the position of each match.
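The final ordering step might look something like the sketch below. It assumes the matching step (for example, OpenCV template matching against the 10 master images) has already produced a list of (x, digit, score) tuples; that tuple layout, the function name `assemble_number`, and the `min_gap` threshold are all illustrative assumptions, not something taken from the tutorial.

```python
def assemble_number(matches, min_gap=5):
    """Order digit matches left to right and join them into a string.

    matches: iterable of (x, digit, score) tuples, the hypothetical output
    of a template-matching step. Matches closer than min_gap pixels are
    treated as duplicates of one glyph; only the best-scoring one is kept.
    """
    kept = []
    for x, digit, score in sorted(matches):
        if kept and x - kept[-1][0] < min_gap:
            # Same glyph matched twice: keep whichever match scored higher.
            if score > kept[-1][2]:
                kept[-1] = (x, digit, score)
            continue
        kept.append((x, digit, score))
    return "".join(str(digit) for _, digit, _ in kept)


# Three real digits plus one weak duplicate match near x=30.
demo = [(30, 9, 0.98), (10, 0, 0.99), (20, 0, 0.97), (31, 8, 0.60)]
print(assemble_number(demo))  # prints 009
```

A sensible value for `min_gap` depends on the glyph width in the actual images, so it would need tuning against real samples.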

Upvotes: 0
