user3802773

Reputation: 51

Logic for a Python web scraper for business names

I am new to Python and was wondering whether there is a way to get the business name behind a website through a Python script.

I have thousands of businesses whose names I need to validate, and I was wondering whether this could be scaled up by looking at each business's website or address and finding the registered business name for it.

I want to ask here whether this is even possible before I spend time researching it.

Thank you in advance for any help.

Upvotes: 1

Views: 2148

Answers (1)

chishaku

Reputation: 4643

In certain cases, the page title of the website homepage could be an approximation of the full business name.

The following is a very simple example of requesting a website's homepage and returning the text of its <title> tag as an approximation of the business name. You need to install the requests and lxml libraries.

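import requests
from io import BytesIO  # Python 3 replacement for the old StringIO module
from lxml import etree

parser = etree.HTMLParser()

urls = ['http://google.com', 'http://facebook.com', 'http://stackoverflow.com']
for url in urls:
    # Fetch the homepage; the timeout keeps one slow site from stalling the loop
    r = requests.get(url, timeout=10)
    # Parse the raw bytes so lxml can detect the page encoding itself
    tree = etree.parse(BytesIO(r.content), parser)
    # xpath returns a list of text nodes (empty if the page has no <title>)
    title = tree.xpath('//title/text()')
    print(url, title)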

>>>
http://google.com ['Google']
http://facebook.com ['Welcome to Facebook - Log In, Sign Up or Learn More']
http://stackoverflow.com ['Stack Overflow']
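As the Facebook result shows, a title often carries a tagline after the site name. A rough way to trim it is to keep only the text before the first separator. This is just a sketch: the helper name clean_title and the choice of separators (hyphen, pipe, dashes) are my own assumptions, and it will get some sites wrong.

```python
import re

# Assumed separators that often divide a site name from a tagline:
# hyphen, pipe, en dash, em dash, each surrounded by whitespace.
_SEPARATORS = re.compile(r"\s+[-|\u2013\u2014]\s+")


def clean_title(titles):
    """Return the first segment of the first non-empty title, or None.

    `titles` is the list returned by tree.xpath('//title/text()').
    """
    for raw in titles:
        text = " ".join(raw.split())  # collapse newlines and repeated spaces
        if text:
            # Keep only what comes before the first separator, if any
            return _SEPARATORS.split(text, maxsplit=1)[0]
    return None


print(clean_title(['Welcome to Facebook - Log In, Sign Up or Learn More']))
# Welcome to Facebook
print(clean_title(['Stack Overflow']))
# Stack Overflow
print(clean_title([]))
# None
```

Treat the result as a candidate to be checked by hand, not as a validated name.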

In other cases, you might want to navigate to a 'Legal' or 'Contact Us' page if you need to find the full legal business name. That's much trickier because the name isn't necessarily associated with any HTML tag; it's likely just free text floating somewhere on the page.
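One possible heuristic for that free text is to look for runs of capitalized words that end in a legal-entity suffix. The sketch below assumes you have already extracted the page text as a plain string; the suffix list, the function name find_legal_names, and the company names in the sample text are all made up for illustration, and the pattern will both miss names and pick up false positives.

```python
import re

# Assumed, non-exhaustive list of legal-entity suffixes
SUFFIXES = r"(?:Inc|LLC|Ltd|Corp|GmbH|PLC)"

# One to five capitalized words (each optionally followed by a comma),
# then one of the suffixes above as a whole word.
LEGAL_NAME = re.compile(
    r"\b(?:[A-Z][\w&'-]*,?\s+){1,5}?" + SUFFIXES + r"\b"
)


def find_legal_names(text):
    """Return candidate legal names found in free text, in order."""
    return [m.group(0) for m in LEGAL_NAME.finditer(text)]


# Hypothetical footer text, not taken from any real site
sample = ("Copyright 2014 Example Trading Ltd. All rights reserved. "
          "This site is operated by Acme Widgets, LLC.")
print(find_legal_names(sample))
# ['Example Trading Ltd', 'Acme Widgets, LLC']
```

A capitalized word that happens to precede the name (for example at the start of a sentence) will be swept into the match, so the output still needs a manual check or a comparison against an official business registry.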

Upvotes: 1
