Reputation: 193
How could I get all the categories mentioned on each listing page of the same website, "https://www.sfma.org.sg/member/category"? For example, when I choose the Alcoholic Beverage category on the above-mentioned page, the listings on that page have category information like this:
Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier
How can I extract the categories mentioned here within the same variable?
The code I have written for this is:
category = soup_2.find_all('a', attrs ={'class' :'clink'})
links = [links['href'] for links in category]
cat_name = [cat_name.text.strip() for cat_name in links]
but it is producing the output below, which is all the links on the page and not the text within the href:
['http://www.sfma.org.sg/about/singapore-food-manufacturers-association',
'http://www.sfma.org.sg/about/council-members',
'http://www.sfma.org.sg/about/history-and-milestones',
'http://www.sfma.org.sg/membership/',
'http://www.sfma.org.sg/member/',
'http://www.sfma.org.sg/member/alphabet/',
'http://www.sfma.org.sg/member/category/',
'http://www.sfma.org.sg/resources/sme-portal',
'http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore',
'http://www.sfma.org.sg/resources/import-export-requirements-and-procedures',
'http://www.sfma.org.sg/resources/labelling-guidelines',
'http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes',
'http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard',
'http://www.sfma.org.sg/resources/p-max',
'http://www.sfma.org.sg/event/',
.....]
What I need is the below data for all the listings of all the categories on the base URL, which is "https://www.sfma.org.sg/member/category/":
['Ang Leong Huat Pte Ltd',
'16 Tagore Lane
Singapore (787476)',
'Tel: +65 6749 9988',
'Fax: +65 6749 4321',
'Email: [email protected]',
'Website: http://www.alh.com.sg/',
'Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier']
Please excuse me if the question seems novice; I am just very new to Python.
Thanks !!!
Upvotes: 0
Views: 85
Reputation: 84465
The following targets the two javascript objects housing mapping info about company names, categories, and the shown tags (e.g. bakery product). For more detailed info on the use of regex and splitting item['category'], see my SO answer here.
It handles unquoted keys with the hjson library.
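As a quick sketch of why hjson is used (the script excerpt below is a made-up sample, not the live page), the regex captures the object literal and hjson copes with the unquoted keys that json.loads would reject:

import re
import hjson

# Hypothetical excerpt of the kind of inline <script> content on the page
sample = 'var tmObject = {tmember: [{permalink: "ang-leong-huat", category: "40,45"}]};'

p = re.compile(r'var tmObject = (.*?);')
obj = hjson.loads(p.findall(sample)[0])  # unquoted keys parse fine with hjson
print(obj['tmember'][0]['category'])     # -> 40,45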
You end up with a dict whose keys are the company names (I use the permalink version of the name, rather than name, as this should definitely be unique), and whose values are a tuple with 2 items. The first item is the company page link; the second is a list of the given tags (e.g. bakery product, alcoholic beverage). The logic is there for you to re-organise as desired.
import re
import requests
from bs4 import BeautifulSoup as bs
import hjson

base = 'https://www.sfma.org.sg/member/info/'

# Regexes that capture the two javascript object literals in the page source
p = re.compile(r'var tmObject = (.*?);')
p1 = re.compile(r'var ddObject = (.*?);')

r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
lookup_data = hjson.loads(p1.findall(r.text)[0])

# Map category ids to their display names
name_dict = {item['id']: item['name'] for item in lookup_data['category']}

companies = {}
for item in data['tmember']:
    companies[item['permalink']] = (base + item['permalink'],
                                    [name_dict[i] for i in item['category'].split(',')])

print(companies)
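If you want that mapping in a flat file, a rough follow-up sketch (the csv usage and the output filename here are my own additions, not part of the original script) could be:

import csv

# Write each company's permalink, page link and category names to a CSV row
with open('sfma_categories.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['permalink', 'link', 'categories'])
    for permalink, (link, tags) in companies.items():
        writer.writerow([permalink, link, ', '.join(tags)])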
Updating for your additional request at the end (address info etc.):
I then loop over the companies dict, visiting each company URL (item 1 of the tuple value for the current dict key), extract the required info into a dict, add the category info to that dict, then update the current key:value with the dictionary just created.
import re
import requests
from bs4 import BeautifulSoup as bs
import hjson

base = 'https://www.sfma.org.sg/member/info/'

p = re.compile(r'var tmObject = (.*?);')
p1 = re.compile(r'var ddObject = (.*?);')

r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
lookup_data = hjson.loads(p1.findall(r.text)[0])
name_dict = {item['id']: item['name'] for item in lookup_data['category']}

companies = {}
for item in data['tmember']:
    companies[item['permalink']] = (base + item['permalink'],
                                    [name_dict[i] for i in item['category'].split(',')])

with requests.Session() as s:
    for k, v in companies.items():
        r = s.get(v[0])
        soup = bs(r.content, 'lxml')

        # Contact details sit in p tags following the .w3-text-sfma element
        tel = soup.select_one('.w3-text-sfma ~ p:contains(Tel)')
        fax = soup.select_one('.w3-text-sfma ~ p:contains(Fax)')
        email = soup.select_one('.w3-text-sfma ~ p:contains(Email)')
        website = soup.select_one('.w3-text-sfma ~ p:contains(Website)')

        if tel is None:
            tel = 'N/A'
        else:
            tel = tel.text.replace('Tel: ', '')
        if fax is None:
            fax = 'N/A'
        else:
            fax = fax.text.replace('Fax: ', '')
        if email is None:
            email = 'N/A'
        else:
            email = email.text.replace('Email: ', '')
        if website is None:
            website = 'N/A'
        else:
            website = website.text.replace('Website: ', '')

        info = {
            # 'Address': ' '.join([i.text for i in soup.select('.w3-text-sfma ~ p:not(p:nth-child(n+4) ~ p)')])
            'Address': ' '.join([i.text for i in soup.select('.w3-text-sfma ~ p:nth-child(-n+4)')]),
            'Tel': tel,
            'Fax': fax,
            'Email': email,
            'Website': website,
            'Categories': v[1]
        }
        companies[k] = info
Example entry in the companies dict:
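For illustration only, an entry ends up shaped like this (every field value below is an invented placeholder, not scraped data):

# Shape of one companies entry after the update; all values are placeholders
'some-company-permalink': {
    'Address': '16 Example Lane Singapore (000000)',
    'Tel': '+65 0000 0000',
    'Fax': '+65 0000 0000',
    'Email': 'info@example.com',
    'Website': 'http://www.example.com/',
    'Categories': ['Alcoholic Beverage', 'Wine', 'Importer']
}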
Upvotes: 1