Reputation: 193
How could I get all the categories mentioned on each listing page of the same website, "https://www.sfma.org.sg/member/category"? For example, when I choose the Alcoholic Beverage category on the above-mentioned page, the listings on that page have category information like this:
Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier
How can I extract the categories mentioned here within the same variable?
The code I have written for this is:
category = soup_2.find_all('a', attrs ={'class' :'clink'})
links = [links['href'] for links in category]
cat_name = [cat_name.text.strip() for cat_name in links]
but it is producing the output below, which is all the links on the page and not the text within the href:
['http://www.sfma.org.sg/about/singapore-food-manufacturers-association',
'http://www.sfma.org.sg/about/council-members',
'http://www.sfma.org.sg/about/history-and-milestones',
'http://www.sfma.org.sg/membership/',
'http://www.sfma.org.sg/member/',
'http://www.sfma.org.sg/member/alphabet/',
'http://www.sfma.org.sg/member/category/',
'http://www.sfma.org.sg/resources/sme-portal',
'http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore',
'http://www.sfma.org.sg/resources/import-export-requirements-and-procedures',
'http://www.sfma.org.sg/resources/labelling-guidelines',
'http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes',
'http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard',
'http://www.sfma.org.sg/resources/p-max',
'http://www.sfma.org.sg/event/',
.....]
What I need is the below data for all the listings of all the categories on the base URL, which is "https://www.sfma.org.sg/member/category/":
['Ang Leong Huat Pte Ltd',
'16 Tagore Lane
Singapore (787476)',
'Tel: +65 6749 9988',
'Fax: +65 6749 4321',
'Email: [email protected]',
'Website: http://www.alh.com.sg/',
'Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier']
Please excuse me if the question seems novice; I am just very new to Python.
Thanks !!!
Upvotes: 0
Views: 85
Reputation: 84465
The following targets the two javascript objects housing mapping info about company names, categories, and the shown tags (e.g. bakery product). For more detailed info on the use of regex and splitting item['category'], see my SO answer here.
It handles unquoted keys with the hjson library.
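As a quick sketch of why hjson is used (the script excerpt below is a made-up sample, not the live page), the regex captures the object literal and hjson copes with the unquoted keys that json.loads would reject:

import re
import hjson

# Hypothetical excerpt of the kind of inline <script> content on the page
sample = 'var tmObject = {tmember: [{permalink: "ang-leong-huat", category: "40,45"}]};'

p = re.compile(r'var tmObject = (.*?);')
obj = hjson.loads(p.findall(sample)[0])  # unquoted keys parse fine with hjson
print(obj['tmember'][0]['category'])     # -> 40,45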
You end up with a dict whose keys are the company names (I use the permalink version of the name, rather than name, as this should definitely be unique), and whose values are a tuple with 2 items. The first item is the company page link; the second is a list of the given tags (e.g. bakery product, alcoholic beverage). The logic is there for you to re-organise as desired.
import re
import requests
from bs4 import BeautifulSoup as bs
import hjson

base = 'https://www.sfma.org.sg/member/info/'

# Regexes that capture the two javascript object literals in the page source
p = re.compile(r'var tmObject = (.*?);')
p1 = re.compile(r'var ddObject = (.*?);')

r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
lookup_data = hjson.loads(p1.findall(r.text)[0])

# Map category ids to their display names
name_dict = {item['id']: item['name'] for item in lookup_data['category']}

companies = {}
for item in data['tmember']:
    companies[item['permalink']] = (base + item['permalink'],
                                    [name_dict[i] for i in item['category'].split(',')])

print(companies)
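If you want that mapping in a flat file, a rough follow-up sketch (the csv usage and the output filename here are my own additions, not part of the original script) could be:

import csv

# Write each company's permalink, page link and category names to a CSV row
with open('sfma_categories.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['permalink', 'link', 'categories'])
    for permalink, (link, tags) in companies.items():
        writer.writerow([permalink, link, ', '.join(tags)])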
Updating for your additional request at the end (address info etc.):
I then loop over the companies dict, visiting each company URL (item 1 of the tuple value for the current dict key), extract the required info into a dict, add the category info to that dict, then update the current key:value with the dictionary just created.
import re
import requests
from bs4 import BeautifulSoup as bs
import hjson

base = 'https://www.sfma.org.sg/member/info/'

p = re.compile(r'var tmObject = (.*?);')
p1 = re.compile(r'var ddObject = (.*?);')

r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
lookup_data = hjson.loads(p1.findall(r.text)[0])
name_dict = {item['id']: item['name'] for item in lookup_data['category']}

companies = {}
for item in data['tmember']:
    companies[item['permalink']] = (base + item['permalink'],
                                    [name_dict[i] for i in item['category'].split(',')])

with requests.Session() as s:
    for k, v in companies.items():
        r = s.get(v[0])
        soup = bs(r.content, 'lxml')

        # Contact details sit in p tags following the .w3-text-sfma element
        tel = soup.select_one('.w3-text-sfma ~ p:contains(Tel)')
        fax = soup.select_one('.w3-text-sfma ~ p:contains(Fax)')
        email = soup.select_one('.w3-text-sfma ~ p:contains(Email)')
        website = soup.select_one('.w3-text-sfma ~ p:contains(Website)')

        if tel is None:
            tel = 'N/A'
        else:
            tel = tel.text.replace('Tel: ', '')
        if fax is None:
            fax = 'N/A'
        else:
            fax = fax.text.replace('Fax: ', '')
        if email is None:
            email = 'N/A'
        else:
            email = email.text.replace('Email: ', '')
        if website is None:
            website = 'N/A'
        else:
            website = website.text.replace('Website: ', '')

        info = {
            # 'Address': ' '.join([i.text for i in soup.select('.w3-text-sfma ~ p:not(p:nth-child(n+4) ~ p)')])
            'Address': ' '.join([i.text for i in soup.select('.w3-text-sfma ~ p:nth-child(-n+4)')]),
            'Tel': tel,
            'Fax': fax,
            'Email': email,
            'Website': website,
            'Categories': v[1]
        }
        companies[k] = info
Example entry in the companies dict:
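For illustration only, an entry ends up shaped like this (every field value below is an invented placeholder, not scraped data):

# Shape of one companies entry after the update; all values are placeholders
'some-company-permalink': {
    'Address': '16 Example Lane Singapore (000000)',
    'Tel': '+65 0000 0000',
    'Fax': '+65 0000 0000',
    'Email': 'info@example.com',
    'Website': 'http://www.example.com/',
    'Categories': ['Alcoholic Beverage', 'Wine', 'Importer']
}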
Upvotes: 1