Reputation: 123
I'm unable to extract the data I need from a single tag. Each tag holds several pieces of data, such as name, phone, company and URL. I need to extract these from many tags, all structured like this one.
HTML Code:
<div class="ListingDetails">
<div class="ListingDisplayName">
<a href="/members/jeremy.counter1/default.aspx">
Jeremy Counter
</a>
</div>
Mortgage Officer -
American Pacific Mortgage<br>
Anchorage, Alaska 99503<br>
phone: (907) 519-
6656 | (907) 250-0766
<div class="listingurl">
<a rel="nofollow" href="http://www.jeremycounter.com" target="_blank">
jeremycounter.com
</a>
</div>
</div>
Python Code:
data=requests.get(url)
soup=bs4.BeautifulSoup(data.text,'html.parser')
page = soup.find('div', class_="CommonContentBox DirectoryListings")
listing_box = page.find('div', class_="BusinessListingUser")
name = listing_box.find('div', class_="ListingDisplayName").text
#print(name)
details = listing_box.find('div',
class_="ListingDetails").text.strip('\n\t\r')
print(details)
Output:
Tyler Tullis
-
Montgomery, Alabama 36117
| (334) 322-3707
Can anyone suggest the best way to extract this data?
Expected result:
name: Jeremy Counter
phone: (907) 519-6656
company: American Pacific Mortgage
url: jeremycounter.com
Upvotes: 0
Views: 58
Reputation: 28575
No need for Selenium here. Just pull the data and iterate through it to clean it and print it:
import requests
import bs4
url = "http://www.mortgagenewsdaily.com/directory/mortgage/alabama"
data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
page = soup.find_all('div', class_="BusinessListingUser")
for each in page:
    content = each.find('div', class_='ListingDetails').text.split('\n')
    content = [text.strip() for text in content if text.strip() != '']
    for strings in content:
        print(strings)
    print('\n')
Output:
Tyler Tullis
-
Montgomery, Alabama 36117
| (334) 322-3707
Nathan Stotlar
Mortgage Production Manager - PrimeLending, a PlainsCapital Company
Fitchburg, Wisconsin 53717
phone: (608) 467-4249
nathanstotlar.com
Anna Mendonca
Mortgage Loan Originator - CrossCountry Mortgage, Inc
Wakefield , Massachusetts 01880
phone: (781) 618-3154 | (781) 290-6383
myccmhomeloan.com/Default.aspx
Pouyan Broukhim
Owner - Probate Funding, Inc.
Los Angeles, California 90048
phone: (323) 935-5577
probatefunding.com
...
ADDITIONAL: to collect the fields into a pandas DataFrame and save them to Excel:
import requests
import bs4
import pandas as pd
url = "http://www.mortgagenewsdaily.com/directory/mortgage/alabama"
data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
page = soup.find_all('div', class_="BusinessListingUser")
columns = ['name', 'company', 'location', 'phone', 'website']
rows = []
for each in page:
    content = each.find('div', class_='ListingDetails').text.split('\n')
    content = [text.strip() for text in content if text.strip() != '']
    # pad with 'N/A' so listings with fewer than five lines still fill every column
    content += ['N/A'] * (len(columns) - len(content))
    rows.append(content[:len(columns)])
# DataFrame.append was removed in pandas 2.0, so build the frame once from all rows
results = pd.DataFrame(rows, columns=columns)
results.to_excel('C:/file.xlsx', index=False)
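The `company` column above still holds the raw "title - company" line, or a bare `-` when both parts are missing (as for Tyler Tullis in the output). A minimal sketch of how that line could be split, assuming the ` - ` separator seen in the listings; `split_title_company` is a hypothetical helper name, not part of the answer's code:

```python
def split_title_company(line):
    """Split 'Title - Company' into a (title, company) tuple."""
    line = line.strip()
    if ' - ' in line:
        title, _, company = line.partition(' - ')
        return title.strip(), company.strip()
    # No ' - ' separator: a bare '-' (both parts missing) or a title only
    return line.strip('- '), ''


for raw in ['Owner - Probate Funding, Inc.', '-']:
    print(split_title_company(raw))
# ('Owner', 'Probate Funding, Inc.')
# ('', '')
```

A line without any separator is treated as a title here; that choice is a guess, since the listings shown do not include such a case.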
Upvotes: 1
Reputation: 33384
Based on your HTML, you can try the following code.
from bs4 import BeautifulSoup
data='''<div class="ListingDetails">
<div class="ListingDisplayName">
<a href="/members/jeremy.counter1/default.aspx">
Jeremy Counter
</a>
</div>
Mortgage Officer -
American Pacific Mortgage<br>
Anchorage, Alaska 99503<br>
phone: (907) 519-
6656 | (907) 250-0766
<div class="listingurl">
<a rel="nofollow" href="http://www.jeremycounter.com" target="_blank">
jeremycounter.com
</a>
</div>
</div>'''
soup=BeautifulSoup(data,'html.parser')
items= soup.find_all('div', class_="ListingDetails")
for item in items:
    print("name: " + item.find('a').text.strip())
    print('company: ' + item.find_all('br')[0].previous_element.strip().split('-')[1].strip())
    print('Phone: ' + item.find_all('br')[1].next_element.strip().split('|')[0].strip())
    print('url: ' + item.find('div', class_='listingurl').find('a').text.strip())
Output:
name: Jeremy Counter
company: American Pacific Mortgage
Phone: phone: (907) 519-
6656
url: jeremycounter.com
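The phone value above still carries the `phone:` label and the line break from the source HTML. One way to normalize it is a regular expression that tolerates whitespace inside the number. This is only a sketch: `extract_phones` is a hypothetical helper, and the pattern assumes US-style `(xxx) xxx-xxxx` numbers like the ones in the listings:

```python
import re

# Area code in parentheses, then the two digit groups; \s* allows a line break
# inside the number, as in the wrapped HTML from the question.
PHONE_RE = re.compile(r'\((\d{3})\)\s*(\d{3})-\s*(\d{4})')


def extract_phones(text):
    """Return every phone number found in text, reformatted as '(xxx) xxx-xxxx'."""
    return ['({}) {}-{}'.format(*m.groups()) for m in PHONE_RE.finditer(text)]


raw = 'phone: (907) 519-\n6656 | (907) 250-0766'
print('phone: ' + extract_phones(raw)[0])
# phone: (907) 519-6656
```

Passing the text node that follows the second `<br>` to this helper would give the single-line phone value from the expected result.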
Upvotes: 0
Reputation: 5730
You can use Selenium for this task:
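from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import os
# set up the path to the Chrome driver
chrome_driver = os.path.join(os.getcwd(), 'chromedriver')
# initialise the Chrome driver (Selenium 4 takes the driver path via Service)
browser = webdriver.Chrome(service=Service(chrome_driver))
# load url
url = 'http://www.mortgagenewsdaily.com/directory/mortgage/alabama'
browser.get(url)
# find all elements (the find_elements_by_xpath helper is gone in recent Selenium 4 releases)
content = browser.find_elements(By.XPATH, '//*[@id="CommonContentInner"]/div/div/div/div/div')
# get text from each element
collected_data = []
for item in content:
    personal_data = item.get_attribute("innerText")
    collected_data.append(personal_data)
# drop empty entries; list() because filter() returns a lazy iterator in Python 3
collected_data = list(filter(None, collected_data))
browser.quit()
# print each entry preceded by a separator line
for entry in collected_data:
    print('-----------')
    print(entry)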
Output:
-----------
Tyler Tullis
-
Montgomery, Alabama 36117
| (334) 322-3707
-----------
Nathan Stotlar
Mortgage Production Manager - PrimeLending, a PlainsCapital
Company
Fitchburg, Wisconsin 53717
phone: (608) 467-4249
nathanstotlar.com
-----------
.
.
.
Upvotes: 0