Vishal Gahlot

Reputation: 48

How to scrape data from an HTML table in Python

<tr class="even">
<td><strong><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'>DEED<br/>
2016002023</a></strong></td>
<td><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'><b> Recording Date: </b>01/12/2016 08:05:17 AM&nbsp;&nbsp;&nbsp;<b>Book Page: </b> <table cellspacing=0 width="100%"><tr><td width="50%"  valign="top"><b>Grantor:</b> ARELLANO ISAIAS</td><td width="50%"  valign="top"><b>Grantee:</b> ARELLANO ISAIAS, ARELLANO ALICIA</td></tr></table>
<b>Number Pages:</b> 3<br></a></td>
<td></td>
<td></td></tr>

I am new to Python and scraping; please help me scrape the data from this table. To log in, go to the public login and then enter the to and from dates.

Data Model: The data model has columns in this specific order and casing: "record_date", "doc_number", "doc_type", "role", "name", "apn", "transfer_amount", "county", and "state". The "role" column will be either "Grantor" or "Grantee", depending on where the name is assigned. If there are multiple grantor or grantee names, put each name on its own row and copy the recording date, document number, document type, role, and APN.
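A minimal sketch of that expansion rule: each output row copies the shared fields and varies only the role and name. The field values below are illustrative, taken from the posted row; the county and state are my guesses based on the Sonoma County URL.

```python
# Shared fields copied onto every expanded row (values illustrative).
shared = {
    "record_date": "01/12/2016 08:05:17 AM",
    "doc_number": "2016002023",
    "doc_type": "DEED",
    "apn": None,
    "transfer_amount": None,
    "county": "Sonoma",   # assumed from the URL
    "state": "CA",        # assumed from the URL
}
grantors = ["ARELLANO ISAIAS"]
grantees = ["ARELLANO ISAIAS", "ARELLANO ALICIA"]

# One row per name, with role set by which list the name came from.
rows = [
    {**shared, "role": role, "name": name}
    for role, names in (("Grantor", grantors), ("Grantee", grantees))
    for name in names
]
for row in rows:
    print(row["role"], row["name"])
# Grantor ARELLANO ISAIAS
# Grantee ARELLANO ISAIAS
# Grantee ARELLANO ALICIA
```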

https://crarecords.sonomacounty.ca.gov/recorder/eagleweb/docSearchResults.jsp?searchId=0

Upvotes: 0

Views: 2034

Answers (2)

Everett

Reputation: 9558

I know this is an old question, but one underrated secret for this task is pandas' read_clipboard function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html

Under the hood it simply feeds the clipboard text to read_csv, and the interface for simple usage is very straightforward. Consider this simple script:

# 1. Go to a website, e.g. https://www.wunderground.com/hurricane/hurrarchive.asp?region=ep
# 2. Highlight the table of data, e.g. of Hurricanes in the East Pacific
# 3. Copy the text from your browser
# 4. Run this script: the data will be available as a dataframe
import pandas as pd
df = pd.read_clipboard()
print(df)

Granted, this solution requires user interaction, but I've found it useful in many cases where there is no convenient CSV download or API endpoint.
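A non-interactive cousin of read_clipboard is pandas.read_html, which parses every table element in an HTML string or page into a list of DataFrames (it needs a parser backend such as lxml, html5lib, or BeautifulSoup installed). A minimal sketch with a made-up table:

```python
from io import StringIO

import pandas as pd

# pandas.read_html returns one DataFrame per <table> found.
# Wrapping the literal HTML in StringIO avoids the deprecation of
# passing raw strings in newer pandas versions.
html = """
<table>
  <tr><th>Grantor</th><th>Grantee</th></tr>
  <tr><td>ARELLANO ISAIAS</td><td>ARELLANO ALICIA</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))
print(tables[0])
```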

Upvotes: 1

The HTML you posted does not contain all of the column fields listed in your Data Model. However, for the fields it does contain, the following will produce a Python dictionary from which you can populate the Data Model:

import urllib.request
from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape" # Replace with the URL of your webpage

with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

table = soup.find("tr", attrs={"class":"even"})

btags = [str(b.text).strip().strip(':') for b in table.find_all("b")]

bsibs = [str(b.next_sibling.replace(u'\xa0', '')).strip() for b in table.find_all('b')]

data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None, "role": None, "name": None, "apn": None, "transfer_amount": None, "county": None, "state": None}

data_model["record_date"] = data['Recording Date']
data_model['role'] = 'Grantee'
data_model['name'] = data['Grantee']

from pprint import pprint
pprint(data_model)

output:

{'apn': None,
 'county': None,
 'doc_number': None,
 'doc_type': None,
 'name': 'ARELLANO ISAIAS, ARELLANO ALICIA',
 'record_date': '01/12/2016 08:05:17 AM',
 'role': 'Grantee',
 'state': None,
 'transfer_amount': None}

With this you can do:

print(data_model['record_date']) # 01/12/2016 08:05:17 AM
print(data_model['name'])        # ARELLANO ISAIAS, ARELLANO ALICIA
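The document type and number live in the first td element's link of the row you posted, separated by a br tag. A sketch of pulling them out with BeautifulSoup (the row HTML is inlined here so the example is self-contained):

```python
from bs4 import BeautifulSoup

row_html = """<tr class="even">
<td><strong><a href='../eagleweb/viewDoc.jsp?node=DOC186S8881'>DEED<br/>
2016002023</a></strong></td>
</tr>"""

row = BeautifulSoup(row_html, "html.parser")
# stripped_strings yields the text on either side of the <br/>,
# already trimmed of surrounding whitespace.
doc_type, doc_number = row.find("a").stripped_strings
print(doc_type, doc_number)  # DEED 2016002023
```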

Hope this helps.

Upvotes: 1
