midickinson

Reputation: 79

Regular expression to extract city, state, and country from HTML

I am using OutWit Hub to scrape a website for city, state, and country (USA and Canada only). The program lets me use regular expressions to define the markers Before and After the text I want to grab, and to define a Format for the desired text.

Here is a sample of the HTML:

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

I have set up my regular expressions as follows:

CITY - Before (not formatted as regex)

<td width="22%" nowrap="nowrap"><strong>

CITY - After (accounts for states, territories, and provinces)

/(,\s|\bA[BLKSZRAEP]\b|\bBC\b|\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b|\bUSA|\bCanada)/

STATE - Before

\<td width="22%" nowrap="nowrap"\>\<strong\>\s|,\s

STATE - After

/\bUSA\<\/strong\>\<\/td\>|\bCanada\<\/strong\>\<\/td\>/

STATE - Format

/\b[A-Z][A-Z]\b/

COUNTRY - Before (accounts for states, territories, and provinces)

/(\bA[BLKSZRAEP]\b|\bBC\b|\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b)\s/

COUNTRY - After (not formatted as regex)

</strong></td><td width="10%" align="right" nowrap="nowrap">

The issue arises when no city or state is listed. I have tried to account for this, but I am only making it worse. Is there any way to clean this up and still account for the possibility of missing info?

Example with no city:

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

Example with no city / state: (yes, there is an extra line break)

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>

USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

Thank you for any help you can provide.

Upvotes: 0

Views: 1574

Answers (2)

aljo

Reputation: 26

Here is what you can do if you have the pro version:

Description: Data
Before: <td width="22%" nowrap="nowrap"><strong>
After: </strong>
Format: (([\w \-]+),)? ?([A-Z]{2})?[\r\n](USA|canada)\s*
Replace: \2##\3##\4
Separator: ##
Labels: City,State,Country
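
To check the idea outside OutWit Hub, here is a minimal Python sketch of the same approach: one pattern in which the city and state groups are optional and only the country is required. It uses Python's `re` module, not OutWit's engine, so the syntax may differ slightly from what the scraper accepts. `SAMPLE_HTML`, `LOCATION`, and `extract_locations` are names made up for this illustration, and the pattern assumes the country is spelled exactly `USA` or `Canada`.

```python
import re

# Assumption: the three samples from the question, joined into one string.
SAMPLE_HTML = '''<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="22%" nowrap="nowrap"><strong>

USA</strong></td>
'''

# City and state are each optional; only the country is required.
LOCATION = re.compile(
    r'<td width="22%" nowrap="nowrap"><strong>\s*'
    r'(?:(?P<city>[\w .\-]+?),\s*)?'   # optional "CITY," part
    r'(?P<state>[A-Z]{2})?\s*'         # optional two-letter state/province
    r'(?P<country>USA|Canada)\s*</strong>'
)


def extract_locations(html):
    """Return (city, state, country) tuples, using "n/a" for missing parts."""
    rows = []
    for match in LOCATION.finditer(html):
        rows.append((
            match.group("city") or "n/a",
            match.group("state") or "n/a",
            match.group("country"),
        ))
    return rows


if __name__ == "__main__":
    for row in extract_locations(SAMPLE_HTML):
        print(row)
```

Because the city class excludes newlines and requires a trailing comma, a lone `MT` line cannot be mistaken for a city, and when the state is absent the regex backtracks so that `USA` is still captured as the country.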

If you are using the light version, you have to do it in three lines:

Description: City
Before: <td width="22%" nowrap="nowrap"><strong>
After: ,
Format: [^<>]+

Description: State
Before: /<td width="22%" nowrap="nowrap"><strong>[\r\n]([^<>\r\n ]+,)?/
After: /[\r\n]/
Format: [A-Z]{2}

Description: Country
Before:
After: </strong></td>
Format: (USA|canada)

Upvotes: 1

Kaz

Reputation: 58568

Here is a solution in TXR, a text-scraping and data-munging language:

@(collect)
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
@  (cases)
@city, @state
@  (or)

@    (bind (city state) ("n/a" "n/a"))
@  (or)
@state
@    (bind city "n/a")
@  (end)
@country</strong></td>
<td width="10%" align="right" nowrap="nowrap">
@(end)
@(output)
CITY       STATE       COUNTRY
@  (repeat)
@{city 10} @{state 11} @country
@  (end)
@(end)

The file city.html contains the three cases catenated together. Run:

$ txr city.txr  city.html
CITY       STATE       COUNTRY
BILLINGS   MT          USA
n/a        MT          USA
n/a        n/a         USA

Another example of TXR HTML scraping: Extract text from HTML Table

Upvotes: 0
