Reputation: 79
I am using OutWit Hub to scrape a website for city, state, and country (USA and Canada only). The program lets me use regular expressions to define the markers Before and After the text I wish to grab. I can also define a format for the desired text.
Here is a sample of the HTML:
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
I have set up my regular expressions as follows:
CITY - Before (not formatted as regex)
<td width="22%" nowrap="nowrap"><strong>
CITY - After (accounts for states, territories, and provinces)
/(,\s|\bA[BLKSZRAEP]\b|\bBC\b\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b|\bUSA|\bCanada)/
STATE - Before
\<td width="22%" nowrap="nowrap"\>\<strong\>\s|,\s
STATE - After
/\bUSA\<\/strong\>\<\/td\>|\bCanada\<\/strong\>\<\/td\>/
STATE - Format
/\b[A-Z][A-Z]\b/
COUNTRY - Before (accounts for states, territories, and provinces)
/(\bA[BLKSZRAEP]\b|\bBC\b\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b)\s/
COUNTRY - After (not formatted as regex)
</strong></td><td width="10%" align="right" nowrap="nowrap">
The issue arises when there is no city or state listed. I have tried to account for this, but I am just making it worse. Is there any way this can be cleaned up and still account for the possibility of missing info? Thank you.
Example with no city:
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
Example with no city / state: (yes, there is an extra line break)
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
Thank you for any help you can provide.
Upvotes: 0
Views: 1574
Reputation: 26
Here is what you can do if you have the pro version:
Description: Data
Before: <td width="22%" nowrap="nowrap"><strong>
After: </strong>
Format: (([\w \-]+),)? ?([A-Z]{2})?[\r\n](USA|canada)\s*
Replace: \2##\3##\4
Separator: ##
Labels: City,State,Country
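To check how that Format behaves on the three sample cells before pasting it into OutWit Hub, it can be tried outside the program. The following is only a rough Python sketch, not OutWit's own engine: `split_location` is a hypothetical helper name, "Canada" is capitalized because Python's `re` is case-sensitive by default, and the cell text is assumed to use plain `\n` line endings.

```python
import re

# Adapted from the Format above. Groups 2, 3 and 4 correspond to the
# \2, \3 and \4 used in the Replace field.
# Assumptions: "Canada" is capitalized on the page, and lines end in "\n".
LOCATION = re.compile(r"(([\w \-]+),)? ?([A-Z]{2})?[\r\n](USA|Canada)\s*")


def split_location(cell_text):
    """Return (city, state, country); missing parts come back as None.

    cell_text is the text found between the Before and After markers.
    """
    match = LOCATION.search(cell_text)
    if match is None:
        return None
    city, state, country = match.group(2, 3, 4)
    return (city.strip() if city else None, state, country)


if __name__ == "__main__":
    samples = ["\nBILLINGS, MT\nUSA", "\nMT\nUSA", "\n\nUSA"]
    for sample in samples:
        print(split_location(sample))
```

City names containing characters outside `[\w \-]` (a period, for instance) would need a wider character class.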
If you are using the light version, you have to do it in three lines:
Description: City
Before: <td width="22%" nowrap="nowrap"><strong>
After: ,
Format: [^<>]+
Description: State
Before: /<td width="22%" nowrap="nowrap"><strong>[\r\n]([^<>\r\n ]+,)?/
After: /[\r\n]/
Format: [A-Z]{2}
Description: Country
Before:
After: </strong></td>
Format: (USA|canada)
Upvotes: 1
Reputation: 58568
A solution in TXR, a text-scraping and data-munging language:
@(collect)
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
@ (cases)
@city, @state
@ (or)
@ (bind (city state) ("n/a" "n/a"))
@ (or)
@state
@ (bind city "n/a")
@ (end)
@country</strong></td>
<td width="10%" align="right" nowrap="nowrap">
@(end)
@(output)
CITY STATE COUNTRY
@ (repeat)
@{city 10} @{state 11} @country
@ (end)
@(end)
The file city.html contains the three cases catenated together. Run:
$ txr city.txr city.html
CITY STATE COUNTRY
BILLINGS   MT          USA
n/a        MT          USA
n/a        n/a         USA
Another example of TXR HTML scraping: Extract text from HTML Table
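For readers without TXR, the same line-oriented idea (try "city, state", then "state only", then "nothing") can be roughly sketched in Python. This is an illustration under assumptions, not a translation of the TXR semantics: `parse_cells` and `format_table` are hypothetical names, the cell is located with a regex on the exact `<td width="22%" ...>` markup from the question, and blank lines inside a cell are simply ignored.

```python
import re

# Locates each location cell; assumes the markup is exactly as in the question.
CELL = re.compile(
    r'<td width="22%" nowrap="nowrap"><strong>(.*?)</strong></td>',
    re.DOTALL,
)


def parse_cells(html, missing="n/a"):
    """Return a list of (city, state, country) tuples, one per cell."""
    rows = []
    for body in CELL.findall(html):
        # Dropping blank lines covers the "extra line break" case.
        lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
        if not lines:
            continue
        country = lines[-1]
        city = state = missing
        if len(lines) > 1:
            head = lines[0]
            if "," in head:
                # "CITY, ST" line
                city, state = (part.strip() for part in head.split(",", 1))
            else:
                # state-only line
                state = head
        rows.append((city, state, country))
    return rows


def format_table(rows):
    """Left-justify city to 10 and state to 11 columns, like the output spec."""
    out = [f"{'CITY':<10} {'STATE':<11} COUNTRY"]
    out += [f"{city:<10} {state:<11} {country}" for city, state, country in rows]
    return "\n".join(out)
```

Calling `print(format_table(parse_cells(open("city.html").read())))` would print one row per cell, with "n/a" standing in for any missing city or state.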
Upvotes: 0