How to parse HTML file in .TXT format (un-tabbed) in Python?

Question

I have encountered a problem in my programming that has me stumped.

I'm trying to access data stored in a large number of old HTML files that were saved as plain text. However, when the files were saved, the HTML lost its indentation (tabs, hierarchy, whatever you wish to call it). An example can be found below.

......


Net sales
$ 123,897

$ 122,136

$ 372,586

$ 360,611



Membership and other income
997

1,043

3,026

3,465



Total revenues
124,894

123,179

375,612

364,076

I would typically use Beautiful Soup here and parse the data that way, but I haven't found a good workflow, since technically there is no hierarchy here: I can't tell BS to look within anything other than the document itself, which is huge and might be far too time-consuming to search (see next statement).

I also need a thorough solution rather than a quick fix, because I have hundreds, if not thousands, of these same HTML-to-text files to parse.

So my question is: if I want to return, from every file, the first number for "Membership and other income" (997 in this case), how could I go about doing that?

Two sample files can be found here:

(https://www.sec.gov/Archives/edgar/data/1800/0001104659-18-065076.txt) (https://www.sec.gov/Archives/edgar/data/1084869/0001437749-18-020205.txt)

EDIT - 4/16

Thanks for the replies everyone! I've written some code that returns the tags I'm looking for.

import requests
from bs4 import BeautifulSoup

data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')

# load the data
soup = BeautifulSoup(data.text, 'html.parser')

# get the data
for tr in soup.find_all('tr', {'class':['rou','ro','re','reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)

The problem is that there are a TON of results, and most contain nothing of use. Is there a way to filter based on these tags' grandparent? I've tried the same approach as above using head, title, body, etc., but I can't quite get BS to identify the FILENAME.


XML
14
**R2.htm**
IDEA: XBRL DOCUMENT




.....removed for brevity


.....removed for brevity
 

.....removed for brevity
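One possible way to narrow the results is to cut the file into its embedded documents before handing anything to Beautiful Soup, then parse only the one whose FILENAME you want. The sketch below assumes the usual EDGAR full-submission layout, in which each embedded file is wrapped in `<DOCUMENT>` ... `</DOCUMENT>`, carries an unclosed `<FILENAME>` line, and keeps its HTML between `<TEXT>` and `</TEXT>`. The helper name `payloads_by_filename` and the miniature `raw` string are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Assumed EDGAR layout: <DOCUMENT> ... <FILENAME>name ... <TEXT>html</TEXT> ... </DOCUMENT>
DOCUMENT = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
FILENAME = re.compile(r"<FILENAME>([^\r\n<]+)", re.IGNORECASE)
TEXT = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)

def payloads_by_filename(raw):
    """Map each embedded FILENAME to the markup found inside its <TEXT> block."""
    payloads = {}
    for section in DOCUMENT.finditer(raw):
        body = section.group(1)
        name = FILENAME.search(body)
        text = TEXT.search(body)
        if name and text:
            payloads[name.group(1).strip()] = text.group(1)
    return payloads

# Made-up miniature submission standing in for data.text from the request above.
raw = (
    "<DOCUMENT>\n<TYPE>XML\n<SEQUENCE>14\n<FILENAME>R2.htm\n"
    "<DESCRIPTION>IDEA: XBRL DOCUMENT\n<TEXT>\n"
    '<html><body><table><tr class="ro"><td>Net sales</td><td>123,897</td></tr>'
    "</table></body></html>\n</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>XML\n<SEQUENCE>15\n<FILENAME>R3.htm\n<TEXT>\n"
    "<html><body><p>other</p></body></html>\n</TEXT>\n</DOCUMENT>\n"
)

payloads = payloads_by_filename(raw)

# Parse only the document of interest, then reuse the same row filter as before.
soup = BeautifulSoup(payloads["R2.htm"], "html.parser")
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr", {"class": ["rou", "ro", "re", "reu"]})
]
print(rows)
```

Splitting on the raw text first avoids relying on how the HTML parser nests the unclosed header tags, and it keeps each parse small.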

James Lane · Accepted Answer

Just so you are aware, HTML does not care about indentation. If you really wanted to, you could put it all on the same line with no spaces in between. An HTML parser only looks at the structure of the tags.
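As a quick check of that claim, here is a minimal sketch that parses the same made-up table twice, once indented and once flattened onto a single line, and compares the cells Beautiful Soup finds. The markup and the helper name `cell_texts` are illustrative only.

```python
from bs4 import BeautifulSoup

indented = """
<table>
  <tr>
    <td>Membership and other income</td>
    <td>997</td>
  </tr>
</table>
"""

# The same markup with every newline and all leading whitespace removed.
flattened = "".join(line.strip() for line in indented.splitlines())

def cell_texts(markup):
    """Return the stripped text of every <td> in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]

# Both versions yield the same cells, so the lost indentation costs nothing.
assert cell_texts(indented) == cell_texts(flattened)
print(cell_texts(flattened))
```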

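from bs4 import BeautifulSoup

# html_doc is the file's contents as a string; its indentation does not matter
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all is a method, so call it with a tag name ('td' is only an example)
first_match = soup.find_all('td')[0]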
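To get the first number for a row such as "Membership and other income", one approach is to walk the table rows, match on the text of the first non-empty cell, and return the first later cell that contains a number. This is only a sketch: the function name `first_value_for_label`, the number pattern, and the sample markup are assumptions, and real filings may need looser label matching.

```python
import re
from bs4 import BeautifulSoup

# Digits with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def first_value_for_label(markup, label):
    """Return the first numeric cell of the first row labelled `label`, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    wanted = label.casefold()
    for tr in soup.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["td", "th"])]
        cells = [c for c in cells if c]  # drop empty spacer cells
        if not cells or cells[0].casefold() != wanted:
            continue
        for cell in cells[1:]:
            match = NUMBER.search(cell)
            if match:  # cells holding only "$" are skipped
                return match.group()
    return None

# Made-up, unindented markup shaped like the table in the question.
sample = (
    "<table>"
    "<tr><td>Net sales</td><td>$</td><td>123,897</td><td>$</td><td>122,136</td></tr>"
    "<tr><td>Membership and other income</td><td></td><td>997</td><td></td><td>1,043</td></tr>"
    "</table>"
)

print(first_value_for_label(sample, "Membership and other Income"))  # 997
```

The comparison is case-insensitive, so "Income" and "income" both match. To cover many files, the same function can be called in a loop over each file's text.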
