Reputation: 912
I want to extract some numbers from text files. A typical line looks like 074 N00AA00 623938,
and I need to extract the number 623938. I'm using the code below, but it returns nothing:
url = 'https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_74s = soup.find_all(r'^(074\s[n|N].*\s)(\d*)*$')
I would appreciate your thoughts on the best way to extract the numbers.
Upvotes: 0
Views: 644
Reputation: 195573
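Your call returns nothing because find_all() matches tag names, not text, so a regex passed
as a plain string is never applied to the document's contents.
To get a correct response from the server, set the User-Agent
HTTP header first.
Then select the text of the <TEXT>
tag from the soup and apply the regex to it with re.findall: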
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
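# [nN] matches either case; inside a character class "|" is a literal pipe, not "or".
# re.M makes ^ and $ match at the start and end of every line.
all_74s = re.findall(
    r"^074\s+[nN].*?\s+(\d+)$", soup.find("text").text, flags=re.M
)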
print(all_74s)
Prints:
['623938']
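To check the regex without hitting the server, here is a minimal offline sketch. The sample string is an assumption: only the 074 N00AA00 623938 line comes from the question, and the other lines are made up to show what gets skipped. It uses [nN] for the character class (inside brackets, | is a literal pipe rather than alternation) and \S* for the rest of the sub-item token, which is a slightly stricter variant and not necessarily what every filing needs.

```python
import re

# Hypothetical sample: only the "074 N00AA00 623938" line is from the question;
# the surrounding lines are invented for illustration.
sample = """\
073 C00AA00 0.0000
074 A00AA00 0
074 N00AA00 623938
075 A00AA00 0
"""

# ^074       line starts with the item code 074 (re.M makes ^ and $ work per line)
# \s+[nN]    whitespace, then a sub-item token starting with n or N
# \S*        the rest of that token
# \s+(\d+)$  whitespace, then the digits that end the line (captured)
pattern = re.compile(r"^074\s+[nN]\S*\s+(\d+)$", flags=re.M)

values = pattern.findall(sample)
print(values)
```

findall returns strings, so convert with int() if you need numeric values, e.g. [int(v) for v in values].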
Upvotes: 1