Extracting text using BeautifulSoup and Regex

Question

I want to extract some numbers from text files. A typical line looks like 074 N00AA00 623938, and I need to extract the number 623938. I'm using the code below, but it returns nothing:

url = 'https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_74s = soup.find_all(r'^(074\s[n|N].*\s)(\d*)*$')

I would appreciate your thoughts on the best way to extract the numbers.

Andrej Kesely · Accepted Answer

To get a correct response from the server, set the User-Agent HTTP header first.

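Then select the text of the <text> tag from the soup and apply the regex to that string with re.findall. Note that soup.find_all treats a plain string argument as a tag name, not as a pattern to search the text with, which is why your call returns nothing: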

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
# [nN] matches either case; inside a character class, "|" would be a literal pipe
all_74s = re.findall(
    r"^074\s+[nN].*?\s+(\d+)$", soup.find("text").text, flags=re.M
)
print(all_74s)

Prints:

['623938']
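
The regex step can also be checked on its own, without a network request. Below is a minimal sketch, assuming input lines shaped like the one in the question; the sample text and the helper name extract_74n are made up for illustration and are not taken from the actual filing.

```python
import re

# Hypothetical sample: only the "074 N..." line shape comes from the question;
# the other lines are invented for illustration.
sample = """\
074 M000000        0
074 N00AA00   623938
074 O00AA00        0
075 A00AA00        0
"""

# With re.M, ^ and $ match at the start and end of each line.
# [nN] accepts either case; .*? lazily skips the rest of the item code.
pattern = re.compile(r"^074\s+[nN].*?\s+(\d+)$", flags=re.M)


def extract_74n(text):
    """Return the trailing number of every '074 N...' line as a list of strings."""
    return pattern.findall(text)


print(extract_74n(sample))
```

Because the pattern has a single capture group, re.findall returns only the captured digits rather than the whole matching line.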
