Reputation: 336
So, the basic idea is to make a GET request to a list of URLs and extract the text from each page source by stripping HTML tags and scripts with BeautifulSoup, on Python 2.7.
The problem: the parser function keeps accumulating memory on every request, and the process size grows gradually.
def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source), 'html.parser')
    # soup = BeautifulSoup(page_source, "lxml")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()  # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    # print text
    return text
The memory leaks even when parsing a local text file. For example:
# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 300 MB
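For reference, a minimal sketch of how that per-request growth could be measured, assuming the third-party psutil package and placeholder urls/timeout values (rss is the resident set size of the current process):

import os
import psutil   # assumption: third-party package, not in the question
import requests

process = psutil.Process(os.getpid())

for i, url in enumerate(urls):  # urls and timeout stand in for the real values
    response = requests.get(url, timeout=timeout)
    parsed_string_from_html_source = get_text_from_page_source(response.content)
    rss_mb = process.memory_info().rss / (1024.0 * 1024.0)
    print("request %d: %.1f MB" % (i + 1, rss_mb))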
Upvotes: 7
Views: 2660
Reputation: 11871
You can try calling the garbage collector explicitly after each request:
import gc
response.close()
response = None
gc.collect()
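For instance, folded into the request loop from the question (a sketch; urls and timeout stand in for your actual values, and gc.collect() returns the number of unreachable objects it found):

import gc
import requests

for url in urls:  # placeholder for your list of URLs
    response = requests.get(url, timeout=timeout)
    parsed = get_text_from_page_source(response.content)
    # ... use parsed ...
    response.close()  # release the underlying connection
    response = None   # drop the last reference to the response
    gc.collect()      # force a full garbage-collection pass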
Also, this might help you: Python high memory usage with BeautifulSoup
Upvotes: 2
Reputation: 346
You could try running soup.decompose() right before the end of your get_text_from_page_source function, to destroy the tree.
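A minimal sketch of where that call could sit, assuming the response content is passed in directly as a string; by the time decompose() runs, get_text() has already copied the text out of the tree, so the returned string is unaffected:

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(page_source, 'html.parser')
    for script in soup(["script", "style"]):
        script.decompose()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the whole tree so its nodes can be freed
    return text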
And in case you're opening a text file instead of feeding the requests content in directly, as can be seen here:
soup = BeautifulSoup(open(page_source),'html.parser')
Remember to close it when you are done. To keep it short, you could change that line to:
with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(), 'html.parser')
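That way the with block closes the file handle even if parsing raises, so open handles don't pile up across requests.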
Upvotes: 0