Reputation: 1395
I am a new programmer, and we are working on a graduate English project where we are trying to parse a gigantic dictionary text file (500 MB). The file is set up with HTML-like tags. I have 179 author tags, e.g. "[A>]Shakes.[/A]" for Shakespeare, and what I need to do is find each occurrence of every tag and then write that tag and what follows on the line until I reach "[/W]".
My problem is that readlines() gives me a memory error (I assume because the file is so large), and while I have been able to find matches, I only ever get one and cannot get it to look past the first match. Any help that anyone could give would be greatly appreciated.
There are no newlines in the text file, which I think is what was causing the problem. This problem has been solved; I thought I would include the code that worked:
with open('/Users/Desktop/Poetrylist.txt', 'w') as output_file:
    with open('/Users/Desktop/2e.txt', 'r') as open_file:
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            # find() returns -1 when there are no more occurrences.
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break
            start_position += 3  # skip past the '<A>' tag itself
            end_position = the_whole_file.find('</W>', start_position)
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4  # resume the search after '</W>'
Upvotes: 1
Views: 894
Reputation: 2444
I don't know regular expressions well, but you can solve this problem without them, using the string method find() and string slicing.
with open('yourFile.txt', 'r') as open_file, open('output_file', 'w') as output_file:
    for each_line in open_file:
        start_position = each_line.find('[A>]')
        if start_position != -1:  # find() returns -1 when there is no match
            start_position += len('[A>]')  # skip past the opening tag
            end_position = each_line.find('[/W]', start_position)
            answer = each_line[start_position:end_position] + '\n'
            output_file.write(answer)
Let me explain what is happening: find() returns the index of the first occurrence of a substring, or -1 if it is not found. We skip past the opening [A>] tag, find the closing [/W] tag starting from that position, and write out the slice of the line between the two, followed by a newline.
Upvotes: 2
Reputation: 27565
Please test the following code:
import re

regx = re.compile('<A>.+?</A>.*?<W>.*?</W>')

with open('/Users/Desktop/2e.txt', 'r') as open_file, \
     open('/Users/Desktop/Poetrylist.txt', 'w') as output_file:
    remain = ''
    while True:
        chunk = open_file.read(65536)  # 65536 == 16 ** 4
        if not chunk:
            break
        buffer = remain + chunk
        last_end = 0
        for mat in regx.finditer(buffer):
            output_file.write(mat.group() + '\n')
            last_end = mat.end()
        # Carry over the unmatched tail in case a record spans two chunks.
        remain = buffer[last_end:]
I couldn't test it because I have no file to test on.
Upvotes: 0
Reputation: 639
You're getting a memory error with readlines() because, given the file size, you're likely reading in more data than your memory can reasonably handle. Since this file is an XML file, you should be able to read through it with iterparse() (from xml.etree.ElementTree), which parses the XML lazily without taking up excess memory. Here's some code I used to parse Wikipedia dumps:
import xml.etree.ElementTree as etree

# Setup (assumed, not in the original snippet): iterparse() yields
# (event, element) pairs as the file is read.
namespace = '{http://www.mediawiki.org/xml/export-0.4/}'  # hypothetical; adjust for your dump
parser = etree.iterparse('dump.xml', events=('start', 'end'))
root = None
count = 0

for event, elem in parser:
    if event == 'start' and root is None:
        root = elem
    elif event == 'end' and elem.tag == namespace + 'title':
        page_title = elem.text
        # This clears bits of the tree we no longer use.
        elem.clear()
    elif event == 'end' and elem.tag == namespace + 'text':
        page_text = elem.text
        # Clear bits of the tree we no longer use.
        elem.clear()
        # Now let's grab all of the outgoing links and store them in a list,
        # eliminating duplicates along the way.
        key_vals = []
        key_vals = list(set(key_vals))
        count += 1
        if count % 1000 == 0:
            print(str(count) + ' records processed.')
    elif event == 'end' and elem.tag == namespace + 'page':
        root.clear()
Here's roughly how it works:
We create a parser that progresses through the document incrementally.
As we loop through each element of the document, we look for elements with the tag you are interested in (in your example it was 'A').
We store that data and process it. We clear any element we are done processing, because as we go through the document the tree stays in memory, so we want to remove anything we no longer need. A sketch adapted to your file follows below.
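Here is a minimal sketch adapting the same idea to your file. It assumes the dictionary is well-formed XML with <W> records that each contain an <A> author element (tag names taken from your working code), so treat the file path and tags as placeholders:
import xml.etree.ElementTree as ET

# A sketch, not tested against your data: this only works if the file
# parses as valid XML.
for event, elem in ET.iterparse('/Users/Desktop/2e.txt', events=('end',)):
    if elem.tag == 'W':
        author = elem.find('A')
        if author is not None:
            print(author.text)
        elem.clear()  # free the subtree once we are done with it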
Upvotes: 1
Reputation: 1845
After opening the file, iterate through the lines like this:
with open('huge_file.txt', 'r') as input_file:
    for input_line in input_file:
        # Process the line however you need; consider learning some basic regular expressions.
        pass
This will allow you to easily process the file by reading it in line by line as needed, rather than loading it all into memory at once.
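For example, a minimal sketch combining line-by-line reading with a regular expression; the [A>] and [/W] tags are taken from the question, the file name is a placeholder, and this only works if a record never spans lines:
import re

# Hypothetical tag names from the question; adjust to match your file.
pattern = re.compile(r'\[A>\].*?\[/W\]')

with open('huge_file.txt', 'r') as input_file:
    for input_line in input_file:
        for match in pattern.finditer(input_line):
            print(match.group())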
Upvotes: 3
Reputation: 20714
Instead of parsing the file by hand, why not parse it as XML to get better control of the data? You mentioned that the data is HTML-like, so I assume it is parseable as an XML document.
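For instance, here is a minimal sketch using the standard library's xml.etree.ElementTree, assuming the file is (or can be cleaned up into) well-formed XML; the file name and the 'A' tag are placeholders taken from the question:
import xml.etree.ElementTree as ET

# This only works if the file is valid XML; for a 500 MB file,
# ET.iterparse() would stream instead of loading everything at once.
tree = ET.parse('2e.txt')
for author in tree.getroot().iter('A'):
    print(author.text)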
Upvotes: 0
Reputation: 28292
You should look into a tool called grep. You can give it a pattern to match and a file, and it will print out occurrences in the file, with line numbers if you want. It is very useful and can easily be driven from Python.
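For example, a minimal sketch that calls grep from Python via the subprocess module; the pattern and file path are placeholders based on the question, and this assumes grep is available on the system:
import subprocess

# -n prefixes each matching line with its line number.
result = subprocess.run(
    ['grep', '-n', r'\[A>\]', '/Users/Desktop/2e.txt'],
    capture_output=True, text=True
)
print(result.stdout)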
Upvotes: 0