Reputation: 71
So I'm trying to make my program print out the index of each word and punctuation mark as it occurs in a text file. I have done that part, but the problem is when I try to recreate the original text, with punctuation, using those index positions. Here is my code:
with open('newfiles.txt') as f:
    s = f.read()

import re

# Splitting the string into a list using a regex with a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]
print(matches)

d = {}
i = 1
list_with_positions = []
# Build the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i += 1
    list_with_positions.append(d[match])
print(list_with_positions)

file = open("newfiletwo.txt", "w")
file.write(''.join(str(e) for e in list_with_positions))
file.close()

file = open("newfilethree.txt", "w")
file.write(''.join(matches))
file.close()

word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
print(' '.join(sentence_seq))
As I said, the first part works fine, but then I get this error:
Traceback (most recent call last):
  File "E:\Python\Indexes.py", line 33, in <module>
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
  File "E:\Python\Indexes.py", line 33, in <listcomp>
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer
The error occurs when the program reaches the 'sentence_seq' comprehension towards the bottom of the code.

newfiles.txt is the original text file: a random article of more than one sentence, with punctuation.

list_with_positions is the list of the actual positions at which each word occurs within the original text.

matches is the list of distinct tokens: if words repeat in the file (which they do), matches should contain each different word only once.

Does anyone know why I get the error?
Upvotes: 6
Views: 20533
Reputation: 13175
The issue with your approach is using ''.join(), as this joins everything with no spaces. The immediate problem is that you then attempt to split() what is effectively one long run of digits with no spaces; what you get back is a single value with 100+ digits. So the int overflows with a gigantic number when you try to use it as an index. Even more of an issue: indices go into double digits and beyond, so how did you expect split() to deal with that when the numbers are joined without spaces?
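To see the failure concretely, here is a minimal sketch with a made-up index list:

```python
# Hypothetical index list, like the one the question's loop builds
positions = [1, 2, 3, 1, 10]

# Joining without spaces fuses the indices into one long digit run
joined = ''.join(str(e) for e in positions)
print(joined)            # 123110 - the boundary between 1 and 10 is gone

# A whitespace split finds nothing to split on, so one giant "index" comes back
tokens = joined.split()
print(tokens)            # ['123110']
```

With a real article, the fused number runs to hundreds of digits, which is what the int-into-index error is complaining about.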
Beyond that, you fail to treat punctuation properly. ' '.join() is equally invalid when trying to reconstruct a sentence, because your commas, full stops etc. get whitespace on either side.
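A tiny illustration with a hypothetical token list:

```python
# Word and punctuation tokens, as produced by the question's regex split
tokens = ['Hello', ',', 'world', '.']

# A plain space-join puts stray spaces around the punctuation
print(' '.join(tokens))  # Hello , world .
```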
I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from), but it still feels shaky to me. I dropped the regex; perhaps that was needed. I'm not immediately aware of a library for doing this kind of thing, but almost certainly there must be a better way.
import string

punctuation_list = set(string.punctuation)  # Punctuation has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    indices = infile1.read().split()
    words = infile2.read().split()
reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in word_base])
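For comparison, here is a sketch of the asker's original round-trip with both joins fixed: the indices are written space-separated so they can be split back apart, and punctuation tokens are re-attached without a leading space. The sample text is made up, and the intermediate files are skipped to keep it self-contained:

```python
import re
import string

text = "Hello, world. Hello again."

# Tokenise into words and punctuation, as in the question's regex split
tokens = [t.strip() for t in re.split("([a-zA-Z]+)", text) if t.strip()]

# Assign each distinct token a 1-based position, as in the question's loop
positions, table = [], {}
for tok in tokens:
    table.setdefault(tok, len(table) + 1)
    positions.append(table[tok])

# Write the indices SPACE-separated, so split() can recover them
encoded = ' '.join(str(p) for p in positions)

# Rebuild: look tokens up by position; punctuation gets no leading space
word_base = [None] + list(table)
rebuilt = ''
for i in encoded.split():
    tok = word_base[int(i)]
    rebuilt += tok if tok in string.punctuation else ' ' + tok
print(rebuilt.strip())  # Hello, world. Hello again.
```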
Upvotes: 1