The World In 5

Reputation: 71

IndexError: cannot fit 'int' into an index-sized integer

I'm trying to make my program print the index of each word and punctuation mark as it occurs in a text file. That part works. The problem comes when I try to recreate the original text, including punctuation, from those index positions. Here is my code:

with open('newfiles.txt') as f:
    s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {} 
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i+=1
    list_with_positions.append(d[match])

print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]

print(' '.join(sentence_seq))

As I said, the first part works fine, but then I get this error:

Traceback (most recent call last):
    File "E:\Python\Indexes.py", line 33, in <module>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
    File "E:\Python\Indexes.py", line 33, in <listcomp>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer

The error occurs when the program builds sentence_seq, towards the bottom of the code.

newfiles.txt is the original text file: a random article with more than one sentence, including punctuation.

list_with_positions is the list of positions at which each word occurs in the original text.

matches should hold only the distinct words: if a word repeats in the file (which happens), it should appear in matches only once.

Does anyone know why I get the error?

Upvotes: 6

Views: 20533

Answers (1)

roganjosh

Reputation: 13175

The issue with your approach is using ''.join(), as this joins everything with no spaces. The immediate problem is that you then call split() on what is effectively one long run of digits with no whitespace, so you get back a single value with 100+ digits. Converted with int(), that is a gigantic number, far too large to be used as a list index, which is what the error message is telling you. A further problem is that indices can run into double digits and beyond; how did you expect split() to recover them when the numbers are joined without spaces?
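To make that concrete, here is a small sketch of the failure and of the one-character fix. The list `positions` and the tiny `word_base` are made-up stand-ins for your data, not taken from your file.

```python
positions = [1, 2, 3, 10, 11, 1, 2]   # hypothetical index list

# Joining with no separator fuses every number into one run of digits,
# so split() has nothing to split on and returns a single token.
fused = ''.join(str(p) for p in positions)
fused_tokens = fused.split()

# Joining with a space keeps the boundaries, so split() can undo it.
spaced = ' '.join(str(p) for p in positions)
restored = [int(tok) for tok in spaced.split()]

# With real data the fused token has 100+ digits. Indexing a list with
# an integer that large raises the IndexError from the traceback.
word_base = [None, 'a', 'b']
huge = int('1' * 120)
try:
    word_base[huge]
except IndexError as exc:
    message = str(exc)

print(fused_tokens)
print(restored == positions)
print(message)
```

So in your code, changing ''.join(str(e) ...) to ' '.join(str(e) ...) when writing newfiletwo.txt should be enough to make the error go away, although the punctuation problem described next remains.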

Beyond that, you don't treat punctuation properly. ' '.join() is equally unsuitable for reconstructing a sentence, because commas, full stops etc. end up with whitespace on either side.
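One possible way around this, as a rough sketch, is to put a space only in front of tokens that are not punctuation. The helper name `join_tokens` is invented for this example, and the rule is deliberately simple: it assumes every punctuation token attaches to the word before it, which is wrong for opening quotes and brackets.

```python
import string

PUNCTUATION = set(string.punctuation)

def join_tokens(tokens):
    """Join tokens with single spaces, attaching punctuation to the previous token."""
    parts = []
    for tok in tokens:
        # Add a separating space before words, but never before punctuation
        # and never at the very start of the output.
        if parts and tok not in PUNCTUATION:
            parts.append(' ')
        parts.append(tok)
    return ''.join(parts)

tokens = ['Hello', ',', 'world', '!', 'Bye', '.']
print(' '.join(tokens))     # naive join: spaces on both sides of punctuation
print(join_tokens(tokens))  # punctuation attached to the preceding word
```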

I tried my best to stick with your current code/approach (I don't think there's huge value in changing the entire approach when trying to understand where an issue comes from), but it still feels shaky to me. I dropped the regex, though perhaps it was needed. I'm not immediately aware of a library for this kind of thing, but there is almost certainly a better way.

import string

punctuation_list = set(string.punctuation) # Has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

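reconstructed = ''
# newfiletwo.txt holds the words and newfilethree.txt holds the indices
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    words = infile1.read().split()
    indices = infile2.read().split()
    # Look each word up by its stored index; no space goes before punctuation
    reconstructed = ''.join(
        words[int(i)] if words[int(i)] in punctuation_list else ' ' + words[int(i)]
        for i in indices
    ).strip()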
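If you do want to keep the regex and store only the distinct words, the whole round trip might look something like the sketch below. Everything in it is an assumption rather than a drop-in replacement: the function names, the sample sentence, the temporary file names and the 1-based positions (chosen to match your `i = 1`). The tokenizer treats every non-letter, non-space character as its own token, so digits and opening quotes would not be spaced correctly.

```python
import re
import string
import tempfile
from pathlib import Path

PUNCTUATION = set(string.punctuation)

def tokenize(text):
    # Words are runs of letters; any other non-space character is its own token.
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)

def encode(tokens):
    """Return (distinct tokens in first-seen order, 1-based position of each token)."""
    unique, first_seen, positions = [], {}, []
    for tok in tokens:
        if tok not in first_seen:
            unique.append(tok)
            first_seen[tok] = len(unique)
        positions.append(first_seen[tok])
    return unique, positions

def decode(unique, positions):
    """Rebuild the text, putting a space before every token except punctuation."""
    parts = []
    for pos in positions:
        tok = unique[pos - 1]
        if parts and tok not in PUNCTUATION:
            parts.append(' ')
        parts.append(tok)
    return ''.join(parts)

text = "The cat sat. The cat ran, then the dog sat."
unique, positions = encode(tokenize(text))

with tempfile.TemporaryDirectory() as tmp:
    words_path = Path(tmp) / 'words.txt'
    positions_path = Path(tmp) / 'positions.txt'
    # Space-separated on disk, so split() can recover every entry.
    words_path.write_text(' '.join(unique))
    positions_path.write_text(' '.join(str(p) for p in positions))
    unique_back = words_path.read_text().split()
    positions_back = [int(p) for p in positions_path.read_text().split()]

rebuilt = decode(unique_back, positions_back)
print(rebuilt == text)
```

Line breaks and runs of spaces in the original are not preserved by this approach, since only the tokens themselves are stored.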

Upvotes: 1
