Q-ximi

Reputation: 961

Why the "List index out of range" error?

I have a list of files that I want to read through, printing out some information from each. It keeps giving me the error "list index out of range", and I'm not sure what is wrong. For line2, if I add matches[:10], it works for the first 10 files, but I need it to process all files. I checked some old posts but still cannot get my code to work.

re.findall worked before, when I wrote this code in pieces. I'm not sure why it is not working anymore. Thanks.

import re, os
topdir = r'E:\Grad\LIS\LIS590 Text mining\Part1\Part1' # Raw string, so the backslashes in the Windows path are not treated as escape sequences.
matches = []
for root, dirnames, filenames in os.walk(topdir):
    for filename in filenames:
        if filename.endswith(('.txt','.pdf')):
            matches.append(os.path.join(root, filename))

capturedorgs = []
capturedfiles = []
capturedabstracts = []
orgAwards={}
for filepath in matches:
    with open (filepath,'rt') as mytext:
        mytext=mytext.read()

        matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
        capturedorgs.append(matchOrg)

        # code to capture files
        matchFile=re.findall(r'File\s+\:\s+(\w\d{7})',mytext)[0]
        capturedfiles.append(matchFile)

        # code to capture abstracts
        matchAbs=re.findall(r'Abstract\s+\:\s+(\w.+)',mytext)[0]
        capturedabstracts.append(matchAbs)

        # total awarded money
        matchAmt=re.findall(r'Total\s+Amt\.\s+\:\s+\$(\d+)',mytext)[0]

        if matchOrg not in orgAwards:
            orgAwards[matchOrg]=[]
        orgAwards[matchOrg].append(int(matchAmt))

for each in capturedorgs:
    print(each,"\n")
for each in capturedfiles:
    print(each,"\n")
for each in capturedabstracts:
    print (each,"\n")

# add code to print what is in your other two lists
from collections import Counter
countOrg=Counter(capturedorgs)
print (countOrg)

for each in orgAwards:
    print(each,sum(orgAwards[each]))

The error message:

Traceback (most recent call last):
  File "C:\Python32\Assignment1.py", line 17, in <module>
    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
IndexError: list index out of range

Upvotes: 4

Views: 13405

Answers (2)

Burhan Khalid

Reputation: 174624

If findall doesn't find a match, it will return an empty list []; your error occurs when you try to fetch the first item from this empty list, resulting in your exception:

>>> import re
>>> i = 'hello'
>>> re.findall('abc', i)
[]
>>> re.findall('abc', i)[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

To make sure your code doesn't stop when no match is found, you need to catch the exception that is raised:

try:
    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
    capturedorgs.append(matchOrg)
except IndexError:
    print('No organization match for {}'.format(filepath))

You will have to do this for each re.findall statement.
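Rather than repeating the try/except around every call, the check could be wrapped in a small helper. The following is only a sketch: first_match is a hypothetical name, not part of the re module, and it uses re.search (which returns None when nothing matches) instead of indexing the result of re.findall. The sample text is made up to resemble the fields in the question.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first captured group of pattern in text, or default if there is no match.

    Hypothetical helper for illustration; assumes the pattern has one capture group.
    """
    found = re.search(pattern, text)
    return found.group(1) if found else default

# Made-up sample that resembles the fields from the question.
sample = "NSF Org     : DMS\nTotal Amt.  : $12345"

org = first_match(r'NSF\s+Org\s+:\s+(\w+)', sample)
amt = first_match(r'Total\s+Amt\.\s+:\s+\$(\d+)', sample)
missing = first_match(r'File\s+:\s+(\w\d{7})', sample)  # no "File :" line in sample

print(org, amt, missing)
```

In the loop you would then test each result against None before appending it, so a file that lacks one field is skipped or reported instead of raising IndexError.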

Upvotes: 4

Corley Brigman

Reputation: 12381

The problem is here:

matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]

Apparently, at least one of your files does not contain this pattern at all, so findall returns an empty list. When you dereference item [0], it's not there.

You will need to deal with this case.

One way is simply not to include it if it's not found:

for filepath in matches:
    with open (filepath,'rt') as mytext:
        mytext=mytext.read()

        matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)
        if len(matchOrg) > 0:
            capturedorgs.append(matchOrg[0])

Also, you might want to use extend(matchOrg) if there is a chance of more than one in the file, and you want to capture all of them.
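To illustrate the difference between the two approaches, here is a small sketch on a made-up string containing two organization lines (the sample text and the variable names first_only and all_orgs are assumptions for the example):

```python
import re

# Made-up text with two "NSF Org" lines, to show a file with more than one match.
text = "NSF Org : DMS\nNSF Org : CHE\n"
found = re.findall(r'NSF\s+Org\s+:\s+(\w+)', text)

# Keep only the first match, and only if there is one.
first_only = []
if found:
    first_only.append(found[0])

# Keep every match; extend() with an empty list adds nothing, so no check is needed.
all_orgs = []
all_orgs.extend(found)

print(first_only)
print(all_orgs)
```

Note that extend does not need the length check at all, because extending with an empty list leaves the target list unchanged.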

Upvotes: 2
