Reputation: 319
I have about 20,000 documents in subdirectories, and I would like to read them all and append them into one list of lists. This is my code so far:
import os
import numpy as np

topics = os.listdir(my_directory)
df = []
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        f = open(my_directory + '/' + topic + '/' + file, 'r', encoding='latin1')
        data = f.read().replace('\n', ' ')
        print(data)
        f.close()
        df = np.append(df, data)
However, this is inefficient, and it takes a long time to read the files and append them to the df list. My expected output is:
df = [[doc1], [doc2], [doc3], [doc4], ..., [doc20000]]
I ran the above code and it took more than 6 hours and was still not finished (it had probably done about half of the documents). How can I change the code to make it faster?
Upvotes: 1
Views: 1670
Reputation: 11817
I misread the line df = np.append(df, data) and assumed you were appending to a DataFrame, not to a numpy array. So my comment is somewhat off target, but I am leaving it for others who may misread it the same way, or who have a similar problem with pandas' DataFrame append.
It also looks like answering the question as asked may not solve your actual problem. Have you measured the performance of your two most important calls?
files = os.listdir(my_directory + '/' + topic)
df = np.append(df, data)
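If you have not, here is a quick sketch of how to check, dropped into your existing loop and reusing its names (my_directory, topic, data, df):

import time

t0 = time.perf_counter()
files = os.listdir(my_directory + '/' + topic)
print('listdir:', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
df = np.append(df, data)
print('np.append:', time.perf_counter() - t0, 'seconds')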
The way the code was formatted in the post made me think there is a bug: df = np.append(df, data) looked like it was outside the inner for loop over the files, so only the last data would end up appended. In case that was just a formatting problem in the post and you really do append all 20k files one by one, then this may be the problem: appending to a DataFrame row by row is slow, and the same goes for np.append, which copies the whole array on every call.
As usual, slow performance can be tackled by throwing more memory at the problem. If you have enough memory to load all of the files beforehand and only then insert them into a DataFrame, this could prove to be faster. The key is not to touch pandas at all until you have loaded all the data; only then should you build the DataFrame, using from_records or one of its other factory methods.
A nice SO question I found that has a little more discussion: Improve Row Append Performance On Pandas DataFrames
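For illustration, a minimal sketch of that idea, assuming the same my_directory/topic layout and latin1 encoding as in the question, with a single (hypothetical) text column:

import os
import pandas as pd

# Load every document into a plain Python list first; pandas is not involved yet.
docs = []
for topic in os.listdir(my_directory):
    topic_dir = os.path.join(my_directory, topic)
    for name in os.listdir(topic_dir):
        with open(os.path.join(topic_dir, name), encoding='latin1') as f:
            docs.append(f.read().replace('\n', ' '))

# Build the DataFrame once, at the end, from the fully loaded records.
df = pd.DataFrame.from_records([(doc,) for doc in docs], columns=['text'])

A plain pd.DataFrame({'text': docs}) would work just as well here; the point is that pandas only sees the data once, after everything has been read.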
Upvotes: 0
Reputation: 2137
Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop. That lets you read a big file lazily, piece by piece, instead of pulling the whole thing into memory at once.
def read_in_chunks(file, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

with open('big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
import pandas as pd

class Reader(object):
    """Minimal file-like wrapper so pandas can read from the generator."""
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''

with open('big_file.dat') as f:
    chunks = pd.read_csv(Reader(read_in_chunks(f)), chunksize=10000)
    df = pd.concat(chunks)
df.to_csv("output.csv", index=False)
Upvotes: 0
Reputation: 77407
There is only so much you can do to speed up disk access. You can use threads to overlap some of the file reads with the latin1 decoding and the newline replacement, but realistically it won't make a huge difference.
import os
import multiprocessing.pool

import numpy as np

MEG = 2**20

# Build the full list of file paths first.
filelist = []
topics = os.listdir(my_directory)
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        filelist.append(my_directory + '/' + topic + '/' + file)

def worker(filename):
    # Read and decode one file; the blocking disk read can overlap with other threads.
    with open(filename, encoding='latin1', buffering=1*MEG) as f:
        data = f.read().replace('\n', ' ')
        #print(data)
        return data

# A thread pool overlaps the blocking reads with the decode/replace work.
with multiprocessing.pool.ThreadPool() as pool:
    datalist = pool.map(worker, filelist, chunksize=1)

df = np.array(datalist)
Upvotes: 1