Reputation: 319
I have about 20,000 documents in subdirectories, and I would like to read them all and append them into one list of lists. This is my code so far:
import os
import numpy as np

topics = os.listdir(my_directory)
df = []
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        f = open(my_directory + '/' + topic + '/' + file, 'r', encoding='latin1')
        data = f.read().replace('\n', ' ')
        print(data)
        f.close()
        df = np.append(df, data)
However, this is inefficient, and it takes a long time to read the files and append them to the df list. My expected output is:
df = [[doc1], [doc2], [doc3], [doc4], ..., [doc20000]]
I ran the above code and it took more than 6 hours and was still not finished (it had probably done about half of the documents). How can I change the code to make it faster?
Upvotes: 1
Views: 1670
Reputation: 11817
I misread the line df = np.append(df, data) and assumed you were appending to a DataFrame, not to a numpy array. So my comment is somewhat off target, but I am leaving it for others who may misread it the same way, or who have a similar problem with pandas' DataFrame append.
It also looks like answering the question as asked may not solve your actual problem. Have you measured the performance of your two most important calls?
files = os.listdir(my_directory + '/' + topic)
df = np.append(df, data)
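If you have not, here is a quick sketch of how to check, dropped into your existing loop and reusing its names (my_directory, topic, data, df):

import time

t0 = time.perf_counter()
files = os.listdir(my_directory + '/' + topic)
print('listdir:', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
df = np.append(df, data)
print('np.append:', time.perf_counter() - t0, 'seconds')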
The way the code was formatted in the post made me think there is a bug: df = np.append(df, data) looked like it was outside the inner for loop over the files, so only the last data would end up appended. In case that was just a formatting problem in the post and you really do append all 20k files one by one, then this may be the problem: appending to a DataFrame row by row is slow, and the same goes for np.append, which copies the whole array on every call.
As usual, slow performance can be tackled by throwing more memory at the problem. If you have enough memory to load all of the files beforehand and only then insert them into a DataFrame, this could prove to be faster. The key is not to touch pandas at all until you have loaded all the data; only then should you build the DataFrame, using from_records or one of its other factory methods.
A nice SO question I found that has a little more discussion: Improve Row Append Performance On Pandas DataFrames
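For illustration, a minimal sketch of that idea, assuming the same my_directory/topic layout and latin1 encoding as in the question, with a single (hypothetical) text column:

import os
import pandas as pd

# Load every document into a plain Python list first; pandas is not involved yet.
docs = []
for topic in os.listdir(my_directory):
    topic_dir = os.path.join(my_directory, topic)
    for name in os.listdir(topic_dir):
        with open(os.path.join(topic_dir, name), encoding='latin1') as f:
            docs.append(f.read().replace('\n', ' '))

# Build the DataFrame once, at the end, from the fully loaded records.
df = pd.DataFrame.from_records([(doc,) for doc in docs], columns=['text'])

A plain pd.DataFrame({'text': docs}) would work just as well here; the point is that pandas only sees the data once, after everything has been read.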
Upvotes: 0
Reputation: 2137
Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop. That lets you read a big file lazily, piece by piece, instead of pulling the whole thing into memory at once.
def read_in_chunks(file, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

with open('big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
import pandas as pd

class Reader(object):
    """Minimal file-like wrapper so pandas can read from the generator."""
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''

with open('big_file.dat') as f:
    chunks = pd.read_csv(Reader(read_in_chunks(f)), chunksize=10000)
    df = pd.concat(chunks)
df.to_csv("output.csv", index=False)
Upvotes: 0
Reputation: 77407
There is only so much you can do to speed up disk access. You can use threads to overlap some of the file reads with the latin1 decoding and the newline replacement, but realistically it won't make a huge difference.
import os
import multiprocessing.pool

import numpy as np

MEG = 2**20

# Build the full list of file paths first.
filelist = []
topics = os.listdir(my_directory)
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        filelist.append(my_directory + '/' + topic + '/' + file)

def worker(filename):
    # Read and decode one file; the blocking disk read can overlap with other threads.
    with open(filename, encoding='latin1', buffering=1*MEG) as f:
        data = f.read().replace('\n', ' ')
        #print(data)
        return data

# A thread pool overlaps the blocking reads with the decode/replace work.
with multiprocessing.pool.ThreadPool() as pool:
    datalist = pool.map(worker, filelist, chunksize=1)

df = np.array(datalist)
Upvotes: 1