Reputation: 495
This is my first time dealing with more than one unstructured data file, and I need to know whether what I am doing is the best approach or if there is something better.
I have more than 1,000 text files, each representing a different novel, with texts of length up to 139,965 or more. I have read them and saved them in a DataFrame as shown below:
import glob, os, re
import pandas as pd

file_list = glob.glob("C:/.../TextFiles/*.txt")
data = pd.DataFrame({'Name': [], 'Content': []})
for file in file_list:
    with open(file, 'r', encoding="utf8", errors='ignore') as myfile:
        new_name = os.path.splitext(file)[0]  # file path without the extension
        # strip the directory part of the path and append one row per file
        data = data.append({'Name': re.sub(".*\\\\", " ", new_name),
                            'Content': myfile.read()}, ignore_index=True)
Then I started cleaning the text by going row by row:
data['Name'] = data['Name'].apply(lambda x: " ".join(x.split()))
Do you think this is the best approach to dealing with many large text files, saving them in a DataFrame?
My next step will be to extract specific information from the texts and save it in separate columns.
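For example, something along these lines is what I have in mind (the columns below are only placeholders for whatever I actually need to extract):
# hypothetical examples only -- continuing from the `data` DataFrame above
data['FirstWord'] = data['Content'].str.extract(r'^\s*(\w+)', expand=False)
data['WordCount'] = data['Content'].str.split().str.len()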
Any advice?
Upvotes: 0
Views: 36
Reputation: 77337
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
In your case a list of [name, content] sublists works.
import glob, os, re
import pandas as pd

file_list = glob.glob("C:/.../TextFiles/*.txt")
data = []
for file in file_list:
    with open(file, 'r', encoding="utf8", errors='ignore') as myfile:
        new_name = os.path.splitext(file)[0]  # file path without the extension
        # collect plain [name, content] sublists; whitespace in the text is collapsed here
        data.append([re.sub(".*\\\\", " ", new_name),
                     " ".join(myfile.read().split())])
# build the DataFrame once, from the full list
data = pd.DataFrame(data, columns=['Name', 'Content'])
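If you already have a DataFrame you want to keep, build the new frame from the list as above and combine the two in a single call; existing_df here is just a placeholder for that original frame:
# existing_df is a placeholder for a DataFrame you already hold
data = pd.concat([existing_df, data], ignore_index=True)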
Upvotes: 1