Reputation: 495
This is my first time dealing with more than one unstructured data file, and I need to know whether what I am doing is the best approach or if there is something better.
I have more than 1,000 text files, each representing a different novel, with texts of length up to 139,965 or more. I have read them and saved them in a DataFrame as shown below:
import glob, os, re
import pandas as pd

file_list = glob.glob("C:/.../TextFiles/*.txt")
data = pd.DataFrame({'Name': [], 'Content': []})
for file in file_list:
    with open(file, 'r', encoding="utf8", errors='ignore') as myfile:
        new_name = os.path.splitext(file)[0]  # file path without the extension
        # strip the directory part of the path and append one row per file
        data = data.append({'Name': re.sub(".*\\\\", " ", new_name),
                            'Content': myfile.read()}, ignore_index=True)
Then I started cleaning the text by going row by row:
data['Name'] = data['Name'].apply(lambda x: " ".join(x.split()))
Do you think this is the best approach to dealing with many large text files, saving them in a DataFrame?
My next step will be to extract specific information from the texts and save it in separate columns.
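For example, something along these lines is what I have in mind (the columns below are only placeholders for whatever I actually need to extract):
# hypothetical examples only -- continuing from the `data` DataFrame above
data['FirstWord'] = data['Content'].str.extract(r'^\s*(\w+)', expand=False)
data['WordCount'] = data['Content'].str.split().str.len()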
Any advice?
Upvotes: 0
Views: 36
Reputation: 77337
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
In your case a list of [name, content] sublists works.
import glob, os, re
import pandas as pd

file_list = glob.glob("C:/.../TextFiles/*.txt")
data = []
for file in file_list:
    with open(file, 'r', encoding="utf8", errors='ignore') as myfile:
        new_name = os.path.splitext(file)[0]  # file path without the extension
        # collect plain [name, content] sublists; whitespace in the text is collapsed here
        data.append([re.sub(".*\\\\", " ", new_name),
                     " ".join(myfile.read().split())])
# build the DataFrame once, from the full list
data = pd.DataFrame(data, columns=['Name', 'Content'])
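If you already have a DataFrame you want to keep, build the new frame from the list as above and combine the two in a single call; existing_df here is just a placeholder for that original frame:
# existing_df is a placeholder for a DataFrame you already hold
data = pd.concat([existing_df, data], ignore_index=True)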
Upvotes: 1