My80

Reputation: 169

Extracting wanted words from multiple text files (Python 3.6)

I have a folder with ~100,000 txt files. I'm trying to read all the files and create a DataFrame with two columns, id and text. For the id I take the part of the file name before the underscore; for example, from the file BL2334_uyhjghbvbvhf I extract BL2334. Before creating the data frame I would like to extract only the words after Detected Text:, so for the file below that would be BUCK, NIP, Preerfal Deet Attracter.

My file:

Id: 02398123-a642-4e3f-88a7
Type: LINE
Detected Text: BUCK
Confidence: 77.965172
Id: c85bbbe
Type: LINE
Detected Text: NIP
Confidence: 97.186539
Id: 28926a7a-78024c80-b9c5
Type: LINE
Detected Text: Preerfal Deet Attracter
Confidence: 47.749722

My code:

import os
import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'

file_list = []

for (root, dirs, files) in os.walk(path, topdown=True):
    file_list.append([root + "\\" + file for file in files])

def flatten(file_list):
    result_list_files = []
    for element in file_list:
        if isinstance(element, str):
            result_list_files.append(element)
        else:
            for element_1 in flatten(element):
                result_list_files.append(element_1)
    return result_list_files

result_flatten = flatten(file_list)

final_df = pd.DataFrame()

for file in result_flatten:
    temp_df = pd.DataFrame()
    file_id = file.split('\\')[-1].split('_')[0]
    temp_df['id'] = [file_id]
    with open(file, encoding="utf8") as f:
        temp_df['text'] = [f.read()]
    final_df = pd.concat([final_df, temp_df], ignore_index=True)


Upvotes: 3

Views: 194

Answers (2)

Just expanding on @Luca Angioloni's solution, you can use something like:

import os
import re

import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'

data = {'id': [], 'text': []}

for (root, dirs, files) in os.walk(path):
    for file in files:
        data['id'].append(file.split('_')[0])
        with open(os.path.join(root, file), encoding="utf8") as f:
            data['text'].append(re.findall(r'Detected Text: (.*)\n', f.read()))

df = pd.DataFrame(data)

This gives you one row per file: the id plus a list of matches in text. You can always use df.explode('text') to give each match its own row, with the id duplicated, though.
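For illustration, here is a minimal sketch of what df.explode does in this case (the values are taken from the sample file in the question):

import pandas as pd

df = pd.DataFrame({'id': ['BL2334'],
                   'text': [['BUCK', 'NIP', 'Preerfal Deet Attracter']]})

# Before: one row per file, 'text' holds a list of matches
# After: one row per detected line, with the id repeated
print(df.explode('text'))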


If you don't want to use re for some reason, you can replace the last line with:

data['text'].append([line.split(':')[1].strip() for line in f if line.startswith('Detected Text')])

and it should work as well.
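Put together, a minimal sketch of the non-regex variant (path and encoding are taken from the question; split(':', 1) is used so the code still works if the detected text itself contains a colon):

import os

import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'
data = {'id': [], 'text': []}

for (root, dirs, files) in os.walk(path):
    for file in files:
        data['id'].append(file.split('_')[0])
        with open(os.path.join(root, file), encoding="utf8") as f:
            # Keep only the part after "Detected Text:" on matching lines
            data['text'].append([line.split(':', 1)[1].strip()
                                 for line in f
                                 if line.startswith('Detected Text')])

df = pd.DataFrame(data)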

Upvotes: 1

Luca Angioloni

Reputation: 2253

  • To get only the Detected Text parts I would use a regular expression. Example:

    import re
    
    text = """
    Id: 02398123-a642-4e3f-88a7
    Type: LINE
    Detected Text: BUCK
    Confidence: 77.965172
    Id: c85bbbe
    Type: LINE
    Detected Text: NIP
    Confidence: 97.186539
    Id: 28926a7a-78024c80-b9c5
    Type: LINE
    Detected Text: Preerfal Deet Attracter
    Confidence: 47.749722
    """
    
    pattern = re.compile(r"Detected Text: (.*)\n")
    match = pattern.findall(text)  # match becomes ['BUCK', 'NIP', 'Preerfal Deet Attracter']
    
  • One thing that makes your code slower is that you keep allocating new DataFrames and then concatenating them. One way around this is to first build a dictionary with key = id, value = text and then convert it to a DataFrame with the from_dict method (a sketch follows the tuples example below). Or you could use a list of (id, text) tuples and then just:

    tuples = [
        ("id1", "some text"),
        ("id2", "some other text"),
        # ...
    ]
    final_df = pd.DataFrame(tuples, columns=['id', 'text'])
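
And a minimal sketch of the dictionary route mentioned above (the ids and texts here are made up):

    import pandas as pd

    # Hypothetical mapping of id -> file text
    data = {"BL2334": "some text", "BL9999": "some other text"}

    # orient='index' uses the dict keys as the row index;
    # reset_index/rename turn that index into the 'id' column
    final_df = (pd.DataFrame.from_dict(data, orient='index', columns=['text'])
                .reset_index()
                .rename(columns={'index': 'id'}))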
    

Upvotes: 1
