My80

Reputation: 169

Extracting wanted words from multiple text files (Python 3.6)

I have a folder with ~100,000 txt files. I'm trying to read all the files and create a DataFrame with two columns, id and text. For the id I take the part of the file name before the underscore; for example, from the file BL2334_uyhjghbvbvhf I extract BL2334. Before creating the data frame I would like to extract only the words after Detected Text:, so for the file below that would be BUCK, NIP, Preerfal Deet Attracter.

My file:

Id: 02398123-a642-4e3f-88a7
Type: LINE
Detected Text: BUCK
Confidence: 77.965172
Id: c85bbbe
Type: LINE
Detected Text: NIP
Confidence: 97.186539
Id: 28926a7a-78024c80-b9c5
Type: LINE
Detected Text: Preerfal Deet Attracter
Confidence: 47.749722

My code:

import os
import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'

file_list = []

for (root, dirs, files) in os.walk(path, topdown=True):
    file_list.append([root + "\\" + file for file in files])

def flatten(file_list):
    result_list_files = []
    for element in file_list:
        if isinstance(element, str):
            result_list_files.append(element)
        else:
            for element_1 in flatten(element):
                result_list_files.append(element_1)
    return result_list_files

result_flatten = flatten(file_list)

final_df = pd.DataFrame()

for file in result_flatten:
    temp_df = pd.DataFrame()
    file_id = file.split('\\')[-1].split('_')[0]
    temp_df['id'] = [file_id]
    with open(file, encoding="utf8") as f:
        temp_df['text'] = [f.read()]
    final_df = pd.concat([final_df, temp_df], ignore_index=True)


Upvotes: 3

Views: 194

Answers (2)

Just expanding on @Luca Angioloni's solution, you can use something like:

import os
import re

import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'

data = {'id': [], 'text': []}

for (root, dirs, files) in os.walk(path):
    for file in files:
        data['id'].append(file.split('_')[0])
        with open(os.path.join(root, file), encoding="utf8") as f:
            data['text'].append(re.findall(r'Detected Text: (.*)\n', f.read()))

df = pd.DataFrame(data)

This gives you one row per file: the id plus a list of matches in text. You can always use df.explode('text') to give each match its own row, with the id duplicated, though.
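For illustration, here is a minimal sketch of what df.explode does in this case (the values are taken from the sample file in the question):

import pandas as pd

df = pd.DataFrame({'id': ['BL2334'],
                   'text': [['BUCK', 'NIP', 'Preerfal Deet Attracter']]})

# Before: one row per file, 'text' holds a list of matches
# After: one row per detected line, with the id repeated
print(df.explode('text'))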


If you don't want to use re for some reason, you can replace the last line with:

data['text'].append([line.split(':')[1].strip() for line in f if line.startswith('Detected Text')])

and it should work as well.
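Put together, a minimal sketch of the non-regex variant (path and encoding are taken from the question; split(':', 1) is used so the code still works if the detected text itself contains a colon):

import os

import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'
data = {'id': [], 'text': []}

for (root, dirs, files) in os.walk(path):
    for file in files:
        data['id'].append(file.split('_')[0])
        with open(os.path.join(root, file), encoding="utf8") as f:
            # Keep only the part after "Detected Text:" on matching lines
            data['text'].append([line.split(':', 1)[1].strip()
                                 for line in f
                                 if line.startswith('Detected Text')])

df = pd.DataFrame(data)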

Upvotes: 1

Luca Angioloni

Reputation: 2253

  • To get only the Detected Text parts I would use a regular expression. Example:

    import re
    
    text = """
    Id: 02398123-a642-4e3f-88a7
    Type: LINE
    Detected Text: BUCK
    Confidence: 77.965172
    Id: c85bbbe
    Type: LINE
    Detected Text: NIP
    Confidence: 97.186539
    Id: 28926a7a-78024c80-b9c5
    Type: LINE
    Detected Text: Preerfal Deet Attracter
    Confidence: 47.749722
    """
    
    pattern = re.compile(r"Detected Text: (.*)\n")
    match = pattern.findall(text)  # match becomes ['BUCK', 'NIP', 'Preerfal Deet Attracter']
    
  • One thing that makes your code slower is that you keep allocating new DataFrames and then concatenating them. One way around this is to first build a dictionary with key = id, value = text and then convert it to a DataFrame with the from_dict method (a sketch follows the tuples example below). Or you could use a list of (id, text) tuples and then just:

    tuples = [
        ("id1", "some text"),
        ("id2", "some other text"),
        # ...
    ]
    final_df = pd.DataFrame(tuples, columns=['id', 'text'])
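
And a minimal sketch of the dictionary route mentioned above (the ids and texts here are made up):

    import pandas as pd

    # Hypothetical mapping of id -> file text
    data = {"BL2334": "some text", "BL9999": "some other text"}

    # orient='index' uses the dict keys as the row index;
    # reset_index/rename turn that index into the 'id' column
    final_df = (pd.DataFrame.from_dict(data, orient='index', columns=['text'])
                .reset_index()
                .rename(columns={'index': 'id'}))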
    

Upvotes: 1
