Reputation: 169
I have a folder with ~100,000 txt files. I'm trying to read all the files and create a DataFrame with two columns, id and text. For the id I take the part of the file name before the underscore: for example, from the file BL2334_uyhjghbvbvhf I extract BL2334, so that is the id. Before creating the DataFrame I would also like to keep only the words that appear after Detected Text: ... — so for the file below that would be BUCK, NIP, and Preerfal Deet Attracter.
My file:
Id: 02398123-a642-4e3f-88a7
Type: LINE
Detected Text: BUCK
Confidence: 77.965172
Id: c85bbbe
Type: LINE
Detected Text: NIP
Confidence: 97.186539
Id: 28926a7a-78024c80-b9c5
Type: LINE
Detected Text: Preerfal Deet Attracter
Confidence: 47.749722
My code:
import os
import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'

file_list = []
for (root, dirs, files) in os.walk(path, topdown=True):
    file_list.append([root + "\\" + file for file in files])

def flatten(file_list):
    result_list_files = []
    for element in file_list:
        if isinstance(element, str):
            result_list_files.append(element)
        else:
            for element_1 in flatten(element):
                result_list_files.append(element_1)
    return result_list_files

result_flatten = flatten(file_list)

final_df = pd.DataFrame()
for file in result_flatten:
    temp_df = pd.DataFrame()
    id = file.split('\\')[-1].split('_')[0]
    temp_df['id'] = [id]
    temp_df['text'] = [open(file, encoding="utf8").read()]
    final_df = pd.concat([final_df, temp_df], ignore_index=True)
Upvotes: 3
Views: 194
Reputation: 4929
Just expanding on @Luca Angioioni's solution, you can use something like:
import os
import re
import pandas as pd

path = r'C:\Users\example\Documents\MyFolder'  # same folder as in the question

data = {'id': [], 'text': []}
for (root, dirs, files) in os.walk(path):
    for file in files:
        data['id'].append(file.split('_')[0])
        with open(os.path.join(root, file), encoding="utf8") as f:
            data['text'].append(re.findall('Detected Text: (.*)\n', f.read()))

df = pd.DataFrame(data)
This gives you one row per file: the id plus the list of matches in text. You can then use df.explode('text') to put each match on its own row, with the id duplicated.
If you don't want to use re for some reason, you can replace the re.findall line with:
data['text'].append([line.split(':')[1].strip() for line in f if line.startswith('Detected Text')])
and it should work as well.
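As a quick check of that variant against the sample file contents from the question (the list below just stands in for iterating over the open file object):

sample_lines = [
    "Id: 02398123-a642-4e3f-88a7\n",
    "Type: LINE\n",
    "Detected Text: BUCK\n",
    "Confidence: 77.965172\n",
    "Detected Text: NIP\n",
    "Detected Text: Preerfal Deet Attracter\n",
]

print([line.split(':')[1].strip() for line in sample_lines
       if line.startswith('Detected Text')])
# ['BUCK', 'NIP', 'Preerfal Deet Attracter']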
Upvotes: 1
Reputation: 2253
To only get the Detected Text parts I would use a regular expression. Example:
import re
text = """
Id: 02398123-a642-4e3f-88a7
Type: LINE
Detected Text: BUCK
Confidence: 77.965172
Id: c85bbbe
Type: LINE
Detected Text: NIP
Confidence: 97.186539
Id: 28926a7a-78024c80-b9c5
Type: LINE
Detected Text: Preerfal Deet Attracter
Confidence: 47.749722
"""
pattern = re.compile(r"Detected Text: (.*)\n")
match = pattern.findall(text) # match becomes ['BUCK', 'NIP', 'Preerfal Deet Attracter']
One thing that makes your code slower is that you keep allocating new DataFrames and then concatenating them. One way around this is to first build a dictionary with key = id, value = text and then convert it to a DataFrame with the from_dict method (see the sketch after the tuple example below). Or you could build a list of (id, text) tuples and then just:
tuples = [
    ("id1", "some text"),
    ("id2", "some other text"),
    ...
]
final_df = pd.DataFrame(tuples, columns=['id', 'text'])
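And for the dictionary route mentioned above, a minimal sketch (placeholder ids and texts; note that a plain dict keyed by id assumes each id occurs only once):

import pandas as pd

# key = id, value = text, collected while reading the files
data = {
    "BL2334": "BUCK NIP Preerfal Deet Attracter",  # placeholder text
    "BL9999": "some other text",                   # placeholder id/text
}

final_df = (pd.DataFrame.from_dict(data, orient='index', columns=['text'])
              .rename_axis('id')
              .reset_index())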
Upvotes: 1