Green

Reputation: 685

Extract text from .txt file and save into .csv files with columns and header

I have approximately 100 text files with clinical notes, each consisting of 1-2 paragraphs. The files are named doc_1.txt to doc_179.txt. I would like to save the text from each file into a .csv file with 2 columns with headers (id, text). The id column holds the name of each file.

For example, doc_1 is the file name and will become the id. The text in doc_1 will be stored in the text column. The desired result is below:


|   id  | text |
|:-----:|:----:|
| doc_1 | abcf |
| doc_2 | efrf |
| doc_3 | gvni |


So far I have only viewed the text and have not determined the most practical way to achieve this result.

Upvotes: 0

Views: 1008

Answers (2)

Green

Reputation: 685

I wanted to update the solution that was provided to me to resolve my problem.

import glob

import pandas as pd

files_list = glob.glob("*.txt")  # every .txt file in the current directory

rows = []
for file in files_list:
    with open(file) as f:
        txt = f.read()  # to retrieve the text in the file
    file_name = file.split(".")[0]  # to remove file type
    rows.append({"id": file_name, "text": txt})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list
df = pd.DataFrame(rows, columns=["id", "text"])
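One thing to watch with the glob approach: the files come back in no particular order, and a plain alphabetical sort puts doc_10 before doc_2. Below is a minimal sketch of one way to keep the rows in numeric order. The helper names `doc_number` and `notes_to_frame` are made up for illustration, and it assumes the notes are UTF-8 encoded and the file names end in a number.

```python
import re
from pathlib import Path

import pandas as pd


def doc_number(path):
    """Return the trailing number in a name like doc_12.txt, or -1 if there is none."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else -1


def notes_to_frame(folder="."):
    """Read every .txt file in `folder` into a DataFrame with id and text columns."""
    # Sort by the numeric suffix so doc_2 comes before doc_10.
    paths = sorted(Path(folder).glob("*.txt"), key=doc_number)
    # path.stem is the file name without its extension (assumes UTF-8 text).
    rows = [{"id": p.stem, "text": p.read_text(encoding="utf-8")} for p in paths]
    return pd.DataFrame(rows, columns=["id", "text"])
```

Calling `notes_to_frame()` in the folder that holds the notes should give the same kind of DataFrame as the loop above, just in doc_1, doc_2, ... order.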


Upvotes: 1

Zac

Reputation: 159

Assume you have a list of the files.

import pandas as pd # remove if already imported

# ...

files_list = ["doc_1.txt", "doc_2.txt", ..., "doc_179.txt"]
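If typing the list by hand is impractical, here is a rough sketch of building it instead. It assumes the files sit in the current working directory and follow the doc_<n>.txt pattern from the question; `existing_docs` is a hypothetical helper name. Since the question mentions roughly 100 files across numbers 1 to 179, numbers with no matching file are skipped.

```python
import os


def existing_docs(first=1, last=179):
    """List doc_<n>.txt names in numeric order, skipping numbers with no file."""
    names = (f"doc_{n}.txt" for n in range(first, last + 1))
    # Keep only the names that exist in the current working directory.
    return [name for name in names if os.path.isfile(name)]
```

`files_list = existing_docs()` could then stand in for the hand-written list.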

Create a list to collect one row per file:

rows = []

Iterate through each file to read the text, then build the DataFrame and save it into a csv file:

for file in files_list:
    with open(file) as f:
        txt = f.read() # to retrieve the text in the file
    file_name = file.split(".")[0] # to remove file type
    rows.append({"id": file_name, "text": txt}) # add row to the list

df = pd.DataFrame(rows, columns=["id", "text"]) # DataFrame.append was removed in pandas 2.0


df.to_csv("result.csv", sep="|", index=False) # export DataFrame into csv file

Feel free to change the name of the output csv file (result.csv) and the character used for sep.

It is advisable to choose a character that is not already contained in the text of any of the files. (For example, if any of the text files contains commas, avoid , as the sep value. pandas quotes such fields by default, but tools that simply split each line on the separator will not handle them.)
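If a comma-separated file is still needed, the default quoting in `to_csv` should cope with commas and line breaks inside the text, as long as the file is read back with a CSV-aware reader. A small sketch with made-up sample rows (not real notes) that writes to a string and reads it back:

```python
import io

import pandas as pd

# Made-up sample rows; the second note contains commas and a line break.
df = pd.DataFrame(
    {
        "id": ["doc_1", "doc_2"],
        "text": ["no separators here", "fever, cough,\nfollow-up in 2 weeks"],
    }
)

csv_text = df.to_csv(index=False)  # default sep="," with minimal quoting
restored = pd.read_csv(io.StringIO(csv_text))  # parse the CSV text back
```

Only the field that contains the separator or a line break gets wrapped in double quotes, and `restored` should hold the same ids and text as `df`.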

Upvotes: 2
