C3Theo

Reputation: 53

Columns not separated by comma when converting multiple .txt files into .csv files in Python

I have a folder of 4000 txt files of time series data that I want to analyze using Pandas etc. I have been able to rename them and convert them into csv files; however, the columns are incorrectly grouped. I have gone through several posts here and watched several videos about parsing txt files, but nothing I've tried so far works.

Here's an example of one of the txt files. There are no leading or trailing whitespace characters, tabs, or newlines that I can see in Notepad:

[screenshot of the txt file]

Here's the code I'm working on, which is derived mainly from here:

import csv

file = 'patient0.txt'
csv_f = 'patient0.csv'
with open(file, 'r') as in_txt:
    stripped = (line.strip() for line in in_txt)
    lines = (line for line in stripped if line)
    grouped = zip(*[lines]*3)
    with open(csv_f, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerows(grouped)
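For context, `zip(*[lines]*3)` pulls from the same iterator three times per tuple, so every three stripped lines become one row. A minimal sketch with made-up values shows why each output row ends up holding three of the file's lines rather than separate columns:

```python
# zip(*[it]*3) repeats the SAME iterator object three times, so each
# tuple consumes three consecutive items from it
lines = iter(["00:00", "RecordID", "5", "00:42", "PaCO2", "34"])
grouped = list(zip(*[lines] * 3))
print(grouped)  # [('00:00', 'RecordID', '5'), ('00:42', 'PaCO2', '34')]
```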

Here is the resulting csv file:

[screenshot of the resulting csv file]

Here's how I need it to be formatted:

[screenshot of the desired csv format]

I just learned about generators today. Here are the results when I convert them to lists:

[screenshot of the generators converted to lists]

Upvotes: 0

Views: 992

Answers (2)

Alex O

Reputation: 118

** EDIT **

I just realized that this does not produce the exact format you are after. I'm leaving it up in case anyone else finds it useful.

** EDIT **

The data you're looking at looks like a custom format that uses key, value pairs. I don't know that you want to use the csv module to read these files (although it's very useful for writing the output csv files).

The format appears to be a timestamp followed by comma-separated key, value pairs. Different rows may have different parameters (I can't tell from the really small data snippet you gave). It also looks like you added 'Time,Parameter,Value' to the front of the file, which is why we see that strange 'Value00:00' entry; I think you meant to put a newline after 'Value'.

I made a dummy file with some data as I think you have it:

00:00, RecordID,5,Age,73
00:42,PaCO2,3400:42,PaO2,34401:11
01:11,SysABP,10501:11,Temp,35.201:11

Here, the unique column names we'd expect the output csv file to have are

RecordID, Age, PaCO2, PaO2, SysABP, Temp

We need to loop through the file to discover all of these. Once we've found them, we can create a csv.DictWriter with the appropriate columns. Then we loop through the input file again, building a dict for each row and writing it out as we go.

I successfully tested this script on the dummy file I created above. Hopefully what's happening is pretty clear from the comments in the script.

import csv


def txt_to_csv(input_filenames):

    for input_filename in input_filenames:
        column_names = set()
        output_filename = input_filename[:-4] + '.csv'
        with open(input_filename, 'r') as in_txt:

            # figure out which column names are in the file on at least one line
            for line in in_txt:

                # get a list of parameters that were split by comma in the input txt file
                params = line.strip().split(",")

                # params[1::2] slices out every other entry starting with the
                # first parameter name; or-ing the entries into the set keeps the
                # memory footprint small by storing only one copy of each unique
                # column name
                column_names |= set(params[1::2])

                # notice that we always skip the first column with the timestamp by starting at 1

            # strip off any extra whitespace in column names
            column_names = {x.strip() for x in column_names}

            # add in missing timestamp column to the column names
            column_names.add('timestamp')

            # sort the column names so the output column order is deterministic
            sorted_column_names = sorted(column_names)

            # bring the pointer back to the beginning of the file
            in_txt.seek(0, 0)

            # open a csv file and start writing the output
            # open a csv file and start writing the output
            # (newline='' prevents the csv module writing blank lines on Windows)
            with open(output_filename, 'w', newline='') as out_csv:
                writer = csv.DictWriter(out_csv, sorted_column_names, dialect='excel')

                # write column names
                writer.writeheader()

                for line in in_txt:
                    # create a list of values for this line
                    params = [x.strip() for x in line.strip().split(",")]

                    # turn key value pairs into dictionary
                    row_dict = dict(zip(params[1::2], params[2::2]))

                    # write timestamp entry to the dictionary
                    row_dict['timestamp'] = params[0]

                    # write row to file
                    writer.writerow(row_dict)


if __name__ == '__main__':
    input_filenames = [r'C:\Users\cruse\Desktop\dummy_data.txt']
    txt_to_csv(input_filenames)

The output I got was

Age PaCO2       PaO2      RecordID  SysABP      Temp            timestamp
73                        5                                     0:00
    3400:42:00  34401:11                                        0:42
                                    10501:11    35.201:11       1:11

Which is correct for this dataset. You'd then use a tool like Pandas to propagate values through time (i.e. to assign the same RecordID to all subsequent rows, e.g. with pandas' fillna/ffill).
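As a quick sketch of that forward-fill step (the csv text below is a made-up sample in the shape the script above produces):

```python
import io
import pandas as pd

# A small csv in the shape the script above produces (values assumed)
csv_text = """timestamp,RecordID,Age,PaCO2
00:00,5,73,
00:42,,,34
01:11,,,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Forward-fill: each missing value takes the last value seen above it,
# so RecordID=5 and Age=73 propagate down to every later timestamp
df = df.ffill()
print(df)
```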

If you want this to process more files, just add more paths to the input_filenames list at the bottom.

Upvotes: 1

Mike Dale

Reputation: 61

It appears the colons in the original file mark where a new line should begin, so convert these in the original text file to newlines and then save it as a csv. It should then parse easily using:

import pandas as pd

df = pd.read_csv(csv_file_name)
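One way to sketch that idea with a regex (this is my guess at the pattern: each value runs into the next HH:MM timestamp, so a newline is inserted before any timestamp glued to a preceding digit):

```python
import re

# One fused line from the sample data shown in the other answer
line = "00:42,PaCO2,3400:42,PaO2,34401:11"

# Break the line before any HH:MM timestamp that directly follows a digit
fixed = re.sub(r"(?<=\d)(\d{2}:\d{2})(?=,|$)", r"\n\1", line)
print(fixed)
```

Here the first two resulting lines are `00:42,PaCO2,34` and `00:42,PaO2,344`; whether this pattern holds for all 4000 files would need checking against the real data.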

Upvotes: 1