Reputation: 53
I have a folder of 4000 txt files of time series data that I want to analyze with Pandas. I have been able to rename them and convert them into csv files; however, the columns are incorrectly grouped. I have gone through several posts here and watched several videos about parsing txt files, but nothing I've tried so far works.
Here's an example of one of the txt files. There are no leading or trailing white spaces, tabs, or newlines that I can see in Notepad:
Here's the code I'm working on, which derives mainly from here:
import csv

file = 'patient0.txt'
csv_f = 'patient0.csv'

with open(file, 'r') as in_txt:
    # strip whitespace and drop blank lines
    stripped = (line.strip() for line in in_txt)
    lines = (line for line in stripped if line)
    # group the lines into threes
    grouped = zip(*[lines] * 3)
    with open(csv_f, 'w') as out_file:
        writer = csv.writer(out_file)
        writer.writerows(grouped)
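For context on what the grouping line does: `zip(*[lines]*3)` chunks a flat iterator into consecutive triples, because all three arguments to `zip` are references to the *same* generator. A minimal sketch of that behavior (the sample values are illustrative):

```python
# zip(*[it]*3) chunks an iterator into consecutive triples:
# each of the three zip arguments is the same iterator object,
# so zip advances it three times per output tuple.
lines = iter(['a', 'b', 'c', 'd', 'e', 'f'])
grouped = list(zip(*[lines] * 3))
print(grouped)  # [('a', 'b', 'c'), ('d', 'e', 'f')]
```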
Here is the resulting csv file.
Here's how I need it to be formatted:
I just learned about generators today. Here are the results when I convert them to lists:
Upvotes: 0
Views: 992
Reputation: 118
** EDIT **
I just realized that this does not produce the exact format you are after. Leaving it up in case anyone else finds it useful
** EDIT **
The data you're looking at appears to be a custom format that uses key, value pairs. I don't know that you want to use the csv module to read these files (although it's very useful for writing the output csv files).
The format looks like this:
Different rows may have different parameters (I can't tell from the really small data snippet you gave). It also looks like you added 'Time,Parameter,Value' to the front of the file, which is why we see that strange 'Value00:00' entry. I think you meant to put a newline after Value.
I made a dummy file with some data as I think you have it:
00:00, RecordID,5,Age,73
00:42,PaCO2,3400:42,PaO2,34401:11
01:11,SysABP,10501:11,Temp,35.201:11
Here, the unique column names we'd expect the output csv file to have are
RecordID, Age, PaCO2, PaO2, SysABP, Temp
We need to loop through the file to discover all of these. Once we've found them, we can create a csv.DictWriter with the appropriate columns. Then we loop through the input file again, writing everything to dict as we see it.
I successfully tested this script on the dummy file I created above. Hopefully what's happening is pretty clear from the comments in the script.
import csv

def txt_to_csv(input_filenames):
    for input_filename in input_filenames:
        column_names = set()
        output_filename = input_filename[:-4] + '.csv'
        with open(input_filename, 'r') as in_txt:
            # first pass: figure out which column names appear in the file
            for line in in_txt:
                # get a list of parameters that were split by comma in the input txt file
                params = line.strip().split(",")
                # params[1::2] slices out every other entry starting with the first
                # column name; starting at index 1 always skips the timestamp column.
                # Or-ing the entries into the set keeps the memory footprint small
                # by storing only one copy of each unique column name.
                column_names |= set(params[1::2])
            # strip off any extra whitespace in column names
            column_names = {x.strip() for x in column_names}
            # add in the missing timestamp column
            column_names.add('timestamp')
            # sort the column names for a stable header order
            sorted_column_names = sorted(column_names)
            # bring the file pointer back to the beginning for the second pass
            in_txt.seek(0, 0)
            # open a csv file and start writing the output
            # (newline='' is required by the csv module in Python 3)
            with open(output_filename, 'w', newline='') as out_csv:
                writer = csv.DictWriter(out_csv, sorted_column_names, dialect='excel')
                # write column names
                writer.writeheader()
                for line in in_txt:
                    # create a list of values for this line
                    params = [x.strip() for x in line.strip().split(",")]
                    # turn key, value pairs into a dictionary
                    row_dict = dict(zip(params[1::2], params[2::2]))
                    # add the timestamp entry to the dictionary
                    row_dict['timestamp'] = params[0]
                    # write the row to file
                    writer.writerow(row_dict)

if __name__ == '__main__':
    input_filenames = [r'C:\Users\cruse\Desktop\dummy_data.txt']
    txt_to_csv(input_filenames)
The output I got was
Age  PaCO2       PaO2      RecordID  SysABP    Temp       timestamp
73                         5                              0:00
     3400:42:00  34401:11                                 0:42
                                     10501:11  35.201:11  1:11
Which is correct for this dataset. You'd then use a tool like Pandas to propagate values through time (i.e. to assign the same RecordID to all subsequent rows, with something like pd.DataFrame.ffill).
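A minimal sketch of that propagation step, assuming a csv shaped like the output above (the column names and sample values here are illustrative, not from your real data):

```python
import io
import pandas as pd

# a small stand-in for the csv produced above
csv_text = """timestamp,RecordID,Age,SysABP
00:00,5,73,
00:42,,,105
01:11,,,106
"""

df = pd.read_csv(io.StringIO(csv_text))
# forward-fill the static fields (RecordID, Age) down through time
df[['RecordID', 'Age']] = df[['RecordID', 'Age']].ffill()
print(df)
```

After the `ffill`, every row carries the RecordID and Age from the first timestamp, while the per-timestamp measurements stay where they were recorded.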
If you want this to process more files, just add more paths to the input_filenames list at the bottom.
Upvotes: 1
Reputation: 61
It appears the colons in the original csv represent line breaks, so convert those colons in the original text file to newlines and then save it as csv. It should then parse easily using:
import pandas as pd
df = pd.read_csv(csv_file_name)
Upvotes: 1