Reputation: 745
I have a large collection of DAT files that need to be converted (eventually to a unique file type). The DATs have mixed amounts of whitespace between fields, and the column headers are on different lines. Any advice?
ALT_RAD
ALT_RAD2
DIRECT D_GLOBAL U_GLOBAL Zenith
Year Mn Dy Hr Mi DIFFUSE2 D_IR U_IR
2004 9 1 0 1 1.04 79.40 78.67 303.58 61.06 310.95 85.142
2004 9 1 0 2 0.71 74.36 73.91 303.80 57.82 310.92 85.171
2004 9 1 0 3 0.67 71.80 71.64 304.25 56.84 310.98 85.199
2004 9 1 0 4 0.75 74.35 74.83 304.21 59.68 310.89 85.227
I have a basic script:
import sys

with open(sys.argv[1], 'r') as input_file:
    newLines = []
    for line in input_file:
        newLines.append(line)
Which I will certainly change to account for mixed whitespace, but I don't know how to work with the wonky column headers.
Eventually I want my headers to just be:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith
Upvotes: 2
Views: 2748
Reputation: 114488
It looks like you can combine the header rows dynamically based on a word's position in the line. You can skip the first two lines, and combine the next two. If you do it right, you will be left with an iterator over a file stream that you can use to process the remainder of the data as you wish. You can convert it to a different format, or even import it into a pandas DataFrame directly.
To get the headers:
import re

def get_words_and_positions(line):
    return [(match.start(), match.group()) for match in re.finditer(r'[\w.]+', line)]

with open('file.dat') as file:
    iterator = iter(file)

    # Skip the two title lines
    next(iterator)
    next(iterator)

    # Get the two header lines
    header = get_words_and_positions(next(iterator)) + \
             get_words_and_positions(next(iterator))

    # Sort by position
    header.sort()

    # Extract just the words
    header = [word for pos, word in header]
You can now convert the file to a true CSV, or do something else with it. The important thing is that iterator now points to the actual data in the file stream, and that you have a dynamically built list of column headers.
To write the remainder to a CSV file, without having to load the entire thing into memory at once, use csv.writer and the iterator from above:
import csv

...

with ...:
    ...
    with open('outfile.csv', 'w') as output:
        writer = csv.writer(output)
        writer.writerow(header)
        for line in iterator:
            writer.writerow(re.split(r'\s+', line.strip()))
You can combine the nested output with-statement and the outer input with-statement into a single block to reduce the nesting level:

with open('file.dat') as file, open('outputfile.csv', 'w') as output:
    ...
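Putting the pieces together, the whole conversion might look like the sketch below. The sample data at the bottom is invented to mirror the layout shown in the question (its exact column alignment is an assumption; real files keep their own alignment):

```python
import csv
import os
import re
import tempfile

def get_words_and_positions(line):
    """Return (column, word) pairs so two header lines can be merged by position."""
    return [(m.start(), m.group()) for m in re.finditer(r'[\w.]+', line)]

def dat_to_csv(infile, outfile):
    """Skip two title lines, merge the two header lines, then copy the data rows."""
    with open(infile) as file, open(outfile, 'w', newline='') as output:
        iterator = iter(file)
        next(iterator)  # ALT_RAD
        next(iterator)  # ALT_RAD2
        header = (get_words_and_positions(next(iterator)) +
                  get_words_and_positions(next(iterator)))
        header.sort()
        writer = csv.writer(output)
        writer.writerow([word for pos, word in header])
        for line in iterator:
            writer.writerow(line.split())

# Demo on a tiny file shaped like the one in the question.
sample = (
    "ALT_RAD\n"
    "ALT_RAD2\n"
    + " " * 17 + "DIRECT" + " " * 10 + "D_GLOBAL" + " " * 8 + "U_GLOBAL" + " " * 8 + "Zenith\n"
    + "Year Mn Dy Hr Mi" + " " * 8 + "DIFFUSE2" + " " * 9 + "D_IR" + " " * 12 + "U_IR\n"
    + "2004 9 1 0 1 1.04 79.40 78.67 303.58 61.06 310.95 85.142\n"
)
tmpdir = tempfile.mkdtemp()
dat_path = os.path.join(tmpdir, 'file.dat')
csv_path = os.path.join(tmpdir, 'file.csv')
with open(dat_path, 'w') as f:
    f.write(sample)

dat_to_csv(dat_path, csv_path)

with open(csv_path, newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])
# → ['Year', 'Mn', 'Dy', 'Hr', 'Mi', 'DIRECT', 'DIFFUSE2', 'D_GLOBAL', 'D_IR', 'U_GLOBAL', 'U_IR', 'Zenith']
```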
To read the rest into a pandas DataFrame, you can just pass the file object to pandas.read_csv. Since the file stream is already past the headers at this point, it will not give you any issues:
import pandas as pd

...

with ...:
    ...
    df = pd.read_csv(file, sep=r'\s+', header=None, names=header)
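As a small self-contained illustration of that last call, using io.StringIO to stand in for the open file object (already past the headers) and a header list like the one the merging step would produce:

```python
from io import StringIO
import pandas as pd

# Stand-in for the file stream after the four header lines were consumed.
data = StringIO(
    "2004 9 1 0 1 1.04 79.40 78.67 303.58 61.06 310.95 85.142\n"
    "2004 9 1 0 2 0.71 74.36 73.91 303.80 57.82 310.92 85.171\n"
)
header = ['Year', 'Mn', 'Dy', 'Hr', 'Mi', 'DIRECT', 'DIFFUSE2',
          'D_GLOBAL', 'D_IR', 'U_GLOBAL', 'U_IR', 'Zenith']

df = pd.read_csv(data, sep=r'\s+', header=None, names=header)
print(df.shape)  # → (2, 12)
```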
Upvotes: 1
Reputation: 645
This is an answer for "Python - download and convert .dat to .csv [duplicate]". I couldn't post there, so FYI you can get your exact output from here.
# Python 2 (urllib2; filter() returns a list here)
import urllib2
import csv

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
response = urllib2.urlopen(url)
readData = response.read()

# Drop empty lines, then normalize tabs to spaces
strObj = filter(None, readData.splitlines())
strObj = [w.replace('\t', ' ') for w in strObj]

listB = []
for i in strObj:
    listB.append(filter(None, i.split(" ")))

with open(r'c:/data2.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerows(listB)
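The snippet above only runs on Python 2. A rough Python 3 equivalent, with the line-cleaning step factored into a function (the network fetch is left commented out so the parsing logic stands alone), might look like:

```python
import csv
from urllib.request import urlopen

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

def clean_rows(text):
    """Split raw text into rows: drop empty lines, treat tabs as spaces,
    and discard the empty fields left by runs of whitespace."""
    rows = []
    for line in filter(None, text.splitlines()):
        rows.append([w for w in line.replace('\t', ' ').split(' ') if w])
    return rows

# Same flow as the Python 2 snippet:
# response = urlopen(url)
# rows = clean_rows(response.read().decode('utf-8'))
# with open('data2.csv', 'w', newline='') as f:
#     csv.writer(f).writerows(rows)
```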
Upvotes: 1
Reputation: 21663
Treat those header lines in the input file with all the disdain they deserve. (Or, in other words, read them and discard them.)
headers = 'Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith'

with open('temp.dat') as input_file:
    with open('temp_2.csv', 'w') as output_file:
        output_file.write('"%s"\n' % '","'.join(headers.split()))
        for count, line in enumerate(input_file):
            if count < 4:
                continue
            outLine = ','.join(line.split())
            output_file.write(outLine + '\n')
Upvotes: 3