Frank
Frank

Reputation: 745

Converting mixed-format .DAT to .CSV (or anything else)

I have a large collection of DAT files that need to be converted (eventually to a unique file type). The DAT's have a mixed amount of whitespace between fields, and the column headers are on different lines. Any advice?

                    ALT_RAD
                                               ALT_RAD2
                 DIRECT        D_GLOBAL        U_GLOBAL          Zenith
Year Mn Dy Hr Mi        DIFFUSE2            D_IR            U_IR
2004  9  1  0  1    1.04   79.40   78.67  303.58   61.06  310.95  85.142
2004  9  1  0  2    0.71   74.36   73.91  303.80   57.82  310.92  85.171
2004  9  1  0  3    0.67   71.80   71.64  304.25   56.84  310.98  85.199
2004  9  1  0  4    0.75   74.35   74.83  304.21   59.68  310.89  85.227

I have a basic script:

import sys
with open(sys.argv[1], r) as input_file:
    newLines = []
    for line in input_file:
            newLines.append(newLine)

Which I will certainly change to account for mixed whitespace, but I don't know how to work with the wonky column headers.

Eventually I want my headers to just be:

Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith

Upvotes: 2

Views: 2748

Answers (3)

Mad Physicist
Mad Physicist

Reputation: 114488

It looks like you can combine the header rows dynamically based on a word's position in the line. You can skip the first two lines, and combine the next two. If you do it right, you will be left with an iterator over a file stream that you can use to process the remainder of the data as you wish. You can convert it to a different format, or even import it into a pandas DataFrame directly.

To get the headers:

import re

def get_words_and_positions(line):
    return [(match.start(), match.group()) in re.finditer(r'[\w.]+', line)]

with open('file.dat') as file:
    iterator = iter(file)
    # Skip two lines
    next(iterator)
    next(iterator)
    # Get two header lines
    header = get_words_and_positions(next(iterator)) + \
             get_words_and_positions(next(iterator))
    # Sort by positon
    header.sort()
    # Extract words
    header = [word for pos, word in header]

You can now convert the file to a true CSV, or do something else with it. The important thing here is that you have iterator pointing to the actual data in the file stream, and a bunch of dynamically loaded column headers.

To write the remainder to a CSV file, without having to load the entire thing into memory at once, use csv.writer and the iterator from above:

 import csv
 ...
 with ...:
 ...
    with open('outfile.csv', 'w') as output:
        writer = csv.writer(output)
        writer.writerow(header)
        for line in iterator:
            writer.writerow(re.split(r'\s+', line))

You can combine the nested output with and the outer input with into a single outer block to reduce the nesting levels:

with open('file.dat') as file, open('outputfile.csv', 'w') as output:
    ....

To read in a pandas DataFrame, you can just pass the file object to pandas.read_csv. Since the file stream is past the headers at this point, it will not give you any issues:

import pandas as pd
...
with ...:
    ...
    df = pd.read_csv(file, sep=r'\s'+, header=None, names=header)

Upvotes: 1

cSharma
cSharma

Reputation: 645

This is answer for "Python - download and convert .dat to .csv [duplicate]". I couldn't post there so FYI you can get you exact output from here.

import urllib2
import csv
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
response = urllib2.urlopen(url)
readData = response.read()
strObj = filter(None,readData.splitlines())

strObj = [w.replace('\t', '  ') for w in strObj]

listB = []
for i in strObj:
    listB.append(filter(None,i.split("  ")))
with open(r'c:/data2.csv','a') as f:
    writer = csv.writer(f)
    writer.writerows(listB)

Upvotes: 1

Bill Bell
Bill Bell

Reputation: 21663

Treat those header lines in the input file with all the disdain they deserve. (Or, in other words, read them and discard them.)

headers='Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith'
with open ( 'temp.dat') as input_file:
    with open ('temp_2.csv', 'w') as output_file:
        output_file.write('"%s"\n'%'","'.join(headers.split()))
        for count, line in enumerate(input_file):
            if count<4: continue
            outLine = ','.join(line.split())
            output_file.write(outLine + '\n')

Upvotes: 3

Related Questions