Reputation: 917

reading and doing calculation from .dat file in python

I need to read a .dat file in python which has 12 columns in total and millions of lines of rows. I need to divide column 2,3 and 4 with column 1 for my calculation. So before I load that .dat file, do I need to delete all the other unwanted columns? If not, how do I selectively declare the column and ask python to do the math?

an example of the .dat file would be data.dat

I am new to python , so a little instruction to open , read and calculation would be appreciated.

I have added the code I am using as a starter from your suggestion:

from sys import argv

import pandas as pd



script, filename = argv

txt = open(filename)

print "Here's your file %r:" % filename
print txt.read()

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

and also the error I get through it:

Traceback (most recent call last):
  File "flash.py", line 18, in <module>
    dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
    self._make_engine(self.engine)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file

Upvotes: 13

Answers (6)

Mingming

Reputation: 382

Load the dat file as a list, and then convert the list to the desired data format, i.e., the numpy array.

For example, to load the Australian Credit Approval Data Set (found in UCI machine learning repository) in dat format, I use the following code:

dataset=[i.strip().split() for i in open("./data2/australian.dat").readlines()]

australian=np.full([len(dataset),len(dataset[0])],None)

for i in range(len(dataset)):

for j in range(len(dataset[0])):

    if j==0:
        australian[i,j]=int(dataset[i][j])    
    else:
        australian[i,j]=float(dataset[i][j])

np.save('./data2/australian.npy', australian)

Upvotes: 0

Nisarg Bhatt

Reputation: 389

train=pd.read_csv("Path",sep=" ::",header=None)

Now you can access the dat file.

train.columns=["A","B","C"]# Number of columns you can see in the dat file.

then you can use this as csv files.

Upvotes: 3

Jan Christoph Terasa

Reputation: 5935

The problem you face here is that the column header names have whitespaces in them. You need to fix/ignore that to make pandas.read_csv behave nicely. This will read the column header names into a list based on the fixed length of the field name strings:

import pandas

with open('flash.dat') as f:
    header = f.readline()[2:-1]
    header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
    header_fixed[0] = header_fixed[0][1:] # remove '#' from time

    # pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
    df = pandas.read_csv(f, sep='\s+', names=header_fixed, na_values="Infinity")
    df.fillna(pandas.np.inf, inplace=True)

# processing
df['new_column'] = df['x-momentum'] / df['mass']

Upvotes: 1

Parfait

Reputation: 107587

Consider using the general read_table() function (of which read_csv() is a special type) where pandas can easily import the specific .dat file specifying the space separator, sep='\s+'. Additionally, no defined function with apply() is needed for column by column calculation.

Below numpy is used to condition for division by zero. Also, the example .dat file's first column is #time and columns 2, 3, 4 are x-momentum, y-momentum, and mass (different expression in your code but revise as needed).

import pandas as pd
import numpy as np

columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep="\s+", usecols=columns_to_keep)

df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)

Upvotes: 2

Bill

Reputation: 1257

After looking at your flash.dat file, it's clear you need to do a little clean up before you process it. The following code converts it to a CSV file:

import csv

# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]

# write it as a new CSV file
with open("./flash.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)

Now, use Pandas to compute new column.

import pandas as pd

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

print dataframe

Upvotes: 12

ppaulojr

Reputation: 3647

Try something like:

datContent = [i.strip().split() for i in open("filename.dat").readlines()]

Then you'll have your data in a list.

If you want to have something more sophisticated you can use Pandas, see the linked cookbook.

Upvotes: 2

reading and doing calculation from .dat file in python

Answers (6)

Related Questions