Reputation: 917
I need to read a .dat file in python which has 12 columns in total and millions of lines of rows. I need to divide column 2,3 and 4 with column 1 for my calculation. So before I load that .dat file, do I need to delete all the other unwanted columns? If not, how do I selectively declare the column and ask python to do the math?
an example of the .dat file would be data.dat
I am new to python , so a little instruction to open , read and calculation would be appreciated.
I have added the code I am using as a starter from your suggestion:
from sys import argv
import pandas as pd
script, filename = argv
txt = open(filename)
print "Here's your file %r:" % filename
print txt.read()
def your_func(row):
return row['x-momentum'] / row['mass']
columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
and also the error I get through it:
Traceback (most recent call last):
File "flash.py", line 18, in <module>
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
self._make_engine(self.engine)
File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file
Upvotes: 13
Views: 173008
Reputation: 382
Load the dat file as a list, and then convert the list to the desired data format, i.e., the numpy array.
For example, to load the Australian Credit Approval Data Set (found in UCI machine learning repository) in dat format, I use the following code:
dataset=[i.strip().split() for i in open("./data2/australian.dat").readlines()]
australian=np.full([len(dataset),len(dataset[0])],None)
for i in range(len(dataset)):
for j in range(len(dataset[0])):
if j==0:
australian[i,j]=int(dataset[i][j])
else:
australian[i,j]=float(dataset[i][j])
np.save('./data2/australian.npy', australian)
Upvotes: 0
Reputation: 389
train=pd.read_csv("Path",sep=" ::",header=None)
Now you can access the dat file.
train.columns=["A","B","C"]# Number of columns you can see in the dat file.
then you can use this as csv files.
Upvotes: 3
Reputation: 5935
The problem you face here is that the column header names have whitespaces in them. You need to fix/ignore that to make pandas.read_csv
behave nicely. This will read the column header names into a list based on the fixed length of the field name strings:
import pandas
with open('flash.dat') as f:
header = f.readline()[2:-1]
header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
header_fixed[0] = header_fixed[0][1:] # remove '#' from time
# pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
df = pandas.read_csv(f, sep='\s+', names=header_fixed, na_values="Infinity")
df.fillna(pandas.np.inf, inplace=True)
# processing
df['new_column'] = df['x-momentum'] / df['mass']
Upvotes: 1
Reputation: 107587
Consider using the general read_table()
function (of which read_csv()
is a special type) where pandas can easily import the specific .dat file specifying the space separator, sep='\s+'
. Additionally, no defined function with apply()
is needed for column by column calculation.
Below numpy is used to condition for division by zero. Also, the example .dat file's first column is #time and columns 2, 3, 4 are x-momentum, y-momentum, and mass (different expression in your code but revise as needed).
import pandas as pd
import numpy as np
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep="\s+", usecols=columns_to_keep)
df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)
Upvotes: 2
Reputation: 1257
After looking at your flash.dat
file, it's clear you need to do a little clean up before you process it. The following code converts it to a CSV file:
import csv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]
# write it as a new CSV file
with open("./flash.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(datContent)
Now, use Pandas to compute new column.
import pandas as pd
def your_func(row):
return row['x-momentum'] / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print dataframe
Upvotes: 12