Reputation: 1702
I have a collection of fairly large CSV files (~1M records per, 15MB+) whose contents I need to extract, reformat and then append to a template text file.
I've written the following python script to do this and it does what I want, but runs very slowly (processing took ~15 mins for a single file) and I have several thousand CSV files to process.
The script I wrote is below:
import pandas as pd
import shutil as s
from datetime import datetime as dt
dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'
df = pd.read_csv(dir + f)
df["DataDate"] = pd.to_datetime(df["DataDate"]).view('int64')
df["Expiration"] = pd.to_datetime(df["Expiration"], format='%m/%d/%Y')
df["measurement"] = ['options' for t in range(len(df))]
lines = [str(df["measurement"][d])
+ ",type=options"
+ " "
+ "ticker=" + str(df["UnderlyingSymbol"][d])
+ ",symbol=" + str(df["OptionSymbol"][d])
+ ",expDate=" + str(df["Expiration"][d])
+ ",cont=" + str(df["Type"][d])
+ ",price=" + str(df["UnderlyingPrice"][d])
+ ",strike=" + str(df["Strike"][d])
+ ",last=" + str(df["Last"][d])
+ ",bid=" + str(df["Bid"][d])
+ ",ask=" + str(df["Ask"][d])
+ ",volume=" + str(df["Volume"][d])
+ ",OI=" + str(df["OpenInterest"][d])
+ ",IV=" + str(df["IV"][d])
+ ",delta=" + str(df["Delta"][d])
+ ",gamma=" + str(df["Gamma"][d])
+ ",theta=" + str(df["Theta"][d])
+ ",vega=" + str(df["Vega"][d])
+ ",aka=" + str(df["AKA"][d])
+ " " + str(df["DataDate"][d]) for d in range(len(df))]
a = s.copyfile(dir + outFile, dir + 'import.txt')
file = open(a, 'a+')
for item in lines:
file.write("%s\n" % item)
file.close()
The importbase.txt
file is the template file. I need to append my data to its contents, and then save that as import.txt
for processing. Samples of importbase, import and my CSV below:
importbase.txt
# DDL
CREATE DATABASE import
# DML
# CONTEXT-DATABASE: import
import.txt
# DDL
CREATE DATABASE import
# DML
# CONTEXT-DATABASE: import
options,type=options ticker=A,symbol=A020420C00015000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=15.0,last=0.0,bid=21.0,ask=22.3,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0.0,theta=0.6848,vega=0.0aka=ADC 1017619200000000000
options,type=options ticker=A,symbol=A020420P00015000,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=15.0,last=0.0,bid=0.0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0.0,gamma=0.0,theta=0.0,vega=0.0aka=APC 1017619200000000000
options,type=options ticker=A,symbol=A020420C00017500,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=17.5,last=0.0,bid=18.7,ask=19.5,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0.0,theta=0.6371,vega=0.0aka=ADW 1017619200000000000
options,type=options ticker=A,symbol=A020420P00017500,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=17.5,last=0.0,bid=0.0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0.0,gamma=0.0,theta=0.0,vega=0.0aka=APW 1017619200000000000
20020401.csv
UnderlyingSymbol,UnderlyingPrice,Exchange,OptionSymbol,OptionExt,Type,Expiration,DataDate,Strike,Last,Bid,Ask,Volume,OpenInterest,IV,Delta,Gamma,Theta,Vega,AKA
A,36.53,*,A020420C00015000,,call,04/20/2002,04/01/2002,15,0,21,22.3,0,0,0.6145,0.9987,0,0.6848,0,ADC
A,36.53,*,A020420P00015000,,put,04/20/2002,04/01/2002,15,0,0,0.1,0,0,0.5964,0,0,0,0,APC
A,36.53,*,A020420C00017500,,call,04/20/2002,04/01/2002,17.5,0,18.7,19.5,0,0,0.6145,0.9987,0,0.6371,0,ADW
A,36.53,*,A020420P00017500,,put,04/20/2002,04/01/2002,17.5,0,0,0.1,0,0,0.5964,0,0,0,0,APW
A,36.53,*,A020420C00020000,,call,04/20/2002,04/01/2002,20,0,16.2,17,0,0,0.6145,0.9987,0,0.5882,0.0002,ADD
A,36.53,*,A020420P00020000,,put,04/20/2002,04/01/2002,20,0,0,0.1,0,0,0.5964,0,0,-0.0006,0.0001,APD
A,36.53,*,A020420C00022500,,call,04/20/2002,04/01/2002,22.5,5.7,13.7,14.5,0,5,0.6145,0.9985,0.0001,0.5092,0.0053,ADX
A,36.53,*,A020420P00022500,,put,04/20/2002,04/01/2002,22.5,0,0,0.1,0,1545,0.5964,-0.0001,0.0001,-0.0214,0.0036,APX
A,36.53,*,A020420C00025000,,call,04/20/2002,04/01/2002,25,11.6,11.2,12,0,545,0.6145,0.9963,0.0015,0.1189,0.0612,ADE
A,36.53,*,A020420P00025000,,put,04/20/2002,04/01/2002,25,0,0,0.1,0,882,0.5964,-0.0019,0.0012,-0.2882,0.0485,APE
A,36.53,*,A020420C00030000,,call,04/20/2002,04/01/2002,30,6.5,6.4,6.9,80,2357,0.6145,0.9314,0.0259,-6.1328,1.0651,ADF
I use bash a fair amount but wonder if something like sed or awk may be better suited to a lower level task like this where processing time is likely to be a concern. Any guidance on approach is appreciated.
Upvotes: 0
Views: 833
Reputation: 204310
Here's how to do the conversion in awk (using GNU awk for mktime()):
$ cat tst.awk
BEGIN {
n = split(" \
ticker UnderlyingSymbol \
symbol OptionSymbol \
expDate Expiration \
cont Type \
price UnderlyingPrice \
strike Strike \
last Last \
bid Bid \
ask Ask \
volume Volume \
OI OpenInterest \
IV IV \
delta Delta \
gamma Gamma \
theta Theta \
vega Vega \
aka AKA \
DataDate DataDate \
", outInNameMap)
for (i=1; i<n; i+=2) {
outFldName = outInNameMap[i]
inFldName = outInNameMap[i+1]
inNames2outNrs[inFldName] = ++numOutFlds
outNrs2outNames[numOutFlds] = outFldName
}
dateFmt["expDate"] = "iso8601"
dateFmt["DataDate"] = "epoch"
FS=OFS=","
}
FNR == 1 {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
inFldName = $inFldNr
outFldNr = inNames2outNrs[inFldName]
outNrs2inNrs[outFldNr] = inFldNr
}
next
}
{
printf "options,type=options"
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
outFldName = outNrs2outNames[outFldNr]
inFldNr = outNrs2inNrs[outFldNr]
fldVal = $inFldNr
if ( outFldName in dateFmt) {
split(fldVal,d,"/")
if ( dateFmt[outFldName] == "iso8601" ) {
fldVal = sprintf("%s-%s-%s 00:00:00", d[3], d[1], d[2])
}
else if ( dateFmt[outFldName] == "epoch" ) {
fldVal = mktime(sprintf("%s %s %s 00 00 00", d[3], d[1], d[2]))
}
}
pfx = (outFldNr==numOutFlds ? " " : (outFldNr>1 ? OFS : " ") outFldName "=")
printf "%s%s", pfx, fldVal
}
print ""
}
.
$ awk -f tst.awk file.csv
options,type=options ticker=A,symbol=A020420C00015000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=15,last=0,bid=21,ask=22.3,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0,theta=0.6848,vega=0,aka=ADC 1017640800
options,type=options ticker=A,symbol=A020420P00015000,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=15,last=0,bid=0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0,gamma=0,theta=0,vega=0,aka=APC 1017640800
options,type=options ticker=A,symbol=A020420C00017500,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=17.5,last=0,bid=18.7,ask=19.5,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0,theta=0.6371,vega=0,aka=ADW 1017640800
options,type=options ticker=A,symbol=A020420P00017500,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=17.5,last=0,bid=0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0,gamma=0,theta=0,vega=0,aka=APW 1017640800
options,type=options ticker=A,symbol=A020420C00020000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=20,last=0,bid=16.2,ask=17,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0,theta=0.5882,vega=0.0002,aka=ADD 1017640800
options,type=options ticker=A,symbol=A020420P00020000,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=20,last=0,bid=0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0,gamma=0,theta=-0.0006,vega=0.0001,aka=APD 1017640800
options,type=options ticker=A,symbol=A020420C00022500,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=22.5,last=5.7,bid=13.7,ask=14.5,volume=0,OI=5,IV=0.6145,delta=0.9985,gamma=0.0001,theta=0.5092,vega=0.0053,aka=ADX 1017640800
options,type=options ticker=A,symbol=A020420P00022500,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=22.5,last=0,bid=0,ask=0.1,volume=0,OI=1545,IV=0.5964,delta=-0.0001,gamma=0.0001,theta=-0.0214,vega=0.0036,aka=APX 1017640800
options,type=options ticker=A,symbol=A020420C00025000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=25,last=11.6,bid=11.2,ask=12,volume=0,OI=545,IV=0.6145,delta=0.9963,gamma=0.0015,theta=0.1189,vega=0.0612,aka=ADE 1017640800
options,type=options ticker=A,symbol=A020420P00025000,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=25,last=0,bid=0,ask=0.1,volume=0,OI=882,IV=0.5964,delta=-0.0019,gamma=0.0012,theta=-0.2882,vega=0.0485,aka=APE 1017640800
options,type=options ticker=A,symbol=A020420C00030000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=30,last=6.5,bid=6.4,ask=6.9,volume=80,OI=2357,IV=0.6145,delta=0.9314,gamma=0.0259,theta=-6.1328,vega=1.0651,aka=ADF 1017640800
I'm not exactly sure what that final value on your output lines is, it seems to be some variation of seconds since the epoch but I'll leave it as an exercise for you to fix that if necessary. I used more temporary variables than necessary and used clear names for all of them so it'd be obvious what's happening and so wouldn't need commenting to explain it and you should be able to build on it easily.
Upvotes: 2
Reputation: 7837
You could do it with awk, but you're better off with a CSV parser and, since you're using Python already, using it better is probably the answer.
The meat of your problem is to reproduce the CSV as a comma-separated list of name-value pairs. For that, this will do the job, and I would expect it to perform pretty well.
#! /usr/bin/env python3
import csv
fieldnames = ('type', 'ticker', 'symbol', 'expDate', 'cont',
'price', 'strike', 'last', 'bid', 'ask', 'volume',
'OI', 'IV', 'delta', 'gamma',
'UnderlyingSymbol', 'UnderlyingPrice', 'Exchange',
'OptionSymbol', 'OptionExt', 'Type', 'Expiration', 'DataDate',
'Strike', 'Last', 'Bid', 'Ask', 'Volume',
'OpenInterest', 'IV', 'Delta', 'Gamma', 'Theta', 'Vega', 'AKA')
with open('input.csv', newline='') as csvfile:
rows = csv.DictReader(csvfile, fieldnames)
for i, row in enumerate(rows):
if i == 0:
continue
line = ','.join( [ '%s=%s' % (k,v) for k,v in row.items() ] )
print( 'options,%s' % line )
producing output:
$ ./convert.py | nl
1 options,type= A,ticker=36.53,symbol=*,expDate=A020420P00015000,cont=,price=put,strike=04/20/2002,last=04/01/2002,bid=15,ask=0,volume=0,OI=0.1,IV=None,delta=0,gamma=0.5964,UnderlyingSymbol=0,UnderlyingPrice=0,Exchange=0,OptionSymbol=0,OptionExt=APC,Type=None,Expiration=None,DataDate=None,Strike=None,Last=None,Bid=None,Ask=None,Volume=None,OpenInterest=None,Delta=None,Gamma=None,Theta=None,Vega=None,AKA=None
2 options,type= A,ticker=36.53,symbol=*,expDate=A020420C00017500,cont=,price=call,strike=04/20/2002,last=04/01/2002,bid=17.5,ask=0,volume=18.7,OI=19.5,IV=None,delta=0,gamma=0.6145,UnderlyingSymbol=0.9987,UnderlyingPrice=0,Exchange=0.6371,OptionSymbol=0,OptionExt=ADW,Type=None,Expiration=None,DataDate=None,Strike=None,Last=None,Bid=None,Ask=None,Volume=None,OpenInterest=None,Delta=None,Gamma=None,Theta=None,Vega=None,AKA=None
3 options,type= A,ticker=36.53,symbol=*,expDate=A020420P00017500,cont=,price=put,strike=04/20/2002,last=04/01/2002,bid=17.5,ask=0,volume=0,OI=0.1,IV=None,delta=0,gamma=0.5964,UnderlyingSymbol=0,UnderlyingPrice=0,Exchange=0,OptionSymbol=0,OptionExt=APW,Type=None,Expiration=None,DataDate=None,Strike=None,Last=None,Bid=None,Ask=None,Volume=None,OpenInterest=None,Delta=None,Gamma=None,Theta=None,Vega=None,AKA=None
4 options,type= A,ticker=36.53,symbol=*,expDate=A020420C00020000,cont=,price=call,strike=04/20/2002,last=04/01/2002,bid=20,ask=0,volume=16.2,OI=17,IV=None,delta=0,gamma=0.6145,UnderlyingSymbol=0.9987,UnderlyingPrice=0,Exchange=0.5882,OptionSymbol=0.0002,OptionExt=ADD,Type=None,Expiration=None,DataDate=None,Strike=None,Last=None,Bid=None,Ask=None,Volume=None,OpenInterest=None,Delta=None,Gamma=None,Theta=None,Vega=None,AKA=None
Upvotes: 2
Reputation: 77387
I hesitate to post this as the answer because its just speculation. But you will likely save time by iterating the dataframe rows and formatting a single string instead of concatenating a bunch of strings.
import pandas as pd
import shutil as s
from datetime import datetime as dt
dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'
df = pd.read_csv(dir + f)
df["DataDate"] = pd.to_datetime(df["DataDate"]).view('int64')
df["Expiration"] = pd.to_datetime(df["Expiration"], format='%m/%d/%Y')
df["measurement"] = ['options' for t in range(len(df))]
a = s.copyfile(dir + outFile, dir + 'import.txt')
with open(a, 'a+') as file:
for index, row in df.iterrows():
file.write(
"{measurement},type=options "
"ticker={UnderlyingSymbol}"
",symbol={OptionSymbol}"
",expDate={Expiration}"
",cont={Type}"
",price={UnderlyingPrice}"
",strike={Strike}"
",last={Last}"
",bid={Bid}"
",ask={Ask}"
",volume={Volume}"
",OI={OpenInterest}"
",IV={IV}"
",delta={Delta}"
",gamma={Gamma}"
",theta={Theta}"
",vega={Vega}"
",aka={AKA}"
" {DataDate}".format_map(row))
In fact, there isn't a real advantage to pandas here. You take up a lot of space building the dataframe but everything needs to pass through a python variable for formatting anyway, so just use the regular CSV module
import csv
import shutil as s
from datetime import datetime as dt
dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'
a = s.copyfile(dir + outFile, dir + 'import.txt')
with open(dir + f) as in_fp, open(a, "a") as out_fp:
reader = csv.DictReader(in_fp)
for row in reader:
# todo: figure out conversions outside pandas
#row["DataDate"] = ?
#row["Expiration"] = ?
row["measurement"] = "options"
file.write(
"{measurement},type=options "
"ticker={UnderlyingSymbol}"
",symbol={OptionSymbol}"
",expDate={Expiration}"
",cont={Type}"
",price={UnderlyingPrice}"
",strike={Strike}"
",last={Last}"
",bid={Bid}"
",ask={Ask}"
",volume={Volume}"
",OI={OpenInterest}"
",IV={IV}"
",delta={Delta}"
",gamma={Gamma}"
",theta={Theta}"
",vega={Vega}"
",aka={AKA}"
" {DataDate}".format_map(row))
Upvotes: 3