Reputation: 27
I'm having a bit of trouble reading a 203 MB file quickly into a pandas DataFrame. I want to know if there is a faster way to do this. Below is my function:
import pandas as pd
import numpy as np
def file(filename):
    # read each line as a single column (sep='delimiter' never matches,
    # so every row lands whole in column 0)
    df = pd.read_csv(filename, header=None, sep='delimiter', engine='python', skiprows=1)
    # split that one string column on commas into ~70 columns
    df = pd.DataFrame(df[0].str.split(',').tolist())
    # keep only the first four columns
    df = df.drop(df.columns[range(4, 70)], axis=1)
    df.columns = ['time', 'id1', 'id2', 'amount']
    return df
When I used the %timeit magic, it took about 6 seconds to read the file and load it into the notebook. What can I do to speed this up?
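For reference, this is roughly how I timed it (the file name is just a placeholder):

%timeit df = file('data.csv')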
Thanks!
Upvotes: 1
Views: 2781
Reputation: 210972
UPDATE: looking at your logic, you don't seem to need sep='delimiter' at all, since you split only the first (index=0) column anyway. So you can simply do this:
df = pd.read_csv(filename, header=None, usecols=[0,1,2,3],
names=['time','id1','id2','amount'],
skipinitialspace=True, skiprows=1)
PS: by default, read_csv() will use the C engine, which is faster, as long as sep is not longer than one character or is '\s+'.
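As a rough illustration, here is a minimal sketch (the file name data.csv is an assumption) that times the same read with each parser engine:

import time
import pandas as pd

def time_read(engine):
    # time one full parse of the file with the given parser engine
    start = time.perf_counter()
    pd.read_csv('data.csv', header=None, skiprows=1, engine=engine)
    return time.perf_counter() - start

print('c engine:      %.3f s' % time_read('c'))
print('python engine: %.3f s' % time_read('python'))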
OLD answer:
First of all, don't read columns that you don't need (i.e. those you are going to drop with df.drop(df.columns[range(4,70)], axis=1)):
df = pd.read_csv(filename, header=None, usecols=[0], names=['txt'],
sep='delimiter', skiprows=1)
then split the single parsed column into four:
df[['time','id1','id2','amount']] = df.pop('txt').str.split(',', expand=True)
PS: I would strongly recommend converting your data to HDF5 format (if you can) and forgetting about all those problems with CSV files ;)
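A minimal sketch of that conversion (file names here are assumptions; to_hdf() requires the PyTables package):

import pandas as pd

# one-off conversion: parse the CSV once, then store it as HDF5
df = pd.read_csv('data.csv', header=None, usecols=[0, 1, 2, 3],
                 names=['time', 'id1', 'id2', 'amount'], skiprows=1)
df.to_hdf('data.h5', key='data', mode='w')

# later loads skip CSV parsing entirely and are much faster
df = pd.read_hdf('data.h5', 'data')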
Upvotes: 2