kma

Reputation: 27

Read_CSV file faster

I'm having a bit of trouble reading a 203 MB file quickly into a pandas DataFrame. I want to know if there is a faster way to do this. Below is my function:

import pandas as pd
import numpy as np

def file(filename):
    df = pd.read_csv(filename, header=None, sep='delimiter', engine='python', skiprows=1)
    df = pd.DataFrame(df[0].str.split(',').tolist())
    df = df.drop(df.columns[range(4,70)], axis=1)
    df.columns = ['time','id1','id2','amount']
    return df

When I used the %timeit magic function, it took about 6 seconds to read the file and load it into a Python notebook. What can I do to speed this up?

Thanks!

Upvotes: 1

Views: 2781

Answers (1)

MaxU - stand with Ukraine

Reputation: 210972

UPDATE: looking at your logic, you don't seem to need sep='delimiter' at all, since you only use (split) the first (index=0) column, so you can simply do this:

df = pd.read_csv(filename, header=None, usecols=[0,1,2,3],
                 names=['time','id1','id2','amount'],
                 skipinitialspace=True, skiprows=1)
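A quick sanity check of that call on synthetic data (the 70-column layout is assumed from the drop(df.columns[range(4,70)]) in the question; the values are made up):

```python
import io
import pandas as pd

# Synthetic stand-in for the real file: one header row (skipped below)
# followed by rows of 70 comma-separated fields.
header = ",".join("c%d" % i for i in range(70))
row1 = ",".join(str(i) for i in range(70))
row2 = ",".join(str(i + 100) for i in range(70))
csv = io.StringIO("\n".join([header, row1, row2]))

# usecols limits parsing to the first four columns; names labels them
df = pd.read_csv(csv, header=None, usecols=[0, 1, 2, 3],
                 names=['time', 'id1', 'id2', 'amount'],
                 skipinitialspace=True, skiprows=1)
```

Only the four requested columns ever get materialized, which is where the speedup over parse-everything-then-drop comes from.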

PS: by default read_csv() will use the C engine, which is faster, as long as sep is no longer than one character or is '\s+'
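You can see the fallback happen: a multi-character separator (other than '\s+') is treated as a regex and pushes pandas onto the slower Python engine, which it announces with a ParserWarning. A small sketch with a made-up '::' separator:

```python
import io
import warnings
import pandas as pd

data = "1::2::3\n4::5::6"

# Capture warnings so we can inspect the engine-fallback notice
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(data), sep="::", header=None)

# True when pandas warned about falling back to the python engine
fell_back = any("python" in str(w.message).lower() for w in caught)
```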

OLD answer:

First of all, don't read columns that you don't need (i.e. those you are going to drop anyway: df.drop(df.columns[range(4,70)], axis=1)):

df = pd.read_csv(filename, header=None, usecols=[0], names=['txt'],
                 sep='delimiter', skiprows=1)

then split the single parsed column into four:

df[['time','id1','id2','amount']] = df.pop('txt').str.split(',', expand=True)
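For example, on a toy frame (made-up values): pop() removes 'txt' from the DataFrame while expand=True spreads the split pieces into one column per field:

```python
import pandas as pd

# One raw column of comma-joined fields, as the single-column read produces
df = pd.DataFrame({'txt': ['09:00,17,42,3.50', '09:01,18,43,1.25']})

df[['time', 'id1', 'id2', 'amount']] = df.pop('txt').str.split(',', expand=True)
```

Note the resulting columns are still strings; cast them (e.g. with pd.to_numeric) if you need numeric dtypes.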

PS: I would strongly recommend converting your data to HDF5 format (if you can) and forgetting about all those problems with CSV files ;)
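A minimal sketch of that round-trip (file name and data are made up; HDF5 support requires the optional PyTables package, hence the guard):

```python
import pandas as pd

df = pd.DataFrame({'time': ['09:00', '09:01'],
                   'id1': [17, 18], 'id2': [42, 43],
                   'amount': [3.50, 1.25]})

# Pay the parsing cost once; every later load skips CSV parsing entirely
try:
    df.to_hdf('data.h5', key='df', mode='w')
    df2 = pd.read_hdf('data.h5', key='df')
except ImportError:  # PyTables not installed: pip install tables
    df2 = df.copy()
```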

Upvotes: 2

Related Questions