pylang

Reputation: 44465

Performance difference in pandas read_table vs. read_csv vs. from_csv vs. read_excel?

I tend to import .csv files into pandas, but sometimes I may get data in other formats to make DataFrame objects.

Today, I just found out about read_table as a "generic" importer for other formats, and wondered if there were significant performance differences between the various methods in pandas for reading .csv files, e.g. read_table, from_csv, read_excel.

  1. Do these other methods have better performance than read_csv?
  2. Is read_csv much different than from_csv for creating a DataFrame?

Upvotes: 22

Views: 28716

Answers (2)

griffinc

Reputation: 679

I've found that CSV and tab-delimited text (.txt) are equivalent in read and write speed, and both are much faster than reading and writing MS Excel files. However, the Excel format compresses the file size a lot.


Timings for the same data stored as a 320 MB CSV file and as a 16 MB .xlsx file (i7-7700K, SSD, Anaconda Python 3.5.3, pandas 0.19.2):

Using the standard convention import pandas as pd

2 seconds to read .csv: df = pd.read_csv('foo.csv') (same for pd.read_table)

15.3 seconds to read .xlsx: df = pd.read_excel('foo.xlsx')

10.5 seconds to write .csv: df.to_csv('bar.csv', index=False) (same for .txt)

34.5 seconds to write .xlsx: df.to_excel('bar.xlsx', sheet_name='Sheet1', index=False)
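
If you want to reproduce this kind of comparison on your own machine, here is a minimal timing sketch. It is not the script used for the numbers above: the time_call helper and the small generated dataset are assumptions made for illustration, and the data lives in memory rather than in a 320 MB file, so absolute timings will differ.

```python
import io
import time

import pandas as pd


def time_call(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Build a small comma-separated dataset in memory (made-up data).
rows = 10_000
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(rows))

# Time read_csv with its default comma separator.
df_csv, csv_seconds = time_call(pd.read_csv, io.StringIO(csv_text))

# Time read_table on the same text, passing the comma separator explicitly.
df_table, table_seconds = time_call(pd.read_table, io.StringIO(csv_text), sep=",")

print(f"read_csv: {csv_seconds:.4f}s, read_table: {table_seconds:.4f}s")
```

The same helper could wrap pd.read_excel or df.to_excel, though those calls need an Excel engine package installed, so they are left out here to keep the sketch runnable with pandas alone.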


To write your dataframes to tab-delimited text files you can use:

df.to_csv('bar.txt', sep='\t', index=False)
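
As a small sketch of the round trip, the example below writes a DataFrame as tab-delimited text and reads it back with read_table. The DataFrame contents are made up, and an in-memory buffer stands in for 'bar.txt' so nothing is written to disk.

```python
import io

import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

# Write tab-delimited text into an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)

# Rewind the buffer and read it back; read_table expects tabs by default.
buffer.seek(0)
restored = pd.read_table(buffer)

print(restored)
```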

Upvotes: 14

Daniel Boline

Reputation: 526

  1. read_table is read_csv with sep=',' replaced by sep='\t'. They are two thin wrappers around the same function, so the performance will be identical. read_excel uses the xlrd package to read xls and xlsx files into a DataFrame; it doesn't handle csv files.
  2. from_csv calls read_table, so no.
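
A quick way to check the first point yourself, as a minimal sketch: the tab-delimited sample below is made up, and the comparison only shows that both functions produce the same DataFrame, not how they are implemented internally.

```python
import io

import pandas as pd

# Small tab-delimited sample held in memory (made-up data).
tsv_text = "a\tb\n1\t2\n3\t4\n"

# read_table uses a tab separator by default ...
df_table = pd.read_table(io.StringIO(tsv_text))

# ... so it should match read_csv with sep='\t' passed explicitly.
df_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

same_result = df_table.equals(df_csv)
print(same_result)
```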

Upvotes: 34
