Reputation: 44465
I tend to import .csv files into pandas, but sometimes I get data in other formats to make DataFrame objects.
Today I found out about read_table as a "generic" importer for other formats, and wondered whether there are significant performance differences between the various pandas methods for reading .csv files, e.g. read_table, from_csv, read_excel.
How do these methods compare with read_csv?
Is read_csv much different than from_csv for creating a DataFrame?
Upvotes: 22
Views: 28716
Reputation: 679
I've found that CSV and tab-delimited text (.txt) are equivalent in read and write speed, and both are much faster than reading and writing MS Excel files. However, the Excel format compresses the file size a lot.
For the same data as a 320 MB CSV file (16 MB as .xlsx), on an i7-7700k with an SSD, running Anaconda Python 3.5.3 and pandas 0.19.2, and using the standard convention import pandas as pd:
2 seconds to read .csv: df = pd.read_csv('foo.csv') (same for pd.read_table)
15.3 seconds to read .xlsx: df = pd.read_excel('foo.xlsx')
10.5 seconds to write .csv: df.to_csv('bar.csv', index=False) (same for .txt)
34.5 seconds to write .xlsx: df.to_excel('bar.xlsx', sheet_name='Sheet1', index=False)
To write your dataframes to tab-delimited text files you can use:
df.to_csv('bar.txt', sep='\t', index=False)
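For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch. It assumes a small synthetic DataFrame and in-memory buffers instead of the 320 MB file above, and the helper time_call is a hypothetical name introduced only for this example. Excel is left out because it needs an extra engine package. Absolute timings will differ by machine and pandas version.

```python
import io
import time

import numpy as np
import pandas as pd

# Small synthetic frame (assumption: stands in for the real data set).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10_000, 5)), columns=list("abcde"))


def time_call(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# to_csv returns the text as a string when no path is given.
csv_text, csv_write_s = time_call(df.to_csv, index=False)
tsv_text, tsv_write_s = time_call(df.to_csv, sep="\t", index=False)

# Read each text back with the matching reader.
from_csv_df, csv_read_s = time_call(pd.read_csv, io.StringIO(csv_text))
from_tsv_df, tsv_read_s = time_call(pd.read_table, io.StringIO(tsv_text))

print(f"write: csv {csv_write_s:.4f}s, tab-delimited {tsv_write_s:.4f}s")
print(f"read:  csv {csv_read_s:.4f}s, tab-delimited {tsv_read_s:.4f}s")
```

A single run like this is noisy; repeating each call several times and taking the best or the median gives a fairer comparison.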
Upvotes: 14
Reputation: 526
read_table is read_csv with sep=',' replaced by sep='\t'. They are two thin wrappers around the same function, so the performance will be identical. read_excel uses the xlrd package to read xls and xlsx files into a DataFrame; it doesn't handle csv files.
from_csv calls read_table, so no.
Upvotes: 34
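The equivalence of the two readers can be checked with a short sketch like the one below. The sample data is made up for illustration, and from_csv is not exercised because, as far as I know, it was deprecated and later removed in newer pandas releases.

```python
import io

import pandas as pd

# Made-up sample data: the same table in tab- and comma-delimited form.
tsv_text = "name\tvalue\nalpha\t1\nbeta\t2\n"
csv_text = tsv_text.replace("\t", ",")

# read_table defaults to a tab separator, read_csv to a comma.
via_read_table = pd.read_table(io.StringIO(tsv_text))
via_read_csv = pd.read_csv(io.StringIO(csv_text))

# Either reader can parse the other's format once sep is set explicitly.
via_read_csv_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")
via_read_table_comma = pd.read_table(io.StringIO(csv_text), sep=",")

print(via_read_table.equals(via_read_csv))
print(via_read_table.equals(via_read_csv_tab))
print(via_read_table.equals(via_read_table_comma))
```

All three comparisons should print True, since the only difference between the calls is the separator.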