Andy

Reputation: 50640

How can I read only the header column of a CSV file using Python?

I am looking for a way to read just the header row of a large number of large CSV files.

Using pandas, I can do this for each CSV file:

>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns

I could do this with just the csv module:

>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames

The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.

My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.

How can I extract only the header row of a CSV file, quickly?

Upvotes: 40

Views: 128305

Answers (10)

blessedk

Reputation: 71

If you are only interested in the headers and would like to use pandas, the only extra argument you need to pass besides the CSV file name is "nrows=0":

headers = pd.read_csv("test.csv", nrows=0).columns.tolist()

Upvotes: 1

Sway Wu

Reputation: 391

This is easy. You can use:

df = pd.read_csv("path.csv", nrows=2)
df.columns.to_list()

This way only a couple of rows are read to get the header.

Upvotes: 1

Saurabh Chandra Patel

Reputation: 13644

You are missing the nrows=1 parameter to read_csv:

>>> df = pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns

Upvotes: 7

Aaksh Kumar

Reputation: 9

import pandas as pd

get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)

Upvotes: -1

Jarno

Reputation: 7242

Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')

In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']

pandas can have the advantage that it deals more gracefully with CSV encodings.

Upvotes: 46

Muhieddine Alkousy

Reputation: 11

It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and very fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:

import glob
import os

for filename in glob.glob(os.path.join(files_path, "*.csv")):
    with open(filename) as f:
        first_line = f.readline()  # raw header line, including the trailing newline
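
If the raw header string needs to become a list of column names, one option is to pass it through csv.reader so that quoted names containing the delimiter are kept whole. This is only a sketch: split_header_line is a hypothetical helper name, and it assumes the header fits on a single physical line (no newlines inside quoted column names).

```python
import csv


def split_header_line(first_line, delimiter=","):
    """Split one raw header line into column names, honoring CSV quoting."""
    # csv.reader accepts any iterable of strings, so a one-item list is enough.
    # next(..., []) returns an empty list if the reader yields nothing.
    return next(csv.reader([first_line], delimiter=delimiter), [])


raw = 'id,"last, first",email\n'
print(split_header_line(raw))  # ['id', 'last, first', 'email']
```

A plain first_line.strip().split(",") would break the quoted "last, first" name into two columns, which is the main reason to go through the csv module here.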

Upvotes: 1

mdubez

Reputation: 3144

What about:

pandas.read_csv(PATH_TO_CSV, nrows=1).columns

That reads only the header and the first data row, and returns the columns found.

Upvotes: 14

Tyler

Reputation: 1050

I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because of its Unicode handling. This is very close to your original suggestion, except that I'm only reading in one row rather than the whole file.

import csv

# newline='' is what the csv module docs recommend when opening CSV files
with open(fpath, 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames

Hopefully that helps!
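
On the Unicode point: files exported from some spreadsheet tools start with a UTF-8 byte order mark, which otherwise ends up glued to the first column name. A minimal sketch of one way to handle that, assuming the files are UTF-8 (read_header is a hypothetical helper name, and the default encoding is my assumption, not something stated in the question):

```python
import csv


def read_header(path, encoding="utf-8-sig"):
    """Return the first row of a CSV file as a list of column names."""
    # "utf-8-sig" decodes plain UTF-8 and also drops a leading BOM if present.
    with open(path, newline="", encoding=encoding) as infile:
        # Only the first record is parsed; an empty file gives an empty list.
        return next(csv.reader(infile), [])
```

Calling read_header(fpath) on each file and feeding the results into a set gives the unique column names without parsing any data rows.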

Upvotes: 27

Jon Clements

Reputation: 142256

I've used iglob as an example to search for the .csv files; one way is to collect the headers in a set, then adjust as necessary, e.g.:

import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, newline='') as fin:  # text mode: on Python 3, csv.reader needs str, not bytes
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))

Upvotes: 15

Jeff

Reputation: 129068

Here's one way. You get 1 row.

In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')

In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]: 
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505

Upvotes: 13
