Rick

Reputation: 521

Pandas: read head and middle of a file

I just started with Pandas in Python, and so far it's going very well.

I have a big CSV file and I want to read only a portion of it. According to the read_csv documentation, there's a skiprows option, described as:

skiprows : list-like or integer, default None

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file

At first I thought I could use this to read the first portion of my CSV file, process it, then read the second portion, and so on. But when I read the second portion, the headers are not there (because the header line was skipped).

I tried header=0 but as the documentation states:

header=0 denotes the first line of data rather than the first line of the file.

Then I saw that it's possible to read a file in chunks. That sounds great, but the documentation is not that clear to me, so here are my questions:

  1. For each chunk, does the row index continue from the last value of the previous chunk plus 1, or does it restart at zero?
  2. Is the header set for each chunk?
  3. Is it possible to use read_csv with the skiprows option and still read the header from the first line of the file? (I could open the file myself, read the first line and pass it as the header via the names option, but I don't really like this.)
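For reference, the fallback mentioned in point 3 might look roughly like the sketch below. The sample data and the variable names (csv_text, header, part) are made up for illustration, and StringIO stands in for a real file; this is only one way to do it, not necessarily the best one.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data standing in for the big CSV file
csv_text = "a,b,c\n" + "\n".join(f"{i},{i * 10},{i * 100}" for i in range(10)) + "\n"

# Read only the header line: nrows=0 gives an empty frame with the column names
header = pd.read_csv(StringIO(csv_text), nrows=0).columns.tolist()

# Skip the header line plus the first 5 data rows, then read 3 rows,
# supplying the column names explicitly
part = pd.read_csv(StringIO(csv_text), skiprows=6, nrows=3, header=None, names=header)
print(part)
```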

Upvotes: 1

Views: 3493

Answers (1)

Anton Protopopov

Reputation: 31682

Answers:

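  1. It restarts at zero with each chunk, at least in the pandas version that produced the output below; newer pandas releases continue the index across chunks instead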
  2. Yes
  3. No

For question 3, you could use the following to keep the first row as the header while skipping the rows after it:

pd.read_csv('test.csv', skiprows=range(1, 10))
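To read a slice from the middle of the file, the same skiprows trick can be combined with nrows. Here is a minimal sketch, assuming a small made-up dataset (csv_text is hypothetical and StringIO stands in for 'test.csv'):

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data: a header line followed by 20 data rows
csv_text = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(20)) + "\n"

# Line 0 is the header and is kept; lines 1..9 (the first nine data rows)
# are skipped; nrows=5 then limits the read to the next five data rows
middle = pd.read_csv(StringIO(csv_text), skiprows=range(1, 10), nrows=5)
print(middle)
```

Because skiprows is given as a list-like of line numbers that does not include 0, the header line is still parsed as usual.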

As @iled pointed out in the comments, take a look at the example with chunks. Example:

import pandas as pd
import numpy as np
from io import StringIO

np.random.seed(10)

df1 = pd.DataFrame(np.random.randn(10,5), columns=['a','b','c','d','e'])

In [29]: df1
Out[29]: 
          a         b         c         d         e
0  1.331587  0.715279 -1.545400 -0.008384  0.621336
1 -0.720086  0.265512  0.108549  0.004291 -0.174600
2  0.433026  1.203037 -0.965066  1.028274  0.228630
3  0.445138 -1.136602  0.135137  1.484537 -1.079805
4 -1.977728 -1.743372  0.266070  2.384967  1.123691
5  1.672622  0.099149  1.397996 -0.271248  0.613204
6 -0.267317 -0.549309  0.132708 -0.476142  1.308473
7  0.195013  0.400210 -0.337632  1.256472 -0.731970
8  0.660232 -0.350872 -0.939433 -0.489337 -0.804591
9 -0.212698 -0.339140  0.312170  0.565153 -0.147420

data = df1.to_string(index=False)
# In your case you don't need sep, because you are reading an ordinary CSV file
# (a raw string avoids an invalid-escape warning for the regex separator)
chunks = pd.read_csv(StringIO(data), sep=r'\s+', chunksize=3)

In [40]: for chunk in chunks:
   ....:     print(chunk)
   ....:     
          a         b         c         d         e
0  1.331587  0.715279 -1.545400 -0.008384  0.621336
1 -0.720086  0.265512  0.108549  0.004291 -0.174600
2  0.433026  1.203037 -0.965066  1.028274  0.228630
          a         b         c         d         e
0  0.445138 -1.136602  0.135137  1.484537 -1.079805
1 -1.977728 -1.743372  0.266070  2.384967  1.123691
2  1.672622  0.099149  1.397996 -0.271248  0.613204
          a         b         c         d         e
0 -0.267317 -0.549309  0.132708 -0.476142  1.308473
1  0.195013  0.400210 -0.337632  1.256472 -0.731970
2  0.660232 -0.350872 -0.939433 -0.489337 -0.804591
          a        b        c         d        e
0 -0.212698 -0.33914  0.31217  0.565153 -0.14742
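If you want to process each chunk and then stitch the results back together, something like the following sketch may help. The data and names (csv_text, processed, result) are made up for illustration, and the even-number filter is just a placeholder for your real processing. Passing ignore_index=True to pd.concat gives a clean 0..n-1 index no matter how the per-chunk index is numbered.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data: a header line followed by 10 data rows
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)) + "\n"

processed = []
for chunk in pd.read_csv(StringIO(csv_text), chunksize=3):
    # Each chunk is a DataFrame with the column names from the header line
    print(list(chunk.columns))
    # Placeholder processing step: keep only rows where 'a' is even
    processed.append(chunk[chunk["a"] % 2 == 0])

# Combine the processed pieces and renumber the rows from zero
result = pd.concat(processed, ignore_index=True)
print(result)
```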

Upvotes: 1
