I just started with Pandas in Python, and so far it has been very good.
I have a big CSV file and I want to read only a portion of it. According to the read_csv documentation, there is the option skiprows, which says:
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file
At first I thought that I could use this to read the first portion of my CSV file, process it, then read the second portion, and so on. But when I read the second portion, the headers are not there (because the header line was skipped).
I tried header=0, but as the documentation states:
header=0 denotes the first line of data rather than the first line of the file.
Then I saw that it's possible to read the file in chunks. That sounds great, but the documentation is not that clear to me, so here are my questions:
read_csv command with the skiprows option, and still read the header in the first line of the file? (I could still open the file, read the first line and use it as the header via the names option, but I don't really like this.)
Upvotes: 1
Answers:
For question 3 you could use the following, which skips lines 1 to 9 but keeps the first row (line 0) as the header:
pd.read_csv('test.csv', skiprows=range(1, 10))
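To read the file portion by portion with this approach, skiprows can be combined with nrows. Here is a minimal sketch, assuming a small in-memory CSV in place of 'test.csv'; the helper read_portion and the sample data are made up for illustration and are not part of pandas:

```python
from io import StringIO

import pandas as pd

# Hypothetical 10-row CSV standing in for 'test.csv'.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(10))


def read_portion(source, start, size):
    """Read `size` data rows starting at data row `start` (0-based), keeping the header."""
    # File line 0 is the header, so data row k sits on file line k + 1.
    # Skipping file lines 1..start leaves the header plus the rows we want.
    return pd.read_csv(source, skiprows=range(1, start + 1), nrows=size)


# A fresh StringIO is used for each call because the buffer is consumed by reading.
first = read_portion(StringIO(csv_text), start=0, size=4)
second = read_portion(StringIO(csv_text), start=4, size=4)
print(second)
```

Note that each call parses the file from the top again, so for many portions of a large file the chunksize option is likely the better fit.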
As @iled pointed out in the comments, take a look at the example with chunks. Example:
import pandas as pd
import numpy as np
from io import StringIO
np.random.seed(10)
df1 = pd.DataFrame(np.random.randn(10,5), columns=['a','b','c','d','e'])
In [29]: df1
Out[29]:
a b c d e
0 1.331587 0.715279 -1.545400 -0.008384 0.621336
1 -0.720086 0.265512 0.108549 0.004291 -0.174600
2 0.433026 1.203037 -0.965066 1.028274 0.228630
3 0.445138 -1.136602 0.135137 1.484537 -1.079805
4 -1.977728 -1.743372 0.266070 2.384967 1.123691
5 1.672622 0.099149 1.397996 -0.271248 0.613204
6 -0.267317 -0.549309 0.132708 -0.476142 1.308473
7 0.195013 0.400210 -0.337632 1.256472 -0.731970
8 0.660232 -0.350872 -0.939433 -0.489337 -0.804591
9 -0.212698 -0.339140 0.312170 0.565153 -0.147420
data = df1.to_string(index=False)
# In your case you don't need sep, because you are reading an ordinary CSV file
chunks = pd.read_csv(StringIO(data), sep=r'\s+', chunksize=3)
In [40]: for chunk in chunks:
   ....:     print(chunk)
   ....:
a b c d e
0 1.331587 0.715279 -1.545400 -0.008384 0.621336
1 -0.720086 0.265512 0.108549 0.004291 -0.174600
2 0.433026 1.203037 -0.965066 1.028274 0.228630
a b c d e
0 0.445138 -1.136602 0.135137 1.484537 -1.079805
1 -1.977728 -1.743372 0.266070 2.384967 1.123691
2 1.672622 0.099149 1.397996 -0.271248 0.613204
a b c d e
0 -0.267317 -0.549309 0.132708 -0.476142 1.308473
1 0.195013 0.400210 -0.337632 1.256472 -0.731970
2 0.660232 -0.350872 -0.939433 -0.489337 -0.804591
a b c d e
0 -0.212698 -0.33914 0.31217 0.565153 -0.14742
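For an ordinary comma-separated file, the same idea can be used to process one chunk at a time and combine the results. This is only a sketch under assumptions: the in-memory csv_text stands in for your real file, and summing column b is just a placeholder for whatever processing you need:

```python
from io import StringIO

import pandas as pd

# Hypothetical 10-row CSV; with a real file, pass its path instead of StringIO.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

total = 0
chunk_sizes = []
chunk_columns = []

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=3):
    chunk_columns.append(list(chunk.columns))  # header is parsed once, reused for every chunk
    chunk_sizes.append(len(chunk))
    total += int(chunk["b"].sum())  # placeholder per-chunk processing

print(chunk_sizes, total)
```

Every chunk carries the column names from the first line of the file, so there is no need to pass the header separately through names.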
Upvotes: 1