Vincent

Reputation: 23

Python & Pandas. How to work with a subset of the "chunks" in a TextFileReader object?

I need to import a large .txt file (approx. 10GB) to do some calculations. I'm using Pandas in Python 2.7.

Basically, I need to compute sums and averages of certain series (columns), conditional on the values of other series. To be more precise: I have basic information on individuals living in a country, and, for example, I want to take the average age of the people in each municipality.

I cannot import the whole file (because it is too big), so I am doing it in "chunks" (using read_table, chunksize). For each calculation, I don't need all of the chunks, just a subset of them.

Since the information might not be ordered, I first iterate over all chunks to identify which of them have information for each of the municipalities. Hence, for each municipality I have a list with the indices of the chunks that contain at least one observation belonging to it.

I would then like to use this list to select only those chunks, but I haven't been able to do it in a fast way. The only thing that seems to work is to iterate over all the chunks again.
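
Roughly, this is what I am doing now; the file name and column names below are just placeholders:

import pandas as pd

# First pass: record which chunks contain each municipality.
chunk_map = {}
for i, chunk in enumerate(pd.read_table('data.txt', chunksize=10**5)):
    for mun in chunk['municipality'].unique():
        chunk_map.setdefault(mun, []).append(i)

# Second pass: even though I only need the chunks listed in chunk_map,
# the only way I have found is to walk through every chunk again.
wanted = chunk_map['some_municipality']
parts = []
for i, chunk in enumerate(pd.read_table('data.txt', chunksize=10**5)):
    if i in wanted:
        parts.append(chunk[chunk['municipality'] == 'some_municipality'])
subset = pd.concat(parts, ignore_index=True)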

Is there a way to directly select a subset of the "chunks" in a TextFileReader object without having to iterate over all of them?

Upvotes: 2

Views: 4233

Answers (2)

NikhilR

Reputation: 1

You can use:

import pandas as pd

# Read the file in chunks of 5 rows and process each chunk in turn.
for df1 in pd.read_csv('______.csv', chunksize=5):
    print(df1)
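
Each df1 in the loop is an ordinary DataFrame, so the print call can be replaced with whatever per-chunk processing you need.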

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210882

I would try to do it this way:

res = \
pd.concat([df.assign(age=(pd.datetime.now() - df.dob).astype('m8[Y]').astype(int))
             .groupby(['country','municipality'])['age'].agg(['size','sum']).reset_index()
           for df in pd.read_csv('/path/to/file.txt', sep=..., chunksize=10**5) ],
          ignore_index=True)

res = res.groupby(['country','municipality'], as_index=False).sum()

This will give you the total number of individuals in each municipality (the size column) and the sum of their ages (the sum column).

sum/size will give you the average age per municipality.
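
For example, keeping the size and sum column names produced above, the average can be added as a new column (avg_age here is just an illustrative name):

res['avg_age'] = res['sum'] / res['size']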

UPDATE: you can use the following trick in order to calculate the age on the fly:

In [164]: df
Out[164]:
   country municipality        dob
0  Ukraine           m1 1950-01-01
1  Ukraine           m1 1960-12-14
2      USA           m2 1971-11-27
3      USA           m2 1982-11-09
4      USA           m3 1993-10-22
5  Germany           m1 2004-10-04
6  Germany           m2 2015-09-17

In [165]: df.assign(age=(pd.datetime.now() - df.dob).astype('m8[Y]').astype(int))
Out[165]:
   country municipality        dob  age
0  Ukraine           m1 1950-01-01   67
1  Ukraine           m1 1960-12-14   56
2      USA           m2 1971-11-27   45
3      USA           m2 1982-11-09   34
4      USA           m3 1993-10-22   23
5  Germany           m1 2004-10-04   12
6  Germany           m2 2015-09-17    1

UPDATE2: pd.read_csv() returns a pandas.io.parsers.TextFileReader instead of a DataFrame as soon as you specify chunksize:

In [6]: reader = pd.read_csv(r'D:\temp\.data\1.csv', chunksize=3, sep='\s+')

In [7]: type(reader)
Out[7]: pandas.io.parsers.TextFileReader

Get the first 2 rows:

In [12]: reader.get_chunk(2)
Out[12]:
      foo  foo.1   bar  bar.1  spam  spam.1
foo  0.00   0.35  0.83   0.84  0.90    0.89
foo  0.35   0.00  0.86   0.85  0.92    0.91

Get the next 3 rows:

In [13]: reader.get_chunk(3)
Out[13]:
       foo  foo.1   bar  bar.1  spam  spam.1
bar   0.83   0.86  0.00   0.25  0.88    0.87
bar   0.84   0.85  0.25   0.00  0.82    0.86
spam  0.90   0.92  0.88   0.82  0.00    0.50
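
Relating this back to the question: the reader can also be consumed as a plain iterator, so pre-identified chunks can be picked out by their position, although the file is still parsed sequentially up to the last chunk you need. A minimal sketch, assuming wanted_chunks holds the chunk indices found in a first pass:

import pandas as pd

wanted_chunks = {3, 7, 12}   # hypothetical chunk indices from the first pass
reader = pd.read_csv('/path/to/file.txt', sep=r'\s+', chunksize=10**5)

parts = [chunk for i, chunk in enumerate(reader) if i in wanted_chunks]
subset = pd.concat(parts, ignore_index=True)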

Upvotes: 2
