Reputation: 23
I need to import a large .txt
file (approx. 10GB) to do some calculations. I'm using Pandas in Python 2.7.
Basically, I need to construct sums and averages of certain series (columns), conditional on the value of other series. To be more precise: I have basic information on individuals living in a country, and, for example, I want to take the average age of the people in each municipality.
I cannot import the whole file (because it is too big), so I am doing it in "chunks" (using read_table
, chunksize
).
For each calculation, I don't need all of the chunks, just a subset of them.
Since the information might not be ordered, I first iterate over all chunks to identify which of them have information for each of the municipalities. Hence, for each municipality I have a list with the indices of the chunks that contain at least one observation belonging to it.
I would then like to use this list to select only those chunks, but I'm not being able to do it in a fast way. The only thing that seems to work is to iterate over all chunks again.
Is there a way to directly select a subset of the "chunks" in a TextFileReader
object without having to iterate over all of them?
Upvotes: 2
Views: 4233
Reputation: 1
You can use:
for df1 in pd.read_csv('______.csv',chunksize=5):
print(df1)
df
Upvotes: 0
Reputation: 210882
I would try to do it this way:
res = \
pd.concat([df.assign(age=(pd.datetime.now() - df.dob).astype('m8[Y]').astype(int))
.groupby(['country','municipality'])['age'].agg(['size','sum']).reset_index()
for df in pd.read_csv('/path/to/file.txt', sep=..., chunksize=10**5) ],
ignore_index=True)
res = res.groupby(['country','municipality'], as_index=False).sum()
This will give you a total number of individuals in each municipality (size
column) and a sum of their ages (sum
column).
sum/size
- will give you an average age per municipality
UPDATE: you can use the following trick in order to calculate the age on the fly:
In [164]: df
Out[164]:
country municipality dob
0 Ukraine m1 1950-01-01
1 Ukraine m1 1960-12-14
2 USA m2 1971-11-27
3 USA m2 1982-11-09
4 USA m3 1993-10-22
5 Germany m1 2004-10-04
6 Germany m2 2015-09-17
In [165]: df.assign(age=(pd.datetime.now() - df.dob).astype('m8[Y]').astype(int))
Out[165]:
country municipality dob age
0 Ukraine m1 1950-01-01 67
1 Ukraine m1 1960-12-14 56
2 USA m2 1971-11-27 45
3 USA m2 1982-11-09 34
4 USA m3 1993-10-22 23
5 Germany m1 2004-10-04 12
6 Germany m2 2015-09-17 1
UPDATE2: pd.read_csv()
returns a pandas.io.parsers.TextFileReader
instead of DataFrame
as soon as you specify chunksize
:
In [6]: reader = pd.read_csv(r'D:\temp\.data\1.csv', chunksize=3, sep='\s+')
In [7]: type(reader)
Out[7]: pandas.io.parsers.TextFileReader
get 2 rows
In [12]: reader.get_chunk(2)
Out[12]:
foo foo.1 bar bar.1 spam spam.1
foo 0.00 0.35 0.83 0.84 0.90 0.89
foo 0.35 0.00 0.86 0.85 0.92 0.91
get next 3 rows
In [13]: reader.get_chunk(3)
Out[13]:
foo foo.1 bar bar.1 spam spam.1
bar 0.83 0.86 0.00 0.25 0.88 0.87
bar 0.84 0.85 0.25 0.00 0.82 0.86
spam 0.90 0.92 0.88 0.82 0.00 0.50
Upvotes: 2