Reputation: 73
I have hundreds of text files like these, with each column separated by three spaces. The data is for a year: 12 months and 31 days for each month.
Below, I'm only showing below what's relevant to question:
001 DIST - ADILABAD ANDHRA MEAN TEMP
DATE JAN FEB MAR . . . . NOV DEC
01 21.5 24.3 27.1 25.8 22.4
02 21.4 24.2 27.1 25.8 22.4
. . . . . .
. . . . . .
. . . . . .
27 23.6 26.8 30.3 23.1 21.3
28 23.8 27.0 30.6 22.9 21.3
29 23.4 31.0 22.9 21.2
30 23.5 31.1 22.6 21.4
31 23.8 31.2 . . . . 21.6
I want to read each column into an array and then average it.
For this I'm using the genfromtext()
function like this:
import numpy as np
JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC = np.genfromtxt("tempmean_andhra_adilabad.txt", skiprows=3,
unpack=True, invalid_raise=False,
usecols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
autostrip=True)
As you can see I've skipped the first three rows and the first column and unpacked each column in an array. Without invalid_raise=False
, I was getting the following error:
Traceback (most recent call last):
File "pyshell#32", line 1, in 'module'
JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC = np.genfromtxt("temp mean_andhra_adilabad.txt",skiprows=3,unpack=True,usecols=(1,2,3,4,5,6,7,8,9,10,11,12),autostrip=True)
File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 1667, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #32 (got 12 columns instead of 12)
Line #33 (got 12 columns instead of 12)
Line #34 (got 8 columns instead of 12)
I think this problem is because columns have different length? Or some other reason?
I wanted to see the output so I used invalid_raise=False
. Now my problem is that when I'm printing any of the array, like JAN
I'm only getting 28 elements. i.e. Every array has only 28 elements. It seems that only 28 rows are read for each column as FEB
column ends with 28 days. But I need the data for each month i.e. 31 elements for JAN
30 for JUNE
etc.
How do I get all elements for each month?
I think it's a very basic question but I'm very new to Python and NumPy
and began learning just two weeks back. I've searched a lot of questions on StackOverflow and Google and learned about how to skip rows, columns etc. But I could not find any answer relating to this particular question.
Please suggest some module, function, code etc.
Thanks in advance.
Upvotes: 7
Views: 5806
Reputation: 394041
UPDATE
Thanks to Warren Weckesser for pointing out that you can pass the width values which will handle this file properly
OK, pandas reads fixed width files fine:
In [192]:
df = pd.read_fwf(r'c:\data\temp mean_andhra_adilabad.txt',skiprows=2, widths=(5,)+(7,)*12, skip_footer=1)
df
Out[192]:
DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV \
0 1 21.5 24.3 27.1 31.3 34.1 34.5 29.0 27.5 27.1 28.0 25.8
1 2 21.4 24.2 27.1 31.4 33.8 34.1 28.8 27.5 27.1 28.0 25.8
2 3 21.2 24.3 27.1 31.5 34.4 34.1 28.6 27.5 27.0 28.0 25.6
3 4 21.2 24.4 27.1 31.7 34.4 33.8 28.5 27.1 27.0 27.9 25.5
4 5 21.4 24.6 27.6 31.7 34.4 33.5 28.2 27.0 27.1 27.8 25.4
5 6 21.7 24.4 28.0 31.6 34.5 33.3 28.2 27.1 27.0 28.0 25.1
6 7 21.8 24.1 28.1 31.5 34.5 32.9 28.2 27.1 27.0 27.8 25.3
7 8 22.0 24.4 28.3 31.8 34.6 33.3 27.9 26.7 27.1 27.9 25.1
8 9 22.0 24.5 28.3 32.2 34.6 33.1 27.8 26.6 27.2 28.1 24.8
9 10 22.3 24.6 28.4 32.1 34.5 32.5 28.0 26.7 27.2 27.9 25.0
10 11 22.3 24.9 28.6 32.3 34.4 32.2 27.8 26.9 27.2 28.0 25.2
11 12 22.3 25.0 28.3 32.6 34.4 32.0 27.6 27.1 27.3 27.9 24.9
12 13 22.5 25.1 28.6 32.7 34.5 31.4 27.8 27.1 27.5 27.8 24.8
13 14 22.5 25.6 28.7 33.1 34.7 31.2 27.7 26.8 27.6 27.7 24.6
14 15 22.5 25.7 29.1 33.2 34.6 31.0 27.8 27.0 27.9 27.6 24.6
15 16 22.5 25.7 29.4 33.1 34.4 30.6 27.7 26.9 28.0 27.6 24.5
16 17 22.5 25.8 29.5 32.8 34.6 30.1 27.8 26.8 28.1 27.2 24.3
17 18 22.6 26.0 29.9 33.0 34.8 30.1 27.6 27.0 28.2 27.3 24.0
18 19 22.8 25.9 30.2 33.3 34.7 30.0 27.9 27.0 28.1 27.2 24.0
19 20 23.1 25.9 30.2 33.3 35.1 30.2 27.9 27.0 27.9 27.2 24.0
20 21 23.1 25.8 30.2 33.5 34.9 30.1 27.8 26.9 28.0 26.9 23.8
21 22 22.8 25.8 30.6 33.4 35.1 29.8 27.8 26.8 28.2 26.7 23.5
22 23 22.9 25.8 30.6 33.4 35.1 29.6 27.8 26.8 28.2 26.7 23.5
23 24 23.1 26.2 30.4 33.5 35.1 29.3 27.8 27.0 28.1 26.5 23.5
24 25 23.4 26.5 30.2 33.5 35.1 29.2 27.6 27.3 28.1 26.5 23.3
25 26 23.5 26.7 30.3 33.6 35.0 29.1 27.6 27.4 28.2 26.4 23.0
26 27 23.6 26.8 30.3 33.8 35.1 28.8 27.6 27.1 28.2 26.2 23.1
27 28 23.8 27.0 30.6 34.1 34.9 28.5 27.6 26.8 28.2 26.0 22.9
28 29 23.4 NaN 31.0 34.3 34.8 28.5 27.4 27.0 28.1 25.8 22.9
29 30 23.5 NaN 31.1 34.5 34.6 29.1 27.4 27.0 28.1 25.7 22.6
30 31 23.8 NaN 31.2 NaN 34.7 NaN 27.4 27.0 NaN 25.7 NaN
DEC
0 22.4
1 22.4
2 22.5
3 22.5
4 22.6
5 22.3
6 22.0
7 22.0
8 21.8
9 21.7
10 21.9
11 21.9
12 21.8
13 21.5
14 21.5
15 21.5
16 21.7
17 21.6
18 21.7
19 21.7
20 21.8
21 21.7
22 21.8
23 21.8
24 21.7
25 21.6
26 21.3
27 21.3
28 21.2
29 21.4
30 21.6
In [193]:
df.mean(axis=0)
Out[193]:
DATE 16.000000
JAN 22.548387
FEB 25.357143
MAR 29.229032
APR 32.793333
MAY 34.658065
JUN 31.196667
JUL 27.890323
AUG 27.016129
SEP 27.666667
OCT 27.225806
NOV 24.346667
DEC 21.812903
dtype: float64
Upvotes: 4
Reputation: 114811
You data is not "delimited" by text. Instead it has fixed-width columns. As @EdChum shows in his answer, pandas has a function for reading data with fixed-width columns. You can also use genfromtxt
by giving the column widths in the delimiter
argument. It looks like the field widths are (4, 7, 7, 7, ...). In the code below, I'll write this as (4,) + (7,)*12
:
In [27]: (4,) + (7,)*12
Out[27]: (4, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)
The default data type used by genfromtxt
is np.float64
. If a field can't be converted to a float, it will be replaced with nan
. So the data at the end of the months with fewer than 31 days will be nan
.
In the following, I renamed your file to "temp_mean.txt". Note that your file has an extra blank line at the end, so the argument skip_footer=1
is also used. If you don't use this argument, you'll get an extra row of nan
values in data
.
In [16]: data = genfromtxt("temp_mean.txt", skiprows=3, delimiter=(4,)+(7,)*12, usecols=range(1,13), skip_footer=1)
In [17]: data.shape
Out[17]: (31, 12)
In [18]: data[:,0] # JAN
Out[18]:
array([ 21.5, 21.4, 21.2, 21.2, 21.4, 21.7, 21.8, 22. , 22. ,
22.3, 22.3, 22.3, 22.5, 22.5, 22.5, 22.5, 22.5, 22.6,
22.8, 23.1, 23.1, 22.8, 22.9, 23.1, 23.4, 23.5, 23.6,
23.8, 23.4, 23.5, 23.8])
In [19]: data[:,1] # FEB
Out[19]:
array([ 24.3, 24.2, 24.3, 24.4, 24.6, 24.4, 24.1, 24.4, 24.5,
24.6, 24.9, 25. , 25.1, 25.6, 25.7, 25.7, 25.8, 26. ,
25.9, 25.9, 25.8, 25.8, 25.8, 26.2, 26.5, 26.7, 26.8,
27. , nan, nan, nan])
In [20]: data[-1,:] # Last row.
Out[20]:
array([ 23.8, nan, 31.2, nan, 34.7, nan, 27.4, 27. , nan,
25.7, nan, 21.6])
To get the monthly means, you can use np.nanmean
:
In [21]: np.nanmean(data, axis=0)
Out[21]:
array([ 22.5483871 , 25.35714286, 29.22903226, 32.79333333,
34.65806452, 31.19666667, 27.89032258, 27.01612903,
27.66666667, 27.22580645, 24.34666667, 21.81290323])
Upvotes: 5