Reputation: 1073
I have a file that is formatted like this:
(Year - Month - Day - Data)
1980 - 1 - 1 - 1.2
1980 - 1 - 2 - 1.3
1980 - 1 - 3 - 1.4
1980 - 1 - 4 - 1.5
1980 - 1 - 5 - 1.6
1980 - 1 - 6 - 1.7
1980 - 1 - 7 - 1.8
It is in a numpy array. The data covers about 24 years, so what I want to do is take the average for each day of the year and put it into a separate 1D array of 366 averages (366 to allow for leap years). I could then plot that array using matplotlib and see the trend over the course of a year. Is there any way to use subsetting in a loop to accomplish this?
Upvotes: 2
Views: 5614
Reputation: 21643
For anyone coming to this question hoping to find an alternative way of processing unusual input, here is some code.
In essence, the code reads the input file one line at a time, picks out the date elements and the value, reassembles them into lines that pandas can readily parse, and writes those lines to a StringIO object.
Pandas then reads them from there, as if from a csv file. I have cribbed the grouping code from PiRSquared.
import pandas as pd
import re
from io import StringIO
file_name = 'temp.txt'
for_pd = StringIO()
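with open(file_name) as f:
    for line in f:
        # Pick out year, month, day and value; skip lines (such as a header) that do not match.
        match = re.search(r'([0-9]{4}) - ([0-9]{1,2}) - ([0-9]{1,2}) - ([0-9.]+)', line)
        if match is None:
            continue
        pieces = match.groups()
        pieces = [int(p) for p in pieces[:3]] + [pieces[3]]
        # Rewrite as an ISO date plus value, e.g. 1980-01-01,1.2
        print('%.4i-%.2i-%.2i,%s' % tuple(pieces), file=for_pd)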
for_pd.seek(0)
df = pd.read_csv(for_pd, header=None, names=['datetimes', 'values'], parse_dates=['datetimes'])
# pd.TimeGrouper has been removed from current pandas; pd.Grouper(freq=...) replaces it.
print(df.set_index('datetimes').groupby(pd.Grouper(freq='D')).mean().dropna())
print(df.set_index('datetimes').groupby(pd.Grouper(freq='W')).mean().dropna())
This is the output.
            values
datetimes
1980-01-01     1.2
1980-01-02     1.3
1980-01-03     1.4
1980-01-04     1.5
1980-01-05     1.6
1980-01-06     1.7
1980-01-07     1.8

            values
datetimes
1980-01-06    1.45
1980-01-13    1.80
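The daily grouping above yields one mean per date. For the question's goal of one average per calendar day across all years, a possible variation is to group on month and day instead. The following is only a minimal sketch: the sample values are made up for illustration, and it assumes a frame indexed by 'datetimes' with a 'values' column, as in the code above.

```python
import pandas as pd

# Hypothetical sample: the same three calendar days in two different years.
df = pd.DataFrame(
    {
        "datetimes": pd.to_datetime(
            ["1980-01-01", "1980-01-02", "1980-01-03",
             "1981-01-01", "1981-01-02", "1981-01-03"]
        ),
        "values": [1.2, 1.3, 1.4, 2.2, 2.3, 2.4],
    }
).set_index("datetimes")

# Group by (month, day) so that the same calendar day of every year
# falls into one group; Feb 29 would simply form its own, smaller group.
month = df.index.month.rename("month")
day = df.index.day.rename("day")
by_calendar_day = df.groupby([month, day])["values"].mean()

print(by_calendar_day)
```

With a full multi-year data set that includes at least one leap year, this should give 366 rows, one per calendar day.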
Upvotes: 1
Reputation: 1200
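Using pandas is definitely the way to go. There are at least two ways to group by day of the year: by the numeric day of the year as a string ("%j"), or by the month-day string ("%m%d"). Note that "%j" numbers shift by one after February in leap years, so the month-day key keeps calendar dates aligned across years. Both are shown here: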
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range('2000-01-01', '2010-12-31'))
df['vals'] = np.random.randint(1, 6, df.shape[0])
print(df.groupby(df.index.strftime("%j")).mean())
print(df.groupby(df.index.strftime("%m%d")).mean())
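If the data has to stay in a plain numpy array, as in the question, the same idea can be sketched with boolean-mask subsetting in a loop. This is an illustrative sketch only: it assumes the array columns are year, month, day and value, and the sample rows are made up.

```python
import numpy as np

# Hypothetical stand-in for the question's array:
# columns are year, month, day, value.
data = np.array([
    [1980, 1, 1, 1.2],
    [1980, 1, 2, 1.3],
    [1980, 2, 29, 5.0],
    [1981, 1, 1, 2.2],
    [1981, 1, 2, 2.3],
])

# Encode each calendar day as month * 100 + day (e.g. 229 for Feb 29).
keys = (data[:, 1] * 100 + data[:, 2]).astype(int)

# np.unique returns the sorted distinct keys; for each one, a boolean
# mask subsets the rows that belong to that calendar day.
days = np.unique(keys)
averages = np.array([data[keys == k, 3].mean() for k in days])

print(days)
print(averages)
```

On the full 24-year array, averages would hold one entry per calendar day (366 if a leap year is present) and could be passed straight to matplotlib's plt.plot.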
Upvotes: 4