Reputation: 55
I have a CSV file containing a time series of daily precipitation. The problem is how the data is organized. Here is a small sample to illustrate:
date p01 p02 p03 p04 p05 p06
01-01-1941 33.6 7.1 22.3 0 0 0
01-02-1941 0 0 1.1 11.3 0 0
So, there is a column for each day of the month (p01 is the precipitation of day 1, p02 corresponds to day 2, and so on). I'd like to have this structure: one column for the date and another for the precipitation values.
date p
01-01-1941 33.6
02-01-1941 7.1
03-01-1941 22.3
04-01-1941 0
05-01-1941 0
06-01-1941 0
01-02-1941 0
02-02-1941 0
03-02-1941 1.1
04-02-1941 11.3
05-02-1941 0
06-02-1941 0
I have found some code examples, but none worked for this specific problem. In general they suggest trying pandas or numpy. Does anyone have a recommendation to solve this issue, or good advice to guide my studies? Thanks. (I'm sorry for my terrible English)
Upvotes: 3
Views: 156
Reputation: 862511
I think you can first use read_csv, then to_datetime with stack for reshaping the DataFrame, then convert the days column with to_timedelta and add it to the date column:
import pandas as pd
import io
temp=u"""date;p01;p02;p03;p04;p05;p06
01-01-1941;33.6;7.1;22.3;0;0;0
01-02-1941;0;0;1.1;11.3;0;0"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print(df)
date p01 p02 p03 p04 p05 p06
0 01-01-1941 33.6 7.1 22.3 0.0 0 0
1 01-02-1941 0.0 0.0 1.1 11.3 0 0
#convert column date to datetime
df.date = pd.to_datetime(df.date, dayfirst=True)
print(df)
date p01 p02 p03 p04 p05 p06
0 1941-01-01 33.6 7.1 22.3 0.0 0 0
1 1941-02-01 0.0 0.0 1.1 11.3 0 0
#stack, rename columns
df1 = df.set_index('date').stack().reset_index(name='p').rename(columns={'level_1':'days'})
print(df1)
date days p
0 1941-01-01 p01 33.6
1 1941-01-01 p02 7.1
2 1941-01-01 p03 22.3
3 1941-01-01 p04 0.0
4 1941-01-01 p05 0.0
5 1941-01-01 p06 0.0
6 1941-02-01 p01 0.0
7 1941-02-01 p02 0.0
8 1941-02-01 p03 1.1
9 1941-02-01 p04 11.3
10 1941-02-01 p05 0.0
11 1941-02-01 p06 0.0
#convert column to timedelta in days
df1.days = pd.to_timedelta(df1.days.str[1:].astype(int) - 1, unit='D')
print(df1.days)
0 0 days
1 1 days
2 2 days
3 3 days
4 4 days
5 5 days
6 0 days
7 1 days
8 2 days
9 3 days
10 4 days
11 5 days
Name: days, dtype: timedelta64[ns]
#add timedelta
df1['date'] = df1['date'] + df1['days']
#remove unnecessary column
df1 = df1.drop('days', axis=1)
print(df1)
date p
0 1941-01-01 33.6
1 1941-01-02 7.1
2 1941-01-03 22.3
3 1941-01-04 0.0
4 1941-01-05 0.0
5 1941-01-06 0.0
6 1941-02-01 0.0
7 1941-02-02 0.0
8 1941-02-03 1.1
9 1941-02-04 11.3
10 1941-02-05 0.0
11 1941-02-06 0.0
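For completeness, the same reshape can be written more compactly with melt (a sketch only, assuming a pandas version where DataFrame.melt is available, 0.20+):

```python
import io
import pandas as pd

temp = u"""date;p01;p02;p03;p04;p05;p06
01-01-1941;33.6;7.1;22.3;0;0;0
01-02-1941;0;0;1.1;11.3;0;0"""

df = pd.read_csv(io.StringIO(temp), sep=";")
df.date = pd.to_datetime(df.date, dayfirst=True)

# melt all p* columns into one long column, then shift each date by the day offset
df1 = df.melt(id_vars='date', var_name='days', value_name='p')
df1['date'] += pd.to_timedelta(df1['days'].str[1:].astype(int) - 1, unit='D')
df1 = df1.drop('days', axis=1).sort_values('date').reset_index(drop=True)
print(df1)
```

This produces the same two-column result as the stack approach above.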
Upvotes: 2
Reputation: 1896
Well, I got the answer, but not with just one command or any magic function. So here is how I got it. You can optimise this code further. Hope this helps!
import pandas as pd
from datetime import timedelta
df = pd.read_csv('myfile.csv')
df[u'date'] = pd.to_datetime(df[u'date'])
p1 = df[[u'date', u'p01']].copy()
p2 = df[[u'date', u'p02']].copy()
p3 = df[[u'date', u'p03']].copy()
p4 = df[[u'date', u'p04']].copy()
p5 = df[[u'date', u'p05']].copy()
p6 = df[[u'date', u'p06']].copy()
# renaming cols of p1..p6
p1.columns = ['date','val']
p2.columns = ['date','val']
p3.columns = ['date','val']
p4.columns = ['date','val']
p5.columns = ['date','val']
p6.columns = ['date','val']
p1['col'] = 'p01'
p2['col'] = 'p02'
p3['col'] = 'p03'
p4['col'] = 'p04'
p5['col'] = 'p05'
p6['col'] = 'p06'
main = pd.concat([p1,p2,p3,p4,p5,p6])
main['days2add'] = main['col'].apply(lambda x: int(x.strip('p')) - 1)
ff = lambda row: row[u'date'] + timedelta(row[u'days2add'])
main['new_date'] = main.apply(ff, axis=1)
my csv file content:
In [210]: df
Out[210]:
date p01 p02 p03 p04 p05 p06
0 1941-01-01 33.6 7.1 22.3 0 0 0
my output content:
In [209]: main[['new_date', u'val']]
Out[209]:
new_date val
0 1941-01-01 33.6
0 1941-01-02 7.1
0 1941-01-03 22.3
0 1941-01-04 0.0
0 1941-01-05 0.0
0 1941-01-06 0.0
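The per-column copies above can also be built in a loop (a sketch only, assuming the columns are named p01, p02, ...), which avoids repeating the same three lines for every column:

```python
import pandas as pd
from datetime import timedelta

# small example frame standing in for the CSV
df = pd.DataFrame({'date': pd.to_datetime(['1941-01-01']),
                   'p01': [33.6], 'p02': [7.1], 'p03': [22.3],
                   'p04': [0.0], 'p05': [0.0], 'p06': [0.0]})

parts = []
for col in [c for c in df.columns if c.startswith('p')]:
    part = df[['date', col]].copy()
    part.columns = ['date', 'val']
    part['col'] = col
    parts.append(part)

main = pd.concat(parts)
main['days2add'] = main['col'].apply(lambda x: int(x.strip('p')) - 1)
main['new_date'] = main.apply(
    lambda row: row['date'] + timedelta(days=int(row['days2add'])), axis=1)
print(main[['new_date', 'val']])
```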
Upvotes: 0
Reputation: 13465
EDIT: Sorry, the name of the question was a bit misleading. For the example output you gave (collapsing all p columns into a single one) you can do this:
# Opening the example file you gave
fid = open('csv.txt','r')
lines = fid.readlines()
fid.close()
fid = open('output2.txt','w')
fid.write('%15s %15s\n'%(lines[0].split()[0],'p'))
for i in range(1,len(lines)):
    iline = lines[i].split()
    for j in range(1,len(iline)):
        fid.write('%15s %15s\n'%(iline[0],iline[j]))
fid.close()
which results in this:
date p
01-01-1941 33.6
01-01-1941 7.1
01-01-1941 22.3
01-01-1941 0
01-01-1941 0
01-01-1941 0
01-02-1941 0
01-02-1941 0
01-02-1941 1.1
01-02-1941 11.3
01-02-1941 0
01-02-1941 0
ORIGINAL POST: Might be relevant to someone.
There are indeed many ways to do this. But since you have no special preference (and if the file is not enormous), you may just want to use native Python.
def rows2columns(lines):
    ilines = []
    for i in lines:
        ilines.append(i.split())
    new = []
    for j in range(len(ilines[0])):
        local = []
        for i in range(len(ilines)):
            local.append(ilines[i][j])
        new.append(local)
    return new
def writefile(new,path='output.txt'):
    fid = open(path,'w')
    for i in range(len(new)):
        for j in range(len(new[0])):
            fid.write('%15s'%new[i][j])
        fid.write('\n')
    fid.close()
# Opening the example file you gave
fid = open('csv.txt','r')
lines = fid.readlines()
fid.close()
# Putting the list of lines to be reversed
new = rows2columns(lines)
# Writing the result to a file
writefile(new,path='output.txt')
The output file is this:
date 01-01-1941 01-02-1941
p01 33.6 0
p02 7.1 0
p03 22.3 1.1
p04 0 11.3
p05 0 0
p06 0 0
This is probably the simplest (or close to it) native Python recipe for this. The csv module, numpy, and pandas offer other features you might want to take advantage of. This one in particular needs no imports.
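As a side note, the row/column swap done by rows2columns can also be written with the built-in zip (a sketch under the same assumptions, still with no imports):

```python
def rows2columns(lines):
    # zip(*...) transposes a list of rows into a list of columns
    rows = [line.split() for line in lines]
    return [list(col) for col in zip(*rows)]

lines = ['date p01 p02\n', '01-01-1941 33.6 7.1\n']
print(rows2columns(lines))
# -> [['date', '01-01-1941'], ['p01', '33.6'], ['p02', '7.1']]
```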
Upvotes: 0