João dos Reis
João dos Reis

Reputation: 55

Row into column in Python

I have a CSV file containing a time serie of daily precipitation. The problem arises of how the data is organized. Here a small sample to ilustrate:

date        p01 p02 p03 p04 p05 p06
01-01-1941  33.6 7.1 22.3 0 0 0
01-02-1941  0 0 1.1 11.3 0 0

So, there is a column to each day of the month (p01 is the precipitation of the day 1, p02 corresponds to the day 2, and so on ). I'd like to have this structure: one column to date and another to precipitation values.

date        p
01-01-1941  33.6
02-01-1941  7.1
03-01-1941  22.3
04-01-1941  0
05-01-1941  0
06-01-1941  0
01-02-1941  0
02-02-1941  0
03-02-1941  1.1
04-02-1941  11.3
05-02-1941  0
06-02-1941  0

I have found some code examples, but unsuccessfully for this specific problem. In general they suggest to try using pandas, numpy. Does anyone have a recommendation to solve this issue or a good advice to guide my studies? Thanks. (I'm sorry for my terrible English)

Upvotes: 3

Views: 156

Answers (3)

jezrael
jezrael

Reputation: 862511

I think you can first use read_csv, then to_datetime with stack for reshaping DataFrame, then convert column days to_timedelta and add it to column date:

import pandas as pd
import io

temp=u"""date;p01;p02;p03;p04;p05;p06
01-01-1941;33.6;7.1;22.3;0;0;0
01-02-1941;0;0;1.1;11.3;0;0"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print df
         date   p01  p02   p03   p04  p05  p06
0  01-01-1941  33.6  7.1  22.3   0.0    0    0
1  01-02-1941   0.0  0.0   1.1  11.3    0    0
#convert column date to datetime
df.date = pd.to_datetime(df.date, dayfirst=True)
print df
        date   p01  p02   p03   p04  p05  p06
0 1941-01-01  33.6  7.1  22.3   0.0    0    0
1 1941-02-01   0.0  0.0   1.1  11.3    0    0

#stack, rename columns
df1 = df.set_index('date').stack().reset_index(name='p').rename(columns={'level_1':'days'})
print df1
         date days     p
0  1941-01-01  p01  33.6
1  1941-01-01  p02   7.1
2  1941-01-01  p03  22.3
3  1941-01-01  p04   0.0
4  1941-01-01  p05   0.0
5  1941-01-01  p06   0.0
6  1941-02-01  p01   0.0
7  1941-02-01  p02   0.0
8  1941-02-01  p03   1.1
9  1941-02-01  p04  11.3
10 1941-02-01  p05   0.0
11 1941-02-01  p06   0.0
#convert column to timedelta in days
df1.days = pd.to_timedelta(df1.days.str[1:].astype(int) - 1, unit='D')
print df1.days
0    0 days
1    1 days
2    2 days
3    3 days
4    4 days
5    5 days
6    0 days
7    1 days
8    2 days
9    3 days
10   4 days
11   5 days
Name: days, dtype: timedelta64[ns]

#add timedelta
df1['date'] = df1['date'] + df1['days']
#remove unnecessary column
df1 = df1.drop('days', axis=1)
print df1
         date     p
0  1941-01-01  33.6
1  1941-01-02   7.1
2  1941-01-03  22.3
3  1941-01-04   0.0
4  1941-01-05   0.0
5  1941-01-06   0.0
6  1941-02-01   0.0
7  1941-02-02   0.0
8  1941-02-03   1.1
9  1941-02-04  11.3
10 1941-02-05   0.0
11 1941-02-06   0.0

Upvotes: 2

sam
sam

Reputation: 1896

Well I got the answer but it did not with just one command or any magic function. So here is how I got the answer. You can optimise this code further. Hope this helps!

import pandas as pd

from datetime import timedelta


df = pd.read_csv('myfile.csv')

df[u'date'] = pd.to_datetime(df[u'date'])



p1 = df[[u'date', u'p01']].copy()
p2 = df[[u'date', u'p02']].copy()
p3 = df[[u'date', u'p03']].copy()
p4 = df[[u'date', u'p04']].copy()
p5 = df[[u'date', u'p05']].copy()

# renaming cols -p1,p2,p3,p4
p1.columns = ['date','val']
p2.columns = ['date','val']
p3.columns = ['date','val']
p4.columns = ['date','val']
p5.columns = ['date','val']

p1['col'] = 'p01'
p2['col'] = 'p02'
p3['col'] = 'p03'
p4['col'] = 'p04'
p5['col'] = 'p05'


main = pd.concat([p1,p2,p3,p4,p5])


main['days2add'] = main['col'].apply(lambda x: int(x.strip('p')) -1 )

ff = lambda row : row[u'date'] + timedelta(row[u'days2add'])

main['new_date'] = main.apply(ff, axis=1)

In [209]: main[['new_date', u'val']]
Out[209]:
    new_date   val
0 1941-01-01  33.6
0 1941-01-02   7.1
0 1941-01-03  22.3
0 1941-01-04   0.0
0 1941-01-05   0.0

my csv file content:

In [210]: df
Out[210]:
        date   p01  p02   p03  p04  p05  p06
0 1941-01-01  33.6  7.1  22.3    0    0    0

my output content:

In [209]: main[['new_date', u'val']]
Out[209]:
    new_date   val
0 1941-01-01  33.6
0 1941-01-02   7.1
0 1941-01-03  22.3
0 1941-01-04   0.0
0 1941-01-05   0.0

Upvotes: 0

armatita
armatita

Reputation: 13465

EDIT: Sorry the name of the question was a bit misleading. For the example output you gave (collapsing all p into a single column) you can do this:

# Opening the example file you gave
fid = open('csv.txt','r')
lines = fid.readlines()
fid.close()

fid = open('output2.txt','w')
fid.write('%15s %15s\n'%(lines[0].split()[0],'p'))
for i in range(1,len(lines)):
    iline = lines[i].split()
    for j in range(1,len(iline)):
        fid.write('%15s %15s\n'%(iline[0],iline[j]))
fid.close()

, which results in this:

       date               p
 01-01-1941            33.6
 01-01-1941             7.1
 01-01-1941            22.3
 01-01-1941               0
 01-01-1941               0
 01-01-1941               0
 01-02-1941               0
 01-02-1941               0
 01-02-1941             1.1
 01-02-1941            11.3
 01-02-1941               0
 01-02-1941               0

ORIGINAL POST: Might be relevant to someone.

There are indeed many ways to do this. But considering you have no special preference (and if the file is not enormous) you may just want to use native Python.

def rows2columns(lines):
    ilines = []
    for i in lines:
        ilines.append(i.split())
    new = []
    for j in range(len(ilines[0])):
        local = []
        for i in range(len(ilines)):
            local.append(ilines[i][j])
        new.append(local)
    return new

def writefile(new,path='output.txt'):
    fid = open(path,'w')
    for i in range(len(new)):
        for j in range(len(new[0])):
            fid.write('%15s'%new[i][j])
        fid.write('\n')
    fid.close()

# Opening the example file you gave
fid = open('csv.txt','r')
lines = fid.readlines()
fid.close()

# Putting the list of lines to be reversed
new = rows2columns(lines)
# Writing the result to a file
writefile(new,path='output.txt')

, the output file is this:

       date     01-01-1941     01-02-1941
        p01           33.6              0
        p02            7.1              0
        p03           22.3            1.1
        p04              0           11.3
        p05              0              0
        p06              0              0

This is probably the most simple (or close) native python recipe you may have. Other functionalities from csv module, or numpy, or pandas will have other features you might want to take advantage. This one in particular does not need imports.

Upvotes: 0

Related Questions