Reputation: 1584
Still I am pretty new to python, would need help in this :
The data I have is in csv format like this :
Month YEAR AZ-Phoenix CA-Los Angeles CA-San Diego CA-San Francisco CO-Denver DC-Washington January 1987 59.33 54.67 46.61 50.20 February 1987 59.65 54.89 46.87 49.96 64.77
And this needs to be merged and displayed in column 2 and 3 by incrementing column 1 n.. times.
Output should be :
Month YEAR January 1987 AZ-Phoenix January 1987 CA-Los Angeles 59.33 January 1987 CA-San Diego 54.67 January 1987 CA-San Francisco 46.61 January 1987 CO-Denver 50.20
How can this be achieved in csv reader?
Upvotes: 4
Views: 872
Reputation: 294218
option 1
use pd.melt
pd.melt(df, 'YEAR')
YEAR variable value
0 January 1987 AZ-Phoenix 59.33
1 February 1987 AZ-Phoenix 59.65
2 January 1987 CA-Los Angeles 54.67
3 February 1987 CA-Los Angeles 54.89
4 January 1987 CA-San Diego 46.61
5 February 1987 CA-San Diego 46.87
6 January 1987 CA-San Francisco 50.20
7 February 1987 CA-San Francisco 49.96
8 January 1987 CO-Denver NaN
9 February 1987 CO-Denver 64.77
10 January 1987 DC-Washington NaN
11 February 1987 DC-Washington NaN
option 2
reconstruct with numpy
tools
pd.DataFrame(dict(
YEAR=df.YEAR.values.repeat(len(df.columns) - 1),
B=df.drop('YEAR', 1).values.ravel(),
A=np.tile(df.columns.difference(['YEAR']).values, len(df)),
))[['YEAR', 'A', 'B']]
YEAR variable value
0 January 1987 AZ-Phoenix 59.33
1 February 1987 AZ-Phoenix 59.65
2 January 1987 CA-Los Angeles 54.67
3 February 1987 CA-Los Angeles 54.89
4 January 1987 CA-San Diego 46.61
5 February 1987 CA-San Diego 46.87
6 January 1987 CA-San Francisco 50.20
7 February 1987 CA-San Francisco 49.96
8 January 1987 CO-Denver NaN
9 February 1987 CO-Denver 64.77
10 January 1987 DC-Washington NaN
11 February 1987 DC-Washington NaN
setup
df = pd.read_csv(sep='\s{2,}', engine='python')
Upvotes: 2
Reputation: 862511
Use read_csv
with separator tab
- \t
or if separator is 2 and more whitespaces
use piRSquared's
solution:
import pandas as pd
df = pd.read_csv(sep='\t')
I think you need:
df = df.set_index('YEAR').stack(dropna=False).reset_index()
df.columns = ['YEAR','A','B']
print (df)
YEAR A B
0 January 1987 AZ-Phoenix 59.33
1 January 1987 CA-Los Angeles 54.67
2 January 1987 CA-San 46.61
3 January 1987 Diego 50.20
4 January 1987 CA-San Francisco NaN
5 January 1987 CO-Denver NaN
6 January 1987 DC-Washington NaN
7 February 1987 AZ-Phoenix 59.65
8 February 1987 CA-Los Angeles 54.89
9 February 1987 CA-San 46.87
10 February 1987 Diego 49.96
11 February 1987 CA-San Francisco 64.77
12 February 1987 CO-Denver NaN
13 February 1987 DC-Washington NaN
#if need remove rows with NaN
df = df.set_index('YEAR').stack().reset_index()
df.columns = ['YEAR','A','B']
print (df)
YEAR A B
0 January 1987 AZ-Phoenix 59.33
1 January 1987 CA-Los Angeles 54.67
2 January 1987 CA-San 46.61
3 January 1987 Diego 50.20
4 February 1987 AZ-Phoenix 59.65
5 February 1987 CA-Los Angeles 54.89
6 February 1987 CA-San 46.87
7 February 1987 Diego 49.96
8 February 1987 CA-San Francisco 64.77
Another solution with melt
:
df = pd.melt(df, id_vars='YEAR', value_name='B', var_name='A')
print (df)
YEAR A B
0 January 1987 AZ-Phoenix 59.33
1 February 1987 AZ-Phoenix 59.65
2 January 1987 CA-Los Angeles 54.67
3 February 1987 CA-Los Angeles 54.89
4 January 1987 CA-San 46.61
5 February 1987 CA-San 46.87
6 January 1987 Diego 50.20
7 February 1987 Diego 49.96
8 January 1987 CA-San Francisco NaN
9 February 1987 CA-San Francisco 64.77
10 January 1987 CO-Denver NaN
11 February 1987 CO-Denver NaN
12 January 1987 DC-Washington NaN
13 February 1987 DC-Washington NaN
#if need remove rows with NaN
df = pd.melt(df, id_vars='YEAR', value_name='B', var_name='A').dropna(subset=['B'])
print (df)
YEAR A B
0 January 1987 AZ-Phoenix 59.33
1 February 1987 AZ-Phoenix 59.65
2 January 1987 CA-Los Angeles 54.67
3 February 1987 CA-Los Angeles 54.89
4 January 1987 CA-San 46.61
5 February 1987 CA-San 46.87
6 January 1987 Diego 50.20
7 February 1987 Diego 49.96
9 February 1987 CA-San Francisco 64.77
Upvotes: 2