Reputation: 4557
Using Python 3.6, I have the results of a text reader that gives me repeating lines like this:
df
Col 1
0 Text A1
1 Text B1
2 Text C1
3 Text D1
4 Text E1
5 Text A2
6 Text B2
7 Text C2
8 Text D2
9 Text E2
10 Text A3
11 Text B3
12 Text C3
13 Text D3
14 Text E3
- * Added edit: Some of the above texts are blanks. There are no commas that I can do a str.split() and I'm not sure reshaping is the right way to go. The information repeats every 5 entries and I'm trying to separate them into columns so that it looks like:
Col1 Col2 Col3 Col4 Col5
0 Text A1 Text B1 Text C1 Text D1 Text E1
1 Text A2 Text B2 Text C2 Text D2 Text E2
2 Text A3 Text B3 Text C3 Text D3 Text E3
What is the pythonic way to reshape or split into 5 columns not relying on punctuation from the text?
Upvotes: 2
Views: 117
Reputation: 79388
What if some of the repeat was not actually 5? ie some repeated 4 while others 5 etc?? What if you do not know the number of repeats?: You can just fix the data and then reshape into wide
a = df.assign(ID = df['Col 1'].replace('\\D','',regex=True),
col = df['Col 1'].replace('.*(.)\\d','col \\1',regex=True))
pd.crosstab(a.ID,a.col,a['Col 1'],aggfunc=lambda x:x) #Using an identity function
col col A col B col C col D col E
ID
1 Text A1 Text B1 Text C1 Text D1 Text E1
2 Text A2 Text B2 Text C2 Text D2 Text E2
3 Text A3 Text B3 Text C3 Text D3 Text E3
Upvotes: 1
Reputation: 403278
Group by index mod 5 and concat
horizontally. The last step is just fixing the column names with rename
.
df = pd.concat([g.reset_index(drop=True)
for _, g in df.groupby(df.index % 5)], axis=1)
df.columns = [f'Col {i}' for i in range(1, 6)]
Col 1 Col 2 Col 3 Col 4 Col 5
0 Text A1 Text B1 Text C1 Text D1 Text E1
1 Text A2 Text B2 Text C2 Text D2 Text E2
2 Text A3 Text B3 Text C3 Text D3 Text E3
Upvotes: 4
Reputation: 59304
IIUC, can groupby
the arange
and just use pd.DataFrame
default constructor
n = df.shape[0] // 5
pd.DataFrame(df.groupby(np.repeat(np.arange(n), 5)).col.apply(lambda s: s.tolist()).tolist())
0 1 2 3 4
0 Text A1 Text B1 Text C1 Text D1 Text E1
1 Text A2 Text B2 Text C2 Text D2 Text E2
2 Text A3 Text B3 Text C3 Text D3 Text E3
Upvotes: 4
Reputation: 353604
TBH, if you know they repeat every 5, I would reshape:
In [36]: pd.DataFrame(df.values.reshape(-1, 5), columns=[f"Col {i}" for i in range(1,6)])
Out[36]:
Col 1 Col 2 Col 3 Col 4 Col 5
0 Text A1 Text B1 Text C1 Text D1 Text E1
1 Text A2 Text B2 Text C2 Text D2 Text E2
2 Text A3 Text B3 Text C3 Text D3 Text E3
Personally though I'm wary of missing values, and so I'd probably groupby on some function of the strings, e.g.
pd.concat([v.reset_index(drop=True)
for _, v in df.groupby(df["Col 1"].str.rstrip(string.digits))], axis=1)
or something.
Upvotes: 5
Reputation: 323396
By using pd.crosstab
pd.crosstab(index=np.arange(len(df))//5,columns = np.arange(len(df))%5,values=df['Col 1'],aggfunc='sum')
Out[382]:
col_0 0 1 2 3 4
row_0
0 Text A1 Text B1 Text C1 Text D1 Text E1
1 Text A2 Text B2 Text C2 Text D2 Text E2
2 Text A3 Text B3 Text C3 Text D3 Text E3
Upvotes: 4