Reputation: 13
I need to splits items in a row to columns on the occurence of '^' character. This needs to done through pandas without a loop preferably
I have
0 A^
3 206-1C
4 502-2B
5 506-0.5C
6 604-1B
7 907-2B
8 G.ELITE^
9 A201-1C
10 A202-1B
I want
A^ | G.ELITE^ |
---|---|
206-1C | A201-1C. |
502-2B. | A202-1B |
506-0.5C | |
604-1B | |
907-2B. |
test=rd.query('texts.str.contains("\^")', engine='python')
Upvotes: 0
Views: 80
Reputation: 2152
import pandas as pd
import pandas as pd
import janitor
import numpy as np
# Sample DataFrame
df = pd.DataFrame({'col': ['A^', '3', '206-1C', '502-2B', '506-0.5C', '604-1B', '907-2B', 'G.ELITE^', 'A201-1C', 'A202-1B']})
"""print(df)
col
0 A^
1 3
2 206-1C
3 502-2B
4 506-0.5C
5 604-1B
6 907-2B
7 G.ELITE^
8 A201-1C
9 A202-1B"""
sep='^'
fill_value = np.nan
end_with_sep = df.iloc[:,0].str.endswith(sep)
"""print(end_with_sep)
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False"""
values_ending_with_caret = df[end_with_sep].iloc[:, 0].tolist()
"""print(values_ending_with_caret )
['A^', 'G.ELITE^']"""
#split_points = end_with_sep.cumsum()[end_with_sep].index.tolist() + [len(df)]
split_points = list(end_with_sep.index[np.where(end_with_sep)[0]]) + [len(df)]
print(split_points)#[0, 7, 10]
split_points1 = split_points[:-1]
print(split_points1)#[0, 7]
split_points2 = split_points[1:]
print(split_points2)#[7, 10]
dfs = []
for i,j in zip(split_points1,split_points2):
# start from i+1 to exclude the header row
aa = df.iloc[i +1:j].reset_index(drop=True)
"""print(aa)
col
0 3
1 206-1C
2 502-2B
3 506-0.5C
4 604-1B
5 907-2B"""
dfs.append(aa)
#print(dfs)
res = pd.concat(dfs,axis=1)
"""print(res)
col col
0 3 A201-1C
1 206-1C A202-1B
2 502-2B NaN
3 506-0.5C NaN
4 604-1B NaN
5 907-2B NaN"""
res.columns = values_ending_with_caret
"""print(res)
A^ G.ELITE^
0 3 A201-1C
1 206-1C A202-1B
2 502-2B NaN
3 506-0.5C NaN
4 604-1B NaN
5 907-2B NaN
"""
Upvotes: 0
Reputation: 147216
One way to do this is to split the dataframe where the value ends in ^
and then join those split dataframes horizontally, setting the column names from the splitting values:
# find values ending with ^
m = df.iloc[:, 0].str.endswith('^')
# split the dataframe at those points
s = list(np.nonzero(m)[0]) + [len(df)]
dfs = [df.iloc[s[i]+1:s[i+1]].reset_index(drop=True) for i in range(len(s)-1)]
# append the split frame as columns
out = pd.concat(dfs, axis=1).fillna('')
# and set the column names
out.columns = df[m].iloc[:, 0].to_list()
Assumed sample data:
df = pd.DataFrame({'value': ['A^', '206-1C', '502-2B', '506-0.5C', '604-1B', '907-2B', 'G.ELITE^', 'A201-1C', 'A202-1B']})
Output:
A^ G.ELITE^
0 206-1C A201-1C
1 502-2B A202-1B
2 506-0.5C
3 604-1B
4 907-2B
Upvotes: 0
Reputation: 262284
You can try a pivot
based approach:
# get column of interest
s = df.iloc[:, 0]
# identify strings ending in "^"
m = s.str.endswith('^')
# reshape
out = (s
.to_frame(name='value')
.assign(col=s.where(m).ffill(),
index=s.groupby(m.cumsum()).cumcount())[~m]
.pivot(index='index', columns='col', values='value')
)
Output:
col A^ G.ELITE^
index
1 206-1C A201-1C
2 502-2B A202-1B
3 506-0.5C NaN
4 604-1B NaN
5 907-2B NaN
Upvotes: 2