Reputation: 181
I have a data frame (df
) with column col1
which has many rows and there are rows with a common string (Collection of numbers are
) and ending with different numbers(001, 002, 005
). I want to extract the rows between two string (Collection of numbers are 002
to Collection of numbers are 003
) and assign them to new column with same row name (Collection of numbers are 002
)
col1
0 Collection of numbers are 002
1 53
2 20
3 56
4 Collection of numbers are 003
5 236
6 325
7 Collection of numbers are 005
8 96
9 23
10 63
I want to convert the how above data frame to the following format.
0 Collection of numbers are 002 Collection of numbers are 003 Collection of numbers are 005
1 53 236 96
2 20 325 23
3 56 63
Note: No repeated numbers
Upvotes: 3
Views: 4387
Reputation: 29635
you can use set_index
and unstack
. I steal the idea of @Datanovice for extracting the name of the future columns and use groupby.cumcount
to get the future index numbers:
arrCollection = df['col1'].str.extract('(Collection.*)').ffill()[0].to_numpy()
df_f = df.set_index([df.groupby(arrCollection)['col1'].cumcount()-1,
arrCollection])['col1']\
.unstack().iloc[1:,:]
print (df_f)
Collection 002 Collection 003 Collection 005
0 53 236 96
1 20 325 23
2 56 NaN 63
Note: the name of the columns will be like in your example, I did not use the exact same input
Upvotes: 1
Reputation: 481
The answer provided by Datanovic seems good. An alternative solution would be following function:
def extract_columns(df, column, common_string):
df_list = df[column].tolist()
df_new = pd.DataFrame()
row_indices = []
cols = []
for ind, elem in enumerate(df_list):
if common_string in str(elem):
row_indices.append(ind)
cols.append(elem)
row_indices.append(len(df_list))
for ind, col in enumerate(cols):
df_new[col] = pd.Series(df_list[row_indices[ind]+1:row_indices[ind+1]])
return df_new
So with your example dataframe you will get following result when calling the function extract_columns(df, 'col1', 'Collection of numbers are')
Collection of numbers are 002 Collection of numbers are 003 Collection of numbers are 005
0 53 236.0 96
1 20 325.0 23
2 56 NaN 63
Upvotes: 0
Reputation: 528
in
col1
0 c of numbers are 002
1 1
2 2
3 3
4 c of numbers are 003
5 55
6 66
7 c of numbers are 005
8 45
9 23
10 12
11 456
12 56
for_concat = []
col = []
for i,r in df.iterrows():
if "numbers" in str(r["col1"]):
if col:
for_concat.append(pd.DataFrame(col,columns=[col_name]))
col_name = r["col1"]
col = []
else:
col_name = r["col1"]
else:
col.append(r["col1"])
for_concat.append(pd.DataFrame(col,columns=[col_name]))
out = pd.concat(for_concat, axis =1)
out:
c of numbers are 002 c of numbers are 003 c of numbers are 005
0 1.0 55.0 45
1 2.0 66.0 23
2 3.0 NaN 12
3 NaN NaN 456
4 NaN NaN 56
Upvotes: 0
Reputation: 23099
We could try ffill
and some basic resphaping using str.split
df['headers'] = df['col1'].str.extract('(Collection.*)').ffill()
df1 = df[~df['col1'].str.contains('Collection')].copy()
df1.groupby('headers').agg(','.join)['col1'].str.split(',',expand=True).T.rename_axis('',axis='columns')
out:
Collection of numbers are 002 Collection of numbers are 003 \
0 53 236
1 20 325
2 56 None
Collection of numbers are 005
0 96
1 23
2 63
Upvotes: 4