extract multiple sub-fields from Pandas dataframe column into a new dataframe

Question

I have a Pandas dataframe (approx 100k rows) as my input. It is an export from a database, and each of the fields in one of the columns contain one or more records which I need to expand into independent records. For example:

record_id	text_field
0	r0_sub_record1_field1@r0_sub_record1_field2#r0_sub_record2_field1@r0_sub_record2_field2#
1	sub_record1_field1@sub_record1_field2#
2	sub_record1_field1@sub_record1_field2#sub_record2_field1@sub_record2_field2#sub_record3_field1@sub_record3_field2#

The desired result should look like this:

record_id	field1	field2	original_record_id
0	r0_sub_record1_field1	r0_sub_record1_field2	0
1	r0_sub_record2_field1	r0_sub_record2_field2	0
2	r1_sub_record1_field1	r1_sub_record1_field2	1
3	r2_sub_record1_field1	r2_sub_record1_field2	2
4	r2_sub_record2_field1	r2_sub_record2_field2	2
5	r2_sub_record3_field1	r2_sub_record3_field2	2

It is quite straight-forward how to extract the data I need using a loop, but I suspect it is not the most efficient and also not the nicest way. As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.

Is there a good Python-esque and Panda-style way to solve the problem?

I am using Python 3.7 and Pandas 1.2.1.

Corralien · Accepted Answer

Is it what you expect?

out = df['text_field'].str.strip('#').str.split('#').explode() \
                      .str.split('@').apply(pd.Series)

prefix = 'r' + out.index.map(str) + '_'

out.apply(lambda v: prefix + v).reset_index() \
   .rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})

>>> out
   original_record_id                 field1                 field2
0                   0  r0_sub_record1_field1  r0_sub_record1_field2
1                   0  r0_sub_record2_field1  r0_sub_record2_field2
2                   1  r1_sub_record1_field1  r1_sub_record1_field2
3                   2  r2_sub_record1_field1  r2_sub_record1_field2
4                   2  r2_sub_record2_field1  r2_sub_record2_field2
5                   2  r2_sub_record3_field1  r2_sub_record3_field2

extract multiple sub-fields from Pandas dataframe column into a new dataframe

Answers (2)

Related Questions