Raits
Raits

Reputation: 95

extract multiple sub-fields from Pandas dataframe column into a new dataframe

I have a Pandas dataframe (approx 100k rows) as my input. It is an export from a database, and each of the fields in one of the columns contain one or more records which I need to expand into independent records. For example:

record_id text_field
0 r0_sub_record1_field1@r0_sub_record1_field2#r0_sub_record2_field1@r0_sub_record2_field2#
1 sub_record1_field1@sub_record1_field2#
2 sub_record1_field1@sub_record1_field2#sub_record2_field1@sub_record2_field2#sub_record3_field1@sub_record3_field2#

The desired result should look like this:

record_id field1 field2 original_record_id
0 r0_sub_record1_field1 r0_sub_record1_field2 0
1 r0_sub_record2_field1 r0_sub_record2_field2 0
2 r1_sub_record1_field1 r1_sub_record1_field2 1
3 r2_sub_record1_field1 r2_sub_record1_field2 2
4 r2_sub_record2_field1 r2_sub_record2_field2 2
5 r2_sub_record3_field1 r2_sub_record3_field2 2

It is quite straight-forward how to extract the data I need using a loop, but I suspect it is not the most efficient and also not the nicest way. As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.

Is there a good Python-esque and Panda-style way to solve the problem?

I am using Python 3.7 and Pandas 1.2.1.

Upvotes: 1

Views: 477

Answers (2)

Corralien
Corralien

Reputation: 120559

Is it what you expect?

out = df['text_field'].str.strip('#').str.split('#').explode() \
                      .str.split('@').apply(pd.Series)

prefix = 'r' + out.index.map(str) + '_'

out.apply(lambda v: prefix + v).reset_index() \
   .rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})
>>> out
   original_record_id                 field1                 field2
0                   0  r0_sub_record1_field1  r0_sub_record1_field2
1                   0  r0_sub_record2_field1  r0_sub_record2_field2
2                   1  r1_sub_record1_field1  r1_sub_record1_field2
3                   2  r2_sub_record1_field1  r2_sub_record1_field2
4                   2  r2_sub_record2_field1  r2_sub_record2_field2
5                   2  r2_sub_record3_field1  r2_sub_record3_field2

Upvotes: 1

Umar.H
Umar.H

Reputation: 23099

I think you need to explode based on # then split the @ text.

df1 = df.assign(t=df['text_field'].str.split('#')
                 ).drop('text_field',1).explode('t').reset_index(drop=True)

df2 = df1.join(df1['t'].str.split('@',expand=True)).drop('t',1)

print(df2.dropna())

   record_id                      0                      1
0          0  r0_sub_record1_field1  r0_sub_record1_field2
1          0  r0_sub_record2_field1  r0_sub_record2_field2
3          1     sub_record1_field1     sub_record1_field2
5          2     sub_record1_field1     sub_record1_field2
6          2     sub_record2_field1     sub_record2_field2
7          2     sub_record3_field1     sub_record3_field2

Upvotes: 3

Related Questions