Reputation: 95
I have a Pandas dataframe (approx 100k rows) as my input. It is an export from a database, and each of the fields in one of the columns contain one or more records which I need to expand into independent records. For example:
record_id | text_field |
---|---|
0 | r0_sub_record1_field1@r0_sub_record1_field2#r0_sub_record2_field1@r0_sub_record2_field2# |
1 | sub_record1_field1@sub_record1_field2# |
2 | sub_record1_field1@sub_record1_field2#sub_record2_field1@sub_record2_field2#sub_record3_field1@sub_record3_field2# |
The desired result should look like this:
record_id | field1 | field2 | original_record_id |
---|---|---|---|
0 | r0_sub_record1_field1 | r0_sub_record1_field2 | 0 |
1 | r0_sub_record2_field1 | r0_sub_record2_field2 | 0 |
2 | r1_sub_record1_field1 | r1_sub_record1_field2 | 1 |
3 | r2_sub_record1_field1 | r2_sub_record1_field2 | 2 |
4 | r2_sub_record2_field1 | r2_sub_record2_field2 | 2 |
5 | r2_sub_record3_field1 | r2_sub_record3_field2 | 2 |
It is quite straight-forward how to extract the data I need using a loop, but I suspect it is not the most efficient and also not the nicest way. As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.
Is there a good Python-esque and Panda-style way to solve the problem?
I am using Python 3.7 and Pandas 1.2.1.
Upvotes: 1
Views: 477
Reputation: 120559
Is it what you expect?
out = df['text_field'].str.strip('#').str.split('#').explode() \
.str.split('@').apply(pd.Series)
prefix = 'r' + out.index.map(str) + '_'
out.apply(lambda v: prefix + v).reset_index() \
.rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})
>>> out
original_record_id field1 field2
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
2 1 r1_sub_record1_field1 r1_sub_record1_field2
3 2 r2_sub_record1_field1 r2_sub_record1_field2
4 2 r2_sub_record2_field1 r2_sub_record2_field2
5 2 r2_sub_record3_field1 r2_sub_record3_field2
Upvotes: 1
Reputation: 23099
I think you need to explode
based on #
then split
the @
text.
df1 = df.assign(t=df['text_field'].str.split('#')
).drop('text_field',1).explode('t').reset_index(drop=True)
df2 = df1.join(df1['t'].str.split('@',expand=True)).drop('t',1)
print(df2.dropna())
record_id 0 1
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
3 1 sub_record1_field1 sub_record1_field2
5 2 sub_record1_field1 sub_record1_field2
6 2 sub_record2_field1 sub_record2_field2
7 2 sub_record3_field1 sub_record3_field2
Upvotes: 3