Reputation: 1971
I want to split equal length string without the splitter and expand the dataframe.
Here is the test dataframe I am using:
sample1 = pd.DataFrame({
'TST': {1: 1535840000000, 2: 1535840000000},
'RCV': {1: 1535840000000, 2: 1535850000000},
'TCU': {1: 358272000000000, 2: 358272000000000},
'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'}
})
As you can see, the SPD
column contains various length string without any splitter.
I want to split the SPD
column every 4 characters into new rows, then expand them to the dataframe.
TST RCV TCU SPD
0 1535840000000 1535840000000 358272000000000 0000
1 1535840000000 1535840000000 358272000000000 0000
2 1535840000000 1535840000000 358272000000000 0000
3 1535840000000 1535840000000 358272000000000 0071
4 1535840000000 1535840000000 358272000000000 0000
5 1535840000000 1535840000000 358272000000000 007D
6 1535840000000 1535840000000 358272000000000 007C
7 1535840000000 1535840000000 358272000000000 00E2
I tried to firstly generate a Series by using this:
pd.concat([pd.Series(re.findall('....', row['SPD'])) for _, row in sample1.iterrows()]).reset_index()
which gives
index 0
0 0 0000
1 1 0000
2 2 0000
3 3 0071
4 4 0000
5 5 007D
6 6 007C
7 7 00E2
But I could not expand it back the sample1
Upvotes: 1
Views: 88
Reputation: 88236
You could split the strings in SPD
every 4
characters using str.findall
, and then unnest the resulting dataframe with unnesting
from the linked solution:
sample1['SPD'] = sample1.SPD.str.ljust(4, '0').str.findall(r'.{4}?')
unnesting(sample1, ['SPD'])
SPD TST RCV TCU
1 0000 1535840000000 1535840000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0071 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 007D 1535840000000 1535850000000 358272000000000
2 007C 1535840000000 1535850000000 358272000000000
2 00E2 1535840000000 1535850000000 358272000000000
Upvotes: 2
Reputation: 13998
use Series.str.extractall and then join with the original df.
sample1.filter(regex='^(?!SPD)').join(
sample1.SPD.str.extractall('(?P<SPD>.{4})').reset_index(level=1, drop=True)
)
# TST RCV TCU SPD
#1 1535840000000 1535840000000 358272000000000 NaN
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0071
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 007D
#2 1535840000000 1535850000000 358272000000000 007C
#2 1535840000000 1535850000000 358272000000000 00E2
use inner join(... how='inner') if you want to exclude rows having less that 4-char SPD
.
Upvotes: 1
Reputation: 402603
You can use str.findall
, then repeat
the rows based on the number of 4 character slices from SPD.
from itertools import chain
spd4 = df.pop('SPD').str.findall(r'.{4}')
(pd.DataFrame(df.values.repeat(spd4.str.len(), axis=0), columns=df.columns)
.assign(SPD=list(chain.from_iterable(spd4))))
TST RCV TCU SPD
0 1535840000000 1535850000000 358272000000000 0000
1 1535840000000 1535850000000 358272000000000 0000
2 1535840000000 1535850000000 358272000000000 0000
3 1535840000000 1535850000000 358272000000000 0071
4 1535840000000 1535850000000 358272000000000 0000
5 1535840000000 1535850000000 358272000000000 007D
6 1535840000000 1535850000000 358272000000000 007C
7 1535840000000 1535850000000 358272000000000 00E2
Upvotes: 3