Reputation: 45
I have a dataframe where some columns contain long strings (e.g. 30000 characters). I would like to split these columns every 4000 characters so that I end up with a range of new columns containing strings of length at most 4000. I have an upper bound on the string lengths so I know there should be at most 9 new columns. I would like there to always be 9 new columns, having None/NaN in columns where the string is shorter.
As an example (with n = 10 instead of 4000 and 3 columns instead of 9), let's say I have the dataframe:
df_test = pd.DataFrame({'id': [1, 2, 3],
'str_1': ['This is a long string', 'This is an even longer string', 'This is the longest string of them all'],
'str_2': ['This is also a long string', 'a short string', 'mini_str']})
id str_1 str_2
0 1 This is a long string This is also a long string
1 2 This is an even longer string a short string
2 3 This is the longest string of them all mini_str
In this case I want to get the result
id str_1_1 str_1_2 str_1_3 str_1_4 str_2_1 str_2_2 str_2_3
0 1 This is a long strin g NaN This is al so a long string
1 2 This is an even long er string NaN a short st ring NaN
2 3 This is th e longest string of them all mini_str NaN NaN
Here, I want e.g. first row, column str_1_3 to be a string of length 1.
I tried using
df_test['str_1'].str.split(r".{10}", expand=True, n=10)
but that didn't work. It gave this as result
0 1 2 3
0 g None
1 er string None
2 them all
where the first columns aren't filled.
I also tried looping through every row and inserting '|' every 10 characters and then splitting on '|' but that seems tedious and slow.
Any help is appreciated.
Upvotes: 3
Views: 2975
Reputation: 1571
The answer is quite simple, that is, insert a delimiter and split it.
For example, use |
as the delimiter and let n = 4
:
series = pd.Series(['This is an even longer string', 'This is the longest string of them all'],name='str1')
name = series.name
cols = series.str.replace('(.{10})', r'\1|').str.split('|', n=4, expand=True).add_prefix(f'{name}_')
That is, use str.replace
to add delimiter, use str.split
to split them apart and use add_prefix
to add the prefixes.
The output will be:
str1_0 str1_1 str1_2 str1_3
0 This is an even long er string None
1 This is th e longest string of them all
The reason why str.split('.{10}')
doesn't work is that the pat
param in the function str.split
is a pattern to match the strings as split delimiters but not strings that should be in splited results. Therefore, with str.split('.{10}')
, you get one character every 10-th chars.
UPDATE: Accroding to the suggestion from @AKX, use \x1F
as a better delimiter:
cols = series.str.replace('(.{10})', '\\1\x1F').str.split('\x1F', n=4, expand=True).add_prefix(f'{name}_')
Note the absence of the r
string flags.
Upvotes: 2