Anders G.
Anders G.

Reputation: 45

Splitting a pandas column every n characters

I have a dataframe where some columns contain long strings (e.g. 30000 characters). I would like to split these columns every 4000 characters so that I end up with a range of new columns containing strings of length at most 4000. I have an upper bound on the string lengths so I know there should be at most 9 new columns. I would like there to always be 9 new columns, having None/NaN in columns where the string is shorter.

As an example (with n = 10 instead of 4000 and 3 columns instead of 9), let's say I have the dataframe:

df_test = pd.DataFrame({'id': [1, 2, 3],
                        'str_1': ['This is a long string', 'This is an even longer string', 'This is the longest string of them all'],
                        'str_2': ['This is also a long string', 'a short string', 'mini_str']})

    id  str_1                                   str_2
0   1   This is a long string                   This is also a long string
1   2   This is an even longer string           a short string
2   3   This is the longest string of them all  mini_str

In this case I want to get the result

    id  str_1_1     str_1_2     str_1_3     str_1_4   str_2_1     str_2_2     str_2_3
0   1   This is a   long strin  g           NaN       This is al  so a long   string
1   2   This is an   even long  er string   NaN       a short st  ring        NaN
2   3   This is th  e longest   string of   them all  mini_str    NaN         NaN

Here, I want e.g. first row, column str_1_3 to be a string of length 1.

I tried using

df_test['str_1'].str.split(r".{10}", expand=True, n=10)

but that didn't work. It gave this as result

    0   1   2           3
0           g           None
1           er string   None
2                       them all

where the first columns aren't filled.

I also tried looping through every row and inserting '|' every 10 characters and then splitting on '|' but that seems tedious and slow.

Any help is appreciated.

Upvotes: 3

Views: 2975

Answers (1)

C.K.
C.K.

Reputation: 1571

The answer is quite simple, that is, insert a delimiter and split it.

For example, use | as the delimiter and let n = 4:

series = pd.Series(['This is an even longer string', 'This is the longest string of them all'],name='str1')
name = series.name
cols = series.str.replace('(.{10})', r'\1|').str.split('|', n=4, expand=True).add_prefix(f'{name}_')

That is, use str.replace to add delimiter, use str.split to split them apart and use add_prefix to add the prefixes.

The output will be:

    str1_0      str1_1      str1_2      str1_3
0   This is an  even long   er string   None
1   This is th  e longest   string of   them all

The reason why str.split('.{10}') doesn't work is that the pat param in the function str.split is a pattern to match the strings as split delimiters but not strings that should be in splited results. Therefore, with str.split('.{10}'), you get one character every 10-th chars.

UPDATE: Accroding to the suggestion from @AKX, use \x1F as a better delimiter:

cols = series.str.replace('(.{10})', '\\1\x1F').str.split('\x1F', n=4, expand=True).add_prefix(f'{name}_')

Note the absence of the r string flags.

Upvotes: 2

Related Questions