Pandas: Extract string after maximum nth delimiter

Question

My data looks like this:

        A_1         A_2         B_1         B_2         C_1         C_2
DP0001  11:01:01    11:01:01    15:35       51:02:01    07:02:04    15:02:01:01
DP0002  02:01:01:02 26:01:01    35:01:01:01 40:01:02    04:82       08:01:01
DP0003  11:01:01    11:01:01:01 15:02:01    56:04       01:02:01    08:01:01
DP0004  11:01:01    34:01:01    15:02:01    15:12       03:03:01:03 08:01:01

I want to extract the string before at most 3 :. The desire output:

        A_1         A_2         B_1         B_2         C_1         C_2
DP0001  11:01:01    11:01:01    15:35       51:02:01    07:02:04    15:02:01
DP0002  02:01:01    26:01:01    35:01:01    40:01:02    04:82       08:01:01
DP0003  11:01:01    11:01:01    15:02:01    56:04       01:02:01    08:01:01
DP0004  11:01:01    34:01:01    15:02:01    15:12       03:03:01    08:01:01

I tried multiple ways but none is successful so far:

Using regex:

    df.stack().str.extract('^([\w]+:[\w]+:[\w]+)').unstack()[0]

However, it cannot capture string with only 1 : and return NaN for such case.

Using split then join strategy

    def six_digits(df_columns):
        return df_columns.str.split(':', expand=True).iloc[:, :3].apply(':'.join, axis=1)
    
    df.apply(lambda x: six_digits(x))

Then it encounters None at cell with only : and throws an error.

Could you please advise me how to deal with this case? Thank you

Shubham Sharma · Accepted Answer

`DataFrame.replace`

df.replace(r'^(\w+:\w+:\w+)(:.+)?', r'\1', regex=True)

             A_1       A_2       B_1       B_2       C_1       C_2
DP0001  11:01:01  11:01:01     15:35  51:02:01  07:02:04  15:02:01
DP0002  02:01:01  26:01:01  35:01:01  40:01:02     04:82  08:01:01
DP0003  11:01:01  11:01:01  15:02:01     56:04  01:02:01  08:01:01
DP0004  11:01:01  34:01:01  15:02:01     15:12  03:03:01  08:01:01

Regex details:

^ : Assert position at the start of line
(\w+:\w+:\w+) : First capturing group
- \w+ : Matches any word character one or more times
- : : Matches the character : literally
(:.+) : Second capturing group
- : : Matches the character : literally
- .+ : Matches any character one or more times
? : Matches the previous token between zero and one time

See the online regex demo

Pandas: Extract string after maximum nth delimiter

Answers (2)

`DataFrame.replace`

Related Questions