Thanh Nguyen
Thanh Nguyen

Reputation: 912

Pandas: Extract string after maximum nth delimiter

My data looks like this:

        A_1         A_2         B_1         B_2         C_1         C_2
DP0001  11:01:01    11:01:01    15:35       51:02:01    07:02:04    15:02:01:01
DP0002  02:01:01:02 26:01:01    35:01:01:01 40:01:02    04:82       08:01:01
DP0003  11:01:01    11:01:01:01 15:02:01    56:04       01:02:01    08:01:01
DP0004  11:01:01    34:01:01    15:02:01    15:12       03:03:01:03 08:01:01

I want to extract the string before at most 3 :. The desire output:

        A_1         A_2         B_1         B_2         C_1         C_2
DP0001  11:01:01    11:01:01    15:35       51:02:01    07:02:04    15:02:01
DP0002  02:01:01    26:01:01    35:01:01    40:01:02    04:82       08:01:01
DP0003  11:01:01    11:01:01    15:02:01    56:04       01:02:01    08:01:01
DP0004  11:01:01    34:01:01    15:02:01    15:12       03:03:01    08:01:01

I tried multiple ways but none is successful so far:

  1. Using regex:
    df.stack().str.extract('^([\w]+:[\w]+:[\w]+)').unstack()[0]

However, it cannot capture string with only 1 : and return NaN for such case.

  1. Using split then join strategy
    def six_digits(df_columns):
        return df_columns.str.split(':', expand=True).iloc[:, :3].apply(':'.join, axis=1)
    
    df.apply(lambda x: six_digits(x))

Then it encounters None at cell with only : and throws an error.

Could you please advise me how to deal with this case? Thank you

Upvotes: 2

Views: 386

Answers (2)

RavinderSingh13
RavinderSingh13

Reputation: 133650

With your shown samples, could you please try following. Simple explanation would be: using replace function and mentioning regex to match 2 digits colon 2 digits colon 2 digits and leaving everything else as its not needed as per OP's samples.

df.astype(str).replace(r'^((?:\d{2}:){2}\d{2})(.*)', r'\1', regex=True)

Output will be as follows:

             A_1       A_2       B_1       B_2       C_1       C_2
DP0001  11:01:01  11:01:01     15:35  51:02:01  07:02:04  15:02:01
DP0002  02:01:01  26:01:01  35:01:01  40:01:02     04:82  08:01:01
DP0003  11:01:01  11:01:01  15:02:01     56:04  01:02:01  08:01:01
DP0004  11:01:01  34:01:01  15:02:01     15:12  03:03:01  08:01:01

Upvotes: 2

Shubham Sharma
Shubham Sharma

Reputation: 71687

DataFrame.replace

df.replace(r'^(\w+:\w+:\w+)(:.+)?', r'\1', regex=True)

             A_1       A_2       B_1       B_2       C_1       C_2
DP0001  11:01:01  11:01:01     15:35  51:02:01  07:02:04  15:02:01
DP0002  02:01:01  26:01:01  35:01:01  40:01:02     04:82  08:01:01
DP0003  11:01:01  11:01:01  15:02:01     56:04  01:02:01  08:01:01
DP0004  11:01:01  34:01:01  15:02:01     15:12  03:03:01  08:01:01

Regex details:

  • ^ : Assert position at the start of line
  • (\w+:\w+:\w+) : First capturing group
    • \w+ : Matches any word character one or more times
    • : : Matches the character : literally
  • (:.+) : Second capturing group
    • : : Matches the character : literally
    • .+ : Matches any character one or more times
  • ? : Matches the previous token between zero and one time

See the online regex demo

Upvotes: 4

Related Questions