Reputation: 912
My data looks like this:
A_1 A_2 B_1 B_2 C_1 C_2
DP0001 11:01:01 11:01:01 15:35 51:02:01 07:02:04 15:02:01:01
DP0002 02:01:01:02 26:01:01 35:01:01:01 40:01:02 04:82 08:01:01
DP0003 11:01:01 11:01:01:01 15:02:01 56:04 01:02:01 08:01:01
DP0004 11:01:01 34:01:01 15:02:01 15:12 03:03:01:03 08:01:01
I want to extract the string before at most 3 :
. The desire output:
A_1 A_2 B_1 B_2 C_1 C_2
DP0001 11:01:01 11:01:01 15:35 51:02:01 07:02:04 15:02:01
DP0002 02:01:01 26:01:01 35:01:01 40:01:02 04:82 08:01:01
DP0003 11:01:01 11:01:01 15:02:01 56:04 01:02:01 08:01:01
DP0004 11:01:01 34:01:01 15:02:01 15:12 03:03:01 08:01:01
I tried multiple ways but none is successful so far:
df.stack().str.extract('^([\w]+:[\w]+:[\w]+)').unstack()[0]
However, it cannot capture string with only 1 :
and return NaN for such case.
def six_digits(df_columns):
return df_columns.str.split(':', expand=True).iloc[:, :3].apply(':'.join, axis=1)
df.apply(lambda x: six_digits(x))
Then it encounters None
at cell with only :
and throws an error.
Could you please advise me how to deal with this case? Thank you
Upvotes: 2
Views: 386
Reputation: 133650
With your shown samples, could you please try following. Simple explanation would be: using replace function and mentioning regex to match 2 digits colon 2 digits colon 2 digits and leaving everything else as its not needed as per OP's samples.
df.astype(str).replace(r'^((?:\d{2}:){2}\d{2})(.*)', r'\1', regex=True)
Output will be as follows:
A_1 A_2 B_1 B_2 C_1 C_2
DP0001 11:01:01 11:01:01 15:35 51:02:01 07:02:04 15:02:01
DP0002 02:01:01 26:01:01 35:01:01 40:01:02 04:82 08:01:01
DP0003 11:01:01 11:01:01 15:02:01 56:04 01:02:01 08:01:01
DP0004 11:01:01 34:01:01 15:02:01 15:12 03:03:01 08:01:01
Upvotes: 2
Reputation: 71687
DataFrame.replace
df.replace(r'^(\w+:\w+:\w+)(:.+)?', r'\1', regex=True)
A_1 A_2 B_1 B_2 C_1 C_2
DP0001 11:01:01 11:01:01 15:35 51:02:01 07:02:04 15:02:01
DP0002 02:01:01 26:01:01 35:01:01 40:01:02 04:82 08:01:01
DP0003 11:01:01 11:01:01 15:02:01 56:04 01:02:01 08:01:01
DP0004 11:01:01 34:01:01 15:02:01 15:12 03:03:01 08:01:01
Regex details:
^
: Assert position at the start of line(\w+:\w+:\w+)
: First capturing group
\w+
: Matches any word character one or more times:
: Matches the character :
literally(:.+)
: Second capturing group
:
: Matches the character :
literally.+
: Matches any character one or more times?
: Matches the previous token between zero and one timeSee the online regex demo
Upvotes: 4