Reputation: 43
We have a CSV file that contains one log entry per row. We need to extract the thread name from each log entry into a separate column.
What would be the fastest way to implement this?
The approach below (string functions) takes a lot of time on large datasets.
Each CSV file has at least 100K entries.
This is the code that extracts the thread name:
df['thread'] = df.message.str.extract(pat = '(\[(\w+.)+?\]|$)')[0]
Below is a sample log entry. The regex above picks out [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] from it:
2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed: com.atlassian.jira.event.issue.IssueEvent@5c8703d0[issue=ABC-61381,comment=<null>,worklog=<null>,changelog=[GenericEntity:ChangeGroup][issue,1443521][author,JIRAUSER39166][created,2020-12-01 05:07:36.377][id,15932782],eventTypeId=2,sendMail=true,params={eventsource=action, baseurl=https://min.com},subtasksUpdated=true,spanningOperation=Optional.empty]
Does anyone know a better or faster way to do this?
Upvotes: 1
Views: 149
Reputation: 20747
Your regex takes a whopping 8,572 steps to complete, see https://regex101.com/r/5c3vi7/1
You can use this regex to significantly cut down the regex processing to 4 steps:
\[[^\]]+\]
Do notice the absence of the /g modifier.
https://regex101.com/r/6522P8/1
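As a rough illustration of the negated character class, here is a minimal sketch using Python's standard `re` module on a shortened copy of the sample log entry. The helper name `first_bracketed` is hypothetical, and the parentheses around `[^\]]+` are an addition not present in the pattern above: they are there so the brackets themselves are dropped, and because pandas `str.extract` requires at least one capture group.

```python
import re

# Shortened version of the sample log entry from the question.
line = (
    "2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN "
    "Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 "
    "[c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed: "
    "com.atlassian.jira.event.issue.IssueEvent@5c8703d0[issue=ABC-61381]"
)

# Negated character class: after "[", consume anything that is not "]".
# The capture group is an assumption added for this sketch.
BRACKETED = re.compile(r"\[([^\]]+)\]")


def first_bracketed(text):
    """Return the contents of the first [...] span, or "" if there is none."""
    match = BRACKETED.search(text)
    return match.group(1) if match else ""


thread = first_bracketed(line)
print(thread)
```

Note that this pattern returns whatever sits inside the first pair of brackets, so it relies on the thread name being the first bracketed span in every log entry.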
Upvotes: 1
Reputation: 627190
The \[(\w+.)+?\] pattern is very inefficient and may cause catastrophic backtracking, because it nests quantifiers around an unescaped . that matches any character, and thus also matches what \w does.
You can use
df['thread'] = df['message'].str.extract(r'\[(\w+(?:\.\w+)*)]', expand=False).fillna("")
See this regex demo. Note that there is no need to add $ as an alternative, since .fillna("") will replace NA with an empty string.
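To try the pattern outside pandas, the following is a minimal sketch with the standard `re` module. The helper name `extract_thread` and both messages are hypothetical, shortened from the sample log entry; the empty-string fallback is meant to mirror what `.fillna("")` does in the pandas version.

```python
import re

# Pattern from the answer: "[", dot-separated word tokens, "]".
THREAD = re.compile(r"\[(\w+(?:\.\w+)*)]")


def extract_thread(message):
    """Return the dotted name inside the first matching [...], or ""."""
    match = THREAD.search(message)
    return match.group(1) if match else ""


# Hypothetical messages modelled on the sample log entry.
with_thread = "WARN /browse/MDT-206838 [c.a.j.listener.RunAsyncEvent] Event processed"
without_thread = "WARN IssueEvent@5c8703d0[issue=ABC-61381,comment=<null>]"

print(extract_thread(with_thread))
print(repr(extract_thread(without_thread)))
```

Because the group only accepts word characters separated by single dots, a bracketed span such as `[issue=ABC-61381,comment=<null>]` is skipped rather than returned as a thread name.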
The regex matches:
\[ - a [ char
(\w+(?:\.\w+)*) - Capturing group 1: one or more word chars followed by zero or more sequences of a . and one or more word chars
] - a ] char.
Upvotes: 2