Shreyance Shaw

Reputation: 43

Extracting string from Regex in Pandas for large dataset

We have a CSV file that contains one log entry per row. We need to extract the thread name from each log entry into a separate column.

What would be the fastest way to implement this?

The approach below (pandas string functions) takes a lot of time on large datasets.

Each CSV file has a minimum of 100K entries.

This is the code that extracts the thread name:

df['thread'] = df.message.str.extract(pat=r'(\[(\w+.)+?\]|$)')[0]

The regex above picks out

[c.a.j.sprint_planning_resources.listener.RunAsyncEvent]

from the sample log entry below:

2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed:  com.atlassian.jira.event.issue.IssueEvent@5c8703d0[issue=ABC-61381,comment=<null>,worklog=<null>,changelog=[GenericEntity:ChangeGroup][issue,1443521][author,JIRAUSER39166][created,2020-12-01 05:07:36.377][id,15932782],eventTypeId=2,sendMail=true,params={eventsource=action, baseurl=https://min.com},subtasksUpdated=true,spanningOperation=Optional.empty]

Does anyone know a better or faster way to implement this?

Upvotes: 1

Views: 149

Answers (2)

MonkeyZeus

Reputation: 20747

Your regex takes a whopping 8,572 steps to complete, see https://regex101.com/r/5c3vi7/1

You can use this regex to cut the processing down to 4 steps:

\[[^\]]+\]

Note the absence of the /g modifier: only the first match is needed, so the engine can stop there.

https://regex101.com/r/6522P8/1
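
As a minimal sketch of how this pattern behaves outside regex101, the snippet below runs it with Python's re.search, which stops at the first match (the equivalent of leaving out /g). The log_line string is a shortened copy of the sample entry from the question, and the variable names are illustrative only.

```python
import re

# Shortened copy of the sample log entry from the question.
log_line = (
    "2020-12-01 05:07:36,485-0500 ForkJoinPool.commonPool-worker-30 WARN "
    "Ives_Chen 245x27568399x23 oxk7fv 10.97.200.99,127.0.0.1 /browse/MDT-206838 "
    "[c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed:  "
    "com.atlassian.jira.event.issue.IssueEvent@5c8703d0[issue=ABC-61381,comment=<null>]"
)

# Negated character class: consume everything up to the first closing bracket.
first_bracket = re.compile(r"\[[^\]]+\]")

match = first_bracket.search(log_line)  # search() returns only the first match
thread = match.group(0) if match else ""
print(thread)  # [c.a.j.sprint_planning_resources.listener.RunAsyncEvent]
```

To use the pattern with pandas str.extract, it needs to be wrapped in a capture group, for example (\[[^\]]+\]), because str.extract returns capture groups rather than the whole match.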

Upvotes: 1

Wiktor Stribiżew

Reputation: 627190

The \[(\w+.)+?\] pattern is very inefficient and may cause catastrophic backtracking: the quantifiers are nested, and the unescaped . matches any char, so it also matches everything \w does.
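
A small sketch of that overlap, using only Python's re module on a made-up two-character input: with the unescaped dot, \w+ and . can both claim a word character, whereas the escaped dot accepts only a literal period.

```python
import re

# Unescaped dot: "." can match a word character, so "\w+" and "." compete
# for the same text and the engine has several ways to split a long run.
unescaped_matches = re.fullmatch(r"\w+.", "ab") is not None   # "a" by \w+, "b" by .

# Escaped dot: only a literal "." is accepted, so "ab" cannot match.
escaped_rejects = re.fullmatch(r"\w+\.", "ab") is None
escaped_accepts = re.fullmatch(r"\w+\.", "a.") is not None

print(unescaped_matches, escaped_rejects, escaped_accepts)  # True True True
```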

You can use

df['thread'] = df['message'].str.extract(r'\[(\w+(?:\.\w+)*)]', expand=False).fillna("")

See this regex demo. Note there is no need to add $ as an alternative, since .fillna("") replaces the NA values with an empty string.

The regex matches

  • \[ - a [ char
  • (\w+(?:\.\w+)*) - Capturing group 1: one or more word chars followed by zero or more sequences of a . and one or more word chars
  • ] - a ] char.
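
The breakdown above can be checked with a small sketch that uses Python's re module. The helper name extract_thread and the sample strings are assumptions made for illustration; the pattern is the one from this answer. Note that group 1 excludes the surrounding brackets, unlike the pattern in the question.

```python
import re

# Pattern from the answer: "[", dot-separated word chunks (group 1), "]".
THREAD_RE = re.compile(r"\[(\w+(?:\.\w+)*)]")

def extract_thread(message: str) -> str:
    """Return the first bracketed dotted name, or "" when none is present."""
    match = THREAD_RE.search(message)
    return match.group(1) if match else ""

messages = [
    "/browse/MDT-206838 [c.a.j.sprint_planning_resources.listener.RunAsyncEvent] Event processed",
    "a line without any bracketed thread name",
    "changelog=[GenericEntity:ChangeGroup][issue,1443521]",  # ':' and ',' are not word chars
]

threads = [extract_thread(m) for m in messages]
print(threads)
```

The empty-string fallback in extract_thread plays the same role as .fillna("") in the pandas one-liner.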

Upvotes: 2
