Parsing multiline logs using pyspark with regexp

Question

I am struggling to use PySpark to split a log file that may contain multiline events into a DataFrame. The multiline events are what I need help with.

The log file has the format

2020-04-03T14:12:24,368 DEBUG [main] blabla bla bla bla
2020-04-03T14:12:24,371 DEBUG [main] bla bla bla bla 
2020-04-03T14:12:24,348 DEBUG [Thread-2] multiline log line bla bla 
bla bla bla
bla bla
    blablabla
2020-04-03T14:12:24,377 DEBUG [main] blabla bla bla bla

To split a single line into groups, I can simply use

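from pyspark.sql.functions import regexp_extract

# Groups: date, time, log level, [thread], message
log_pattern = r'(\d*-\d*-\d*)T(\d*:\d*:\d*,\d*)[ ]{1,}(DEBUG|INFO|WARN|FATAL|ERROR|TRACE)[ ]{1,}(\[.*\])[ ]{1,}(.*)'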
logs_df = base_df.select(regexp_extract('value', log_pattern, 1).alias('date'),
                         regexp_extract('value', log_pattern, 2).alias('timestamp'),
                         regexp_extract('value', log_pattern, 3).alias('log_level'),
                         regexp_extract('value', log_pattern, 4).alias('application'),
                         regexp_extract('value', log_pattern, 5).alias('log_content'))
logs_df.show(10, truncate=True)

Output:

+----------+------------+---------+-----------+--------------------+
|      date|   timestamp|log_level|application|         log_content|
+----------+------------+---------+-----------+--------------------+
|2020-04-08|00:35:12,014|     INFO|     [main]|Log4J2Helper:68 -...|
|2020-04-08|00:35:12,014|     INFO|     [main]|Log4J2Helper:69 -...|
....

What I want is for log_content to contain the whole multiline log event. However, I don't understand how to split the lines so that multiline events are kept together. I've tried splitting and regexp lookahead, but I can't seem to get it right.

spark.read.text seems to have an option for a custom line delimiter, but it cannot take a regexp.

I thought of just parsing with the re module first, but since the log files are gigabytes in size, I would probably run into memory and processing problems.

Can someone point me to how I should handle these large multiline log files?

The fourth bird · Accepted Answer

You could capture all following lines that do not start with, for example, one or more digits and a hyphen, by using a negative lookahead in the last group: (.*(?:\r?\n(?!\d+-).*)*)

Note that \d*-\d*-\d* could also match -- on its own, as the quantifier * matches 0 or more times.
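
A quick check of that point, as a minimal sketch using Python's re module (chosen here only for illustration; the quantifiers * and + behave the same way in the Java regex engine that Spark uses):

```python
import re

loose = re.compile(r'\d*-\d*-\d*')   # date part from the question
strict = re.compile(r'\d+-\d+-\d+')  # at least one digit in every part

# With *, a string made only of hyphens is accepted as a "date".
print(bool(loose.fullmatch('--')))           # True
print(bool(strict.fullmatch('--')))          # False
print(bool(strict.fullmatch('2020-04-03')))  # True
```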

This part \[.*\] can be written using a negated character class \[[^][]*\] to prevent overmatching and make it a bit more performant.
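
To illustrate the overmatching, here is a small sketch with Python's re module. The log line is made up for the example, and the brackets inside the character class are written in escaped form ([^\]\[]) on the assumption that this reads the same across regex engines:

```python
import re

# Hypothetical log line whose message contains a second pair of brackets.
line = '2020-04-03T14:12:24,368 DEBUG [main] start [worker] done'

greedy = re.search(r'(\[.*\])[ ]+(.*)', line)
negated = re.search(r'(\[[^\]\[]*\])[ ]+(.*)', line)

# The greedy version runs on to the last closing bracket on the line.
print(greedy.group(1))   # [main] start [worker]
# The negated class stops at the first closing bracket.
print(negated.group(1))  # [main]
```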

(\d+-\d+-\d+)T(\d+:\d+:\d+,\d+)[ ]+(DEBUG|INFO|WARN|FATAL|ERROR|TRACE)[ ]+(\[[^][]*\])[ ]+(.*(?:\r?\n(?!\d+-).*)*)
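
As a rough check, the pattern can be tried against the sample from the question with Python's re module. This is only a sketch outside Spark: the sample is abbreviated, and the brackets inside the character class are escaped here as an assumption to keep the pattern unambiguous across engines.

```python
import re

# Same pattern as above, split over several lines for readability.
log_pattern = re.compile(
    r'(\d+-\d+-\d+)T(\d+:\d+:\d+,\d+)[ ]+'
    r'(DEBUG|INFO|WARN|FATAL|ERROR|TRACE)[ ]+'
    r'(\[[^\]\[]*\])[ ]+'
    r'(.*(?:\r?\n(?!\d+-).*)*)'
)

# Abbreviated sample from the question, joined without a trailing newline.
sample = "\n".join([
    "2020-04-03T14:12:24,368 DEBUG [main] blabla bla bla bla",
    "2020-04-03T14:12:24,348 DEBUG [Thread-2] multiline log line bla bla",
    "bla bla bla",
    "    blablabla",
    "2020-04-03T14:12:24,377 DEBUG [main] blabla bla bla bla",
])

# Each match is one log event; continuation lines end up in the last group.
events = log_pattern.findall(sample)
for date, time, level, thread, content in events:
    print(date, time, level, thread, repr(content))
```

This yields three events, and the second one keeps its continuation lines in the last group. In Spark the pattern can only span lines if a whole event sits in a single row, for example when the file is read with spark.read.text(path, wholetext=True). That option loads each file as one row, so it may not be a good fit for gigabyte-sized files.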

