Pandas with RegExp Producing Leading and Trailing NAN columns

Question

I have some simple data in a file that I'm reading in with pandas:

2018:08:23:07:35:22:INFO:__main__:Info logger message

There are no leading or trailing tabs, spaces, etc. in the file.

I read that file into a dataframe using the following:

df = pandas.read_csv("/u01/app/DataLake/tester/tester.log", header=None, index_col=False, sep=r'(\d{4}:\d{2}:\d{2}:\d{2}:\d{2}:\d{2}):(.+):(.+):(.+)', engine='python')

However, I'm getting the following:

>>> print(df)
     0                    1        2         3                       4   5
0  NaN  2018:08:23:07:35:22     INFO  __main__     Info logger message NaN

Where are the first and last columns (the NaN values) coming from?

Python: 3.4.8 Pandas: 0.19.2

Qusai Alothman · Accepted Answer

I'm actually surprised that your regex even worked!
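The sep parameter is for identifying where to split, not what tokens to recognize. Your pattern matches the whole line, so nothing is left on either side of the "separator", and because it contains capturing groups, the captured text is kept as fields. That leaves an empty field before and after the four groups, which pandas shows as NaN.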
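
The empty columns can be reproduced outside pandas. This is a minimal sketch that assumes read_csv with engine='python' applies a regex separator to each line much as re.split does, which is consistent with the output in the question. The variable names are only for illustration.

```python
import re

# Sample line and separator pattern, both taken from the question.
line = "2018:08:23:07:35:22:INFO:__main__:Info logger message"
pattern = r'(\d{4}:\d{2}:\d{2}:\d{2}:\d{2}:\d{2}):(.+):(.+):(.+)'

# re.split keeps the text of capturing groups in its result, and it also
# returns whatever lies before and after each match of the separator.
parts = re.split(pattern, line)
print(parts)
print(len(parts))  # 6 fields: one empty, four groups, one empty
```

The first and last items are empty strings, which correspond to columns 0 and 5 in the printed DataFrame.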
What you really want (or rather, an equivalent of what you want) is a regex that can:

Split on every space.
Split on every :, unless the next two characters are digits followed by another :.

That can be achieved using some more advanced regex matching, specifically a negative lookahead, written (?!...), which matches only where the enclosed pattern does not follow. See this page for a detailed explanation of that.
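
To see which colons the lookahead lets through, here is a small sketch using only the standard re module on the sample line from the question. The names split_points and prefixes are hypothetical.

```python
import re

line = "2018:08:23:07:35:22:INFO:__main__:Info logger message"

# A colon matches only when it is NOT followed by two digits and another
# colon, so the colons inside the timestamp are skipped.
split_points = [m.start() for m in re.finditer(r':(?!\d{2}:)', line)]
print(split_points)

# Text up to each accepted colon, to show where the splits would fall.
prefixes = [line[:i] for i in split_points]
for prefix in prefixes:
    print(prefix)
```

Only three colons qualify: the one that ends the timestamp, the one after INFO, and the one after __main__.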

This should work for your example:

import pandas as pd
pd.read_csv(path_to_csv, sep=r' |:(?!\d{2}:)', header=None, engine='python')
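
As a rough check of what this separator does to the sample line, the sketch below applies it with re.split, again assuming that pandas' python engine splits each line in a comparable way. The colon-only variant is not part of the answer; it is a hypothetical alternative for keeping the message in a single field.

```python
import re

line = "2018:08:23:07:35:22:INFO:__main__:Info logger message"

# Separator from the answer: a space, or a colon that is not followed by
# two digits and another colon.
fields = re.split(r' |:(?!\d{2}:)', line)
print(fields)
# ['2018:08:23:07:35:22', 'INFO', '__main__', 'Info', 'logger', 'message']

# Hypothetical variant: drop the space rule so the message is not broken
# into separate words.
fields_colon_only = re.split(r':(?!\d{2}:)', line)
print(fields_colon_only)
# ['2018:08:23:07:35:22', 'INFO', '__main__', 'Info logger message']
```

Note that the separator from the answer yields six fields, because the message itself contains spaces. With the colon-only variant the message stays whole, but a message that contains its own colons would still be split, so neither form is a general log parser.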
