Reputation: 457
I have some simple data in a file that I'm reading in with pandas:
2018:08:23:07:35:22:INFO:__main__:Info logger message
There are no leading or trailing tabs, spaces, etc. in the file.
I read that file into a dataframe using the following:
df = pandas.read_csv("/u01/app/DataLake/tester/tester.log", header=None, index_col=False, sep=r'(\d{4}:\d{2}:\d{2}:\d{2}:\d{2}:\d{2}):(.+):(.+):(.+)',engine='python')
However, I'm getting the following:
>>> print(df)
     0                    1     2         3                    4    5
0  NaN  2018:08:23:07:35:22  INFO  __main__  Info logger message  NaN
Where are the first and last columns (the NaN values) coming from?
Python: 3.4.8 Pandas: 0.19.2
Upvotes: 0
Views: 69
Reputation: 2072
I'm actually surprised that your regex even worked!
The sep parameter is for identifying where to split, not what tokens to recognize.
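That split behaviour also accounts for the two NaN columns in the question. Here is a minimal sketch using the standard library's re.split, on the assumption that pandas' python engine splits each line with the sep pattern in much the same way. Because the pattern matches the entire line, the text before and after the "separator" is empty, and the four capture groups are returned in between:

```python
import re

# Sample line and sep pattern copied from the question.
line = "2018:08:23:07:35:22:INFO:__main__:Info logger message"
sep = r'(\d{4}:\d{2}:\d{2}:\d{2}:\d{2}:\d{2}):(.+):(.+):(.+)'

# re.split returns the text around each match, plus any capture groups.
# The whole line is one match here, so the surrounding text is empty.
parts = re.split(sep, line)

print(len(parts))  # 6
print(parts)
# ['', '2018:08:23:07:35:22', 'INFO', '__main__', 'Info logger message', '']
```

The empty strings at either end are what would show up as NaN in columns 0 and 5.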
What you really want (actually, an equivalent for what you want) is a regex that splits on a colon, unless the next 2 characters are digits followed by another colon. That can be achieved using some advanced regex matching, specifically "lookahead". See this page for a detailed explanation of that.
This should work for your example:
pd.read_csv(path_to_csv, sep=r' |:(?!\d{2}:)', header=None, engine='python')
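To see what the lookahead separator does to the sample line, here is a rough sketch using re.split directly rather than pandas. The colon-only variant is my own assumption, not part of the answer above. It is included because the leading space alternative also splits the message text into separate words:

```python
import re

# Sample line copied from the question.
line = "2018:08:23:07:35:22:INFO:__main__:Info logger message"

# Separator from the answer: a space, or a colon that is NOT
# followed by two digits and another colon (negative lookahead).
with_space = re.split(r' |:(?!\d{2}:)', line)
print(with_space)
# ['2018:08:23:07:35:22', 'INFO', '__main__', 'Info', 'logger', 'message']

# Hypothetical variant without the space alternative, which keeps
# the message in a single field.
colon_only = re.split(r':(?!\d{2}:)', line)
print(colon_only)
# ['2018:08:23:07:35:22', 'INFO', '__main__', 'Info logger message']
```

With either pattern the timestamp stays intact, since each of its inner colons is followed by two digits and a colon. A message that itself contains a colon would generally still be split further, so the colon-only form is only safe if the message text is known to be colon-free.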
Upvotes: 1