Python pandas read dataframe from custom file format

Question

Using Python 3 and pandas 0.19.2

I have a log file formatted this way:

[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]
...

...which is very much like a CSV excepted that each value is surrounded by [ and ] and there is no real delimiter. What would be the most efficient way to load that content into a pandas DataFrame ?

jezrael · Accepted Answer

You can use read_csv with separator ][ which has to be escape by \. Then replace columns and values and remove row with all NaN by dropna:

import pandas as pd
from pandas.compat import StringIO

temp=u"""[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]"""

#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="\]$$", engine='python')
df.columns = df.columns.to_series().replace(['^\[', '$$$'],['',''], regex=True)
df = df.replace(['^$$', '$$$', '=', ''], ['', '', np.nan, np.nan], regex=True)
df = df.dropna(how='all')
print (df)
         Header1           Header2 Header3 HeaderN
1         Value1            Value2  Value3  ValueN
2  AnotherValue1  ValuesCanBeEmpty     NaN     NaN

print (df.columns)
Index(['Header1', 'Header2', 'Header3', 'HeaderN'], dtype='object')

Python pandas read dataframe from custom file format

Answers (2)

Related Questions