Reputation: 6009
Using Python 3 and pandas 0.19.2
I have a log file formatted this way:
[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]
...
...which is very much like a CSV, except that each value is surrounded by [ and ] and there is no real delimiter.
What would be the most efficient way to load that content into a pandas DataFrame?
Upvotes: 1
Views: 2504
Reputation: 61987
I think you can strip the [ from the beginning of each line and the ] from the end of each line, and then read the result with the delimiter ][.
import pandas as pd
from io import StringIO

s = '''[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]'''
f = StringIO(s)
# strip the outer brackets from each line; the inner '][' pairs become the delimiter
s1 = ''.join(line.lstrip('[').rstrip('\n]') + '\n' for line in f)
pd.read_csv(StringIO(s1), sep=r'\]\[', engine='python')
Output
Header1 Header2 Header3 HeaderN
0 ======= ======= ======= =======
1 Value1 Value2 Value3 ValueN
2 AnotherValue1 ValuesCanBeEmpty NaN NaN
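If the [=======] separator row is never needed, a minimal variant of the same approach (a sketch, using the standard skiprows parameter of read_csv) drops it at parse time instead of keeping it as a data row:

```python
import pandas as pd
from io import StringIO

s = '''[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]'''

# strip the outer brackets; the remaining '][' pairs act as the delimiter
s1 = ''.join(line.lstrip('[').rstrip('\n]') + '\n' for line in StringIO(s))

# skiprows=[1] skips the physical line holding the [=======] separators
df = pd.read_csv(StringIO(s1), sep=r'\]\[', engine='python', skiprows=[1])
```

Empty [] fields come through as NaN, so no cleanup pass is needed afterwards.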
Upvotes: 1
Reputation: 863146
You can use read_csv with the separator ][, which has to be escaped as \]\[. Then clean the column names and the values with replace, and remove the row that is all NaN with dropna:
import pandas as pd
import numpy as np
from pandas.compat import StringIO  # on Python 3 you can also use: from io import StringIO
temp=u"""[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\]\[", engine='python')
df.columns = df.columns.to_series().replace([r'^\[', r'\]$'], ['', ''], regex=True)
df = df.replace([r'^\[', r'\]$', '=', ''], ['', '', np.nan, np.nan], regex=True)
df = df.dropna(how='all')
print (df)
Header1 Header2 Header3 HeaderN
1 Value1 Value2 Value3 ValueN
2 AnotherValue1 ValuesCanBeEmpty NaN NaN
print (df.columns)
Index(['Header1', 'Header2', 'Header3', 'HeaderN'], dtype='object')
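An alternative sketch, not from either answer: extract the bracketed values of each line with a regular expression and build the DataFrame directly (the row indices 0 and 2 assume the header/separator layout shown in the question):

```python
import re
import numpy as np
import pandas as pd
from io import StringIO

temp = """[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]"""

# each line becomes a list of its bracketed values; '[]' yields ''
rows = [re.findall(r'\[([^\]]*)\]', line) for line in StringIO(temp)]

# rows[0] is the header, rows[1] the '=======' separator, the rest is data
df = pd.DataFrame(rows[2:], columns=rows[0]).replace('', np.nan)
```

This avoids the escaping of ][ entirely, at the cost of reading the file into memory line by line before pandas sees it.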
Upvotes: 1