Guillaume
Guillaume

Reputation: 6009

Python pandas read dataframe from custom file format

Using Python 3 and pandas 0.19.2

I have a log file formatted this way:

[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]
...

...which is very much like a CSV excepted that each value is surrounded by [ and ] and there is no real delimiter. What would be the most efficient way to load that content into a pandas DataFrame ?

Upvotes: 1

Views: 2504

Answers (2)

Ted Petrou
Ted Petrou

Reputation: 61987

I think you can remove ] from the beginning of each line and [ from the end of each line and then read with the delimiter ][.

s ='''[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]'''
f = StringIO(s)

s1 = ''.join([line.lstrip('[').rstrip('\n]') + '\n' for line in f.readlines()])
pd.read_csv(StringIO(s1), sep='\]\[', engine='python')

Output

         Header1           Header2  Header3  HeaderN
0        =======           =======  =======  =======
1         Value1            Value2   Value3   ValueN
2  AnotherValue1  ValuesCanBeEmpty      NaN      NaN

Upvotes: 1

jezrael
jezrael

Reputation: 863146

You can use read_csv with separator ][ which has to be escape by \. Then replace columns and values and remove row with all NaN by dropna:

import pandas as pd
from pandas.compat import StringIO

temp=u"""[Header1][Header2][Header3][HeaderN]
[=======][=======][=======][=======]
[Value1][Value2][Value3][ValueN]
[AnotherValue1][ValuesCanBeEmpty][][]"""

#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="\]\[", engine='python')
df.columns = df.columns.to_series().replace(['^\[', '\]$'],['',''], regex=True)
df = df.replace(['^\[', '\]$', '=', ''], ['', '', np.nan, np.nan], regex=True)
df = df.dropna(how='all')
print (df)
         Header1           Header2 Header3 HeaderN
1         Value1            Value2  Value3  ValueN
2  AnotherValue1  ValuesCanBeEmpty     NaN     NaN

print (df.columns)
Index(['Header1', 'Header2', 'Header3', 'HeaderN'], dtype='object')

Upvotes: 1

Related Questions