Reputation: 345
I am trying to load a large log file into pandas, but the file is not uniform: it contains legacy and junk rows. Before I load the data into pandas, can I remove the first character of a row if it is a quotation mark (")?
I am aware I could pre-clean the data before loading it into pandas. However, that seems inefficient. I would rather do it using pandas.
Code:
df = pd.read_csv(file, sep='\n', header=None, engine='python', chunksize=10000)
df = df[0].str.strip().str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
print(df)
Data:
"[email protected]:datahere2 : this row will throw an error
[email protected]:datahere2
[email protected]:datahere2
Upvotes: 0
Views: 114
Reputation: 30579
Use read_csv with QUOTE_NONE (3) and then strip the quotation marks:
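# quoting=3 is csv.QUOTE_NONE: a leading " is kept as ordinary text instead of opening a quoted field
df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
# n=1 is passed by keyword because newer pandas versions no longer accept it positionally
df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})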
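If the file is too large to load at once, the same idea can probably be applied chunk by chunk. The following is only a sketch under a few assumptions: the sample rows are made up for illustration (they are not from the original log), parse_chunks is a hypothetical helper name, and the separator '\x01' is assumed never to occur in the data, so that each line ends up in a single column. It is used here instead of sep='\n' because newer pandas releases may reject a newline as separator.

```python
import csv
import io

import pandas as pd

# Made-up sample rows; the first one starts with a stray quotation mark.
SAMPLE = (
    '"alice@example.com:datahere1 : trailing note\n'
    'bob@example.com:datahere2\n'
    'carol@example.com;datahere3\n'
)


def parse_chunks(source, chunksize=10000):
    """Read one line per row, strip stray quotes, split into email/data."""
    reader = pd.read_csv(
        source,
        sep="\x01",              # assumption: this character never occurs in the log
        header=None,
        dtype=str,               # keep every line as text
        quoting=csv.QUOTE_NONE,  # same as quoting=3: quotes are ordinary characters
        chunksize=chunksize,     # returns an iterator of DataFrames
    )
    parts = []
    for chunk in reader:
        split = (
            chunk[0]
            .str.strip(' \t"')                           # drop spaces, tabs and quotes at both ends
            .str.split(r"[,|;: \t]+", n=1, expand=True)  # split on the first delimiter run only
            .rename(columns={0: "email", 1: "data"})
        )
        parts.append(split)
    return pd.concat(parts, ignore_index=True)


# A tiny chunksize is used here only to exercise the chunked path.
df = parse_chunks(io.StringIO(SAMPLE), chunksize=2)
print(df)
```

With chunksize set, read_csv returns an iterator rather than a DataFrame, so the string operations have to run on each chunk inside the loop. For a real file, pass the path instead of the StringIO object.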
Upvotes: 1