Python Pandas Quotation Marks

Question

I am trying to load a large log file into pandas, but the file is not uniform: it contains legacy formats and junk rows. Before I load the data into pandas, can I remove the first character of a row if it is a quotation mark (")?

I am aware I could pre-clean the data before loading it into pandas, but that seems inefficient. I would rather do it using pandas.

Code:

df = pd.read_csv(file, sep='\n', header=None, engine='python', chunksize=10000)
df = df[0].str.strip().str.split('[,|;: 	]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
print(df)

Data:

"email1@foo.com:datahere2    :  this row will throw an error
email2@foo.com:datahere2
email3@foo.com:datahere2

Stef · Accepted Answer

Pass quoting=3 (csv.QUOTE_NONE) to read_csv so that quotation marks are treated as ordinary characters, then strip them:

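import pandas as pd
# quoting=3 is csv.QUOTE_NONE, so a leading " no longer opens a quoted field
df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
# strip spaces, tabs and quotes from both ends, then split once on the first run of delimiters
df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})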
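To see what the strip-then-split step does, here is a minimal sketch run on the three sample rows from the question. It is an illustration under assumptions, not the answer's exact code: the rows are held in an in-memory buffer instead of a file, the names raw, lines, result and leading_only are made up for the example, and the lines are loaded with splitlines() rather than read_csv, since as far as I know some pandas versions refuse '\n' as a separator.

```python
import io

import pandas as pd

# Sample rows copied from the question; a real run would open the log file instead.
raw = io.StringIO(
    '"email1@foo.com:datahere2    :  this row will throw an error\n'
    'email2@foo.com:datahere2\n'
    'email3@foo.com:datahere2\n'
)

# One Series element per line. No CSV parsing happens here,
# so a stray quotation mark cannot open a quoted field.
lines = pd.Series(raw.read().splitlines())

# Strip spaces, tabs and double quotes from both ends of each line,
# then split once on the first run of delimiter characters.
result = (
    lines.str.strip(' \t"')
    .str.split(r'[,|;: \t]+', n=1, expand=True)
    .rename(columns={0: 'email', 1: 'data'})
)
print(result)

# Variant: remove only a quotation mark at the very start of the line,
# leaving any quotes elsewhere (including at the end) untouched.
leading_only = lines.str.replace(r'^"', '', regex=True)
```

Note that str.strip(' \t"') also removes quotation marks at the end of a line. If only the first character should go, as the question literally asks, the leading_only variant is closer to that.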
