rogerwhite

Reputation: 345

Python Pandas Quotation Marks

I am trying to load a large log file into pandas, but the file is not uniform: it contains legacy formats and junk. When loading the data, can I remove the first character of a row if it is a quotation mark (")?

I am aware I could pre-clean the data before passing it to pandas, but that seems inefficient. I would rather do it using pandas.

Code:

df = pd.read_csv(file, sep='\n', header=None, engine='python', chunksize=10000)
df = df[0].str.strip().str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
print(df)

Data:

"[email protected]:datahere2    :  this row will throw an error
[email protected]:datahere2
[email protected]:datahere2

Upvotes: 0

Views: 114

Answers (1)

Stef

Reputation: 30579

Use read_csv with quoting=csv.QUOTE_NONE (the integer 3) so that quotation marks are read as ordinary characters, and then strip them from each row:

df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
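For reference, here is a minimal, self-contained sketch of the same strip-then-split step. It is only an illustration under a few assumptions: the sample addresses and the in-memory `raw` buffer are made-up placeholders, not the real data, and the lines are loaded into a one-column Series directly rather than through read_csv, since some newer pandas releases appear to reject `sep='\n'` and expect `n` to be passed by keyword.

```python
import io

import pandas as pd

# Hypothetical sample rows shaped like the ones in the question;
# the addresses are placeholders, not the original data.
raw = io.StringIO(
    '"user1@example.com:datahere1    :  trailing junk\n'
    'user2@example.com:datahere2\n'
    'user3@example.com;datahere3\n'
)

# One raw line per row. No CSV parsing happens here, so a leading
# quotation mark has no special meaning and cannot swallow later rows.
lines = pd.Series(raw.read().splitlines())

# Strip surrounding spaces, tabs and quotation marks, then split once
# on the first run of separator characters.
parts = (
    lines.str.strip(' \t"')
    .str.split(r'[,|;: \t]+', n=1, expand=True)
    .rename(columns={0: 'email', 1: 'data'})
)
print(parts)
```

Because the split is limited to `n=1`, anything after the first separator run (including further colons) stays in the `data` column. For a real file, the same Series could be built from `open(path)` instead of the `io.StringIO` buffer.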

Upvotes: 1
