user2006697

Reputation: 1117

Cleaning single Column within pandas Dataframe

Importing a CSV as a pandas dataframe and dropping all completely empty columns:

import pandas as pd 

df1 = pd.read_csv("name.csv") 
df1 = df1.dropna(axis=1,how='all')

Alas, one column looks like this:

'Background\r\n * find it: IDE-3: Some Name\r\n * Dokument: SomeName.pptx\r\n * Field: TEG-33\r\n  * happy: Done\r\n\r\nh3. Definition\r\n\r\n\xa0tbd.\r\nh3. exists\r\n\r\ncsv\r\nh3. Source\r\n\r\ncsv?\r\n\r\npotentiell?\r\n\r\ntbd\r\nh3. task\r\n\r\ntbd\r\n\r\n\xa0'

Question 1: I would like to remove all \r\n and \r\n\r\ and \r\n\r\n\ and \r\n\r\n\xa0, etc. Can anyone help with a regex? I cannot find a clear pattern.

Question 2: How can I prevent all these various forms of \r\n\r\ (see question 1) from being written when importing the CSV into a pandas data frame in the first place?

After cleaning all rows of the mentioned column in the data frame, the end result should look like this: (image not available)

(Python 3, Anaconda3 Distribution, on Windows 10)

Upvotes: 2

Views: 3423

Answers (2)

iacob

Reputation: 24301

Question 1

This regex will achieve what you want:

(\r\n)+(\r)*(\xa0)*

Explanation:

(\r\n)+  # One or more copies of '\r\n'
(\r)*    # Any extra appended    '\r'
(\xa0)*  # Any final appended    '\xa0'

Though note that in your example there are no strings of the form \r\n...\r i.e. with a final appended \r.
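As a rough sketch of how the pattern above could be applied, here it is used with Python's re module on a shortened version of the sample string. The helper name clean_cell is hypothetical, and replacing each match with a single space (rather than with nothing) is an assumption made so that neighbouring words do not run together.

```python
import re

# Pattern from the answer: one or more '\r\n', then any stray '\r',
# then any trailing non-breaking spaces ('\xa0').
LINE_BREAK_RUN = re.compile(r"(\r\n)+(\r)*(\xa0)*")


def clean_cell(text):
    """Hypothetical helper: collapse each matched run into one space."""
    # Assumption: a space is used as the replacement so words stay separated.
    return LINE_BREAK_RUN.sub(" ", text).strip()


# Shortened version of the string shown in the question.
sample = "Background\r\n* Field: TEG-33\r\n\r\nh3. Definition\r\n\r\n\xa0tbd.\r\n\r\n\xa0"
cleaned = clean_cell(sample)
print(cleaned)
```

The same pattern should also work on a whole column through pandas' Series.str.replace with regex=True.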

Upvotes: 1

Nick Tallant

Reputation: 325

For question 1:

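# Assign the result back; .str.replace returns a new Series.
df1['Column 3'] = (df1['Column 3']
    .str.replace('\r', '', regex=False)
    .str.replace('\n', '', regex=False)
    .str.replace('\xa0', '', regex=False))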

For question 2: You could clean the data before it is written to the CSV, but it is hard to say how without knowing where the data comes from!
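If the CSV itself cannot be changed, one possible alternative is to clean the column while it is being read, using the converters argument of pd.read_csv. This is only a sketch: the column names id and notes, the in-memory CSV text, and the helper clean_cell are all made up for illustration. The character class used here is also a simpler substitute for the regex in the other answer, not the same pattern.

```python
import io
import re

import pandas as pd


def clean_cell(text):
    """Hypothetical helper: collapse runs of '\r', '\n' and '\xa0' into one space."""
    return re.sub(r"[\r\n\xa0]+", " ", text).strip()


# Made-up CSV content with line breaks embedded in a quoted field.
raw = 'id,notes\r\n1,"Background\r\n\r\nh3. Definition\r\n\r\n\xa0tbd."\r\n'

# converters maps a column name to a function applied to each raw cell string.
df = pd.read_csv(io.StringIO(raw), converters={"notes": clean_cell})
print(df)
```

With a real file, io.StringIO(raw) would be replaced by the file path, and "notes" by the actual column name.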

Upvotes: 1
