Reputation: 1117

Cleaning single Column within pandas Dataframe

Importing a CSV as a pandas dataframe and dropping all completely empty columns:

import pandas as pd 

df1 = pd.read_csv("name.csv") 
df1 = df1.dropna(axis=1,how='all')

Alas, one column looks like:

'Background\r\n * find it: IDE-3: Some Name\r\n * Dokument: SomeName.pptx\r\n * Field: TEG-33\r\n  * happy: Done\r\n\r\nh3. Definition\r\n\r\n\xa0tbd.\r\nh3. exists\r\n\r\ncsv\r\nh3. Source\r\n\r\ncsv?\r\n\r\npotentiell?\r\n\r\ntbd\r\nh3. task\r\n\r\ntbd\r\n\r\n\xa0'

Question 1: I would like to remove all \r\n, \r\n\r\, \r\n\r\n\ and \r\n\r\n\xa0 sequences, etc. Can anyone help with a regex? I cannot find a clear pattern.

Question 2: How can I prevent these various forms of \r\n\r\ (see question 1) from being written when importing the CSV into a pandas data frame in the first place?

After cleaning all rows of the mentioned column in the data frame, the end result should look like

(Python 3, Anaconda3 Distribution, on Windows 10)

Upvotes: 2

Answers (2)

iacob

Reputation: 24301

Question 1

This regex will achieve what you want:

(\r\n)+(\r)*(\xa0)*

Explanation:

(\r\n)+  # One or more copies of '\r\n'
(\r)*    # Any extra appended    '\r'
(\xa0)*  # Any final appended    '\xa0'

Though note that in your example there are no strings of the form \r\n...\r i.e. with a final appended \r.
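As a rough sketch of how this pattern could be applied to the column: the snippet below builds a hypothetical one-row frame from a shortened piece of the sample text and uses the column name 'Column 3' purely for illustration. Replacing each match with a single space (rather than an empty string) and stripping the ends are assumptions, made so that neighbouring words do not run together.

```python
import pandas as pd

# Hypothetical frame; the cell is a shortened excerpt of the sample value.
df = pd.DataFrame(
    {"Column 3": ["h3. Definition\r\n\r\n\xa0tbd.\r\nh3. exists\r\n\r\ncsv\r\n\r\n\xa0"]}
)

# The pattern from above, passed as a raw string so the re module
# interprets the \r, \n and \xa0 escapes itself.
pattern = r"(\r\n)+(\r)*(\xa0)*"

# Assumption: substitute one space per match, then trim leading/trailing whitespace.
cleaned = df["Column 3"].str.replace(pattern, " ", regex=True).str.strip()

print(cleaned.iloc[0])
```

To keep the result, assign it back, e.g. df["Column 3"] = cleaned.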

Upvotes: 1

Nick Tallant

Reputation: 325

For question 1:

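# .str.replace returns a new Series, so assign the result back to the column.
# regex=False treats each argument as a literal string.
df1['Column 3'] = (df1['Column 3']
    .str.replace('\r', '', regex=False)
    .str.replace('\n', '', regex=False)
    .str.replace('\xa0', '', regex=False))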

For question 2: You could clean that data as it goes into the csv - but hard to say without knowing where the data comes from!
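If the source of the file cannot be changed, one possible alternative is to clean the cells while they are being read, using the converters argument of pd.read_csv. This is only a sketch: the in-memory CSV, the helper clean_cell, and the column name 'Column 3' are made up for illustration, and the characters are still present in the file itself; they are just stripped before they reach the data frame.

```python
import io
import re
import pandas as pd

# Runs of carriage returns, newlines and non-breaking spaces.
LINE_NOISE = re.compile(r"[\r\n\xa0]+")

def clean_cell(value):
    """Hypothetical helper: collapse each run of noise into one space."""
    return LINE_NOISE.sub(" ", value).strip()

# Stand-in for name.csv: a quoted field with embedded line breaks.
raw = 'id,Column 3\r\n1,"h3. Source\r\n\r\ncsv?\r\n\r\n\xa0"\r\n'

# converters maps a column name to a function applied to every cell of that column.
df = pd.read_csv(io.StringIO(raw), converters={"Column 3": clean_cell})

print(df.loc[0, "Column 3"])
```

With a real file, io.StringIO(raw) would be replaced by the path, e.g. "name.csv".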

Upvotes: 1
