Reputation: 1117
Importing a CSV as a pandas dataframe and dropping all completely empty columns:
import pandas as pd
df1 = pd.read_csv("name.csv")
df1 = df1.dropna(axis=1,how='all')
Alas, one column looks like:
'Background\r\n * find it: IDE-3: Some Name\r\n * Dokument: SomeName.pptx\r\n * Field: TEG-33\r\n * happy: Done\r\n\r\nh3. Definition\r\n\r\n\xa0tbd.\r\nh3. exists\r\n\r\ncsv\r\nh3. Source\r\n\r\ncsv?\r\n\r\npotentiell?\r\n\r\ntbd\r\nh3. task\r\n\r\ntbd\r\n\r\n\xa0'
Question 1: I would like to remove all \r\n and \r\n\r\ and \r\n\r\n\ and \r\n\r\n\xa0, etc. Can anyone help with a regex? I cannot find a clear pattern.
Question 2: How can I prevent all these various forms of \r\n\r\ (see question 1) from being written when importing the CSV into a pandas data frame in the first place?
After cleaning all rows of the mentioned column in the data frame, the end result should look like
(Python 3, Anaconda3 Distribution, on Windows 10)
Upvotes: 2
Views: 3423
Reputation: 24301
This regex will achieve what you want:
(\r\n)+(\r)*(\xa0)*
Explanation:
(\r\n)+ # One or more copies of '\r\n'
(\r)* # Any extra appended '\r'
(\xa0)* # Any final appended '\xa0'
Though note that in your example there are no strings of the form \r\n...\r, i.e. with a final appended \r.
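As a rough sketch of how this pattern could be applied to a data frame column: the column name "Column 3" and the sample row below are assumptions for illustration, not taken from the real file. Replacing each match with a single space (rather than an empty string) is also a choice made here so that neighbouring words do not run together.

```python
import pandas as pd

# Hypothetical one-row frame; "Column 3" is an assumed column name.
df1 = pd.DataFrame({
    "Column 3": [
        "Background\r\n * Field: TEG-33\r\n\r\nh3. Definition\r\n\r\n\xa0tbd.\r\n\r\n\xa0"
    ]
})

# Same pattern as above, written with non-capturing groups:
# one or more '\r\n', then any stray '\r', then any '\xa0'.
pattern = r"(?:\r\n)+\r*\xa0*"

# regex=True is passed explicitly because the default differs between pandas versions.
cleaned = df1["Column 3"].str.replace(pattern, " ", regex=True).str.strip()
```

Assign the result back with df1["Column 3"] = cleaned if the cleaned text should replace the original column.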
Upvotes: 1
Reputation: 325
For question 1:
(df1['Column 3']
.str.replace('\r','')
.str.replace('\n','')
.str.replace('\xa0', ''))
For question 2: You could clean the data as it goes into the CSV, but it is hard to say without knowing where the data comes from!
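If the source of the CSV cannot be changed, one alternative (not the approach described above, just a sketch) is to clean the cells while reading, using the converters argument of pd.read_csv. The in-memory CSV, the column name "Column 3" and the helper clean_cell are all made up for illustration.

```python
import io
import re

import pandas as pd

# Hypothetical in-memory CSV standing in for name.csv; column names are assumptions.
csv_text = (
    'id,Column 3\r\n'
    '1,"Background\r\n * happy: Done\r\n\r\nh3. task\r\n\r\ntbd\r\n\r\n\xa0"\r\n'
)

# Any run of carriage returns, line feeds and non-breaking spaces.
LINE_NOISE = re.compile(r"[\r\n\xa0]+")


def clean_cell(value):
    # Collapse each run of line-break noise into one space, then trim the ends.
    return LINE_NOISE.sub(" ", value).strip()


# The converter is called on every raw cell of "Column 3" during parsing.
df1 = pd.read_csv(io.StringIO(csv_text), converters={"Column 3": clean_cell})
```

With a real file, the io.StringIO(csv_text) argument would simply be replaced by the path, e.g. "name.csv".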
Upvotes: 1