Learner

Reputation: 672

Removing whitespace and carriage return from a text file with Python

I have a dataframe with 5 columns. While cleaning the data, I ran into a problem caused by carriage returns in the text file, as shown in the example below.

Input :

001|Baker St.
London|3|4|7
002|Penny Lane
Liverpool|88|5|7

Expected output:

001|Baker St. London|3|4|7
002|Penny Lane Liverpool|88|5|7

Any suggestions are welcome.

Upvotes: 0

Views: 1837

Answers (3)

mijiturka

Reputation: 494

The built-in strip() method on string objects does this. Call it on each line as you iterate:

cleaned_up_line = line.strip()

As the Python str.strip() documentation explains, calling it with no arguments removes whitespace (spaces, tabs, newlines, and carriage returns) from the beginning and end of a string. Characters in the middle of the string are left alone.

For example:

In [7]: with open('file', 'r') as f:
   ...:     a = f.readlines()
   ...:     print(a)
   ...:
['the\n', 'file\n\r', 'is\n\r', 'here\n', '\n']

In [8]: with open('file', 'r') as f:
   ...:     a = [line.strip() for line in f.readlines()]
   ...:     print(a)
   ...:
['the', 'file', 'is', 'here', '']
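Applied to the sample in the question, strip() alone cleans each line but does not rejoin a record that was split in two. A minimal sketch of one way to do that follows. It assumes every record has exactly 5 pipe-separated fields (as the question states) and that no field contains a literal "|"; the helper name merge_broken_lines is made up for illustration.

```python
def merge_broken_lines(lines, n_fields=5, sep="|"):
    """Strip each physical line and glue lines together until a
    record has n_fields fields (hypothetical helper, not a library call)."""
    records = []
    buffer = ""
    for line in lines:
        piece = line.strip()  # drops the trailing \n, \r and spaces
        if not piece:
            continue  # skip blank lines
        # Join the continuation with a single space, as in the expected output
        buffer = f"{buffer} {piece}" if buffer else piece
        # n_fields fields means n_fields - 1 separators
        if buffer.count(sep) >= n_fields - 1:
            records.append(buffer)
            buffer = ""
    if buffer:
        records.append(buffer)  # keep an incomplete trailing record
    return records


sample = [
    "001|Baker St.\n",
    "London|3|4|7\n",
    "002|Penny Lane\n",
    "Liverpool|88|5|7\n",
]
records = merge_broken_lines(sample)
print(records)
# ['001|Baker St. London|3|4|7', '002|Penny Lane Liverpool|88|5|7']
```

Counting separators is only one possible rule for deciding where a record ends; if the field count varies, a different rule would be needed.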

Upvotes: 1

Andreas

Reputation: 9197

You can replace the \r like this:

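# newline="" turns off universal-newline translation, so '\r' actually reaches replace()
with open("your.csv", "r", newline="") as myfile:
    data = myfile.read().replace('\r', '')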

Example:

from io import StringIO

# second entry contains a carriage return \r
s = """91|AAA|2010|3
92|BB\rB|2011|4
93|CCC|2012|5
"""

# StringIO simulates a loaded csv file:

# carriage return still there
StringIO(s).read()
# '91|AAA|2010|3\n92|BB\rB|2011|4\n93|CCC|2012|5\n'

# carriage return gone
StringIO(s).read().replace('\r', '')
# '91|AAA|2010|3\n92|BBB|2011|4\n93|CCC|2012|5\n'

With Pandas:

import pandas as pd

data = StringIO(StringIO(s).read().replace('\r', ''))
pd.read_csv(data, sep='|')  # the first row becomes the header; pass header=None to keep it as data

Out[35]: 
   91  AAA  2010  3
0  92  BBB  2011  4
1  93  CCC  2012  5
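One caveat when moving from StringIO to a real file: in Python 3's default text mode, open() translates '\r' and '\r\n' to '\n' before your code sees them, so a stray carriage return shows up as an extra line break instead. The sketch below writes a small temporary file (the content is just the second row from the example) and compares the default mode with newline="", which leaves the characters untouched.

```python
import os
import tempfile

# Write raw bytes so the carriage return lands in the file unchanged
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"92|BB\rB|2011|4\n")
    path = tmp.name

# Default text mode: universal newlines turn '\r' into '\n'
with open(path, "r") as f:
    default_mode = f.read()

# newline="" disables the translation, so '\r' is still there to replace
with open(path, "r", newline="") as f:
    untranslated = f.read()

os.remove(path)

cleaned = untranslated.replace("\r", "")
print(repr(default_mode))   # '92|BB\nB|2011|4\n'
print(repr(untranslated))   # '92|BB\rB|2011|4\n'
print(repr(cleaned))        # '92|BBB|2011|4\n'
```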

Upvotes: 1

C K

Reputation: 16

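You could match it with a regex and remove it, e.g. re.sub('[\r\n]', '', inputline) after import re. Note that this deletes the characters without inserting anything, so use ' ' as the replacement if the joined parts should stay separated by a space.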
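As a rough sketch of the regex approach on the question's data: the exact characters in the file are not shown in the question, so this assumes the break inside a field is a carriage return (optionally followed by a newline) while real records end with a plain '\n'. The sample string raw is constructed under that assumption.

```python
import re

# Assumed layout: "\r\n" inside a field, plain "\n" at the end of each record
raw = "001|Baker St.\r\nLondon|3|4|7\n002|Penny Lane\r\nLiverpool|88|5|7\n"

# Replace each carriage return (and the newline right after it, if any)
# with a single space; record-ending "\n" characters are left alone
cleaned = re.sub(r"\r\n?", " ", raw)

print(cleaned)
# 001|Baker St. London|3|4|7
# 002|Penny Lane Liverpool|88|5|7
```

If the file uses '\r\n' for every line ending, this pattern would merge all records into one line, and a field-count rule would be a better fit.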

Upvotes: 0
