Chris

Reputation: 22237

Simple regex problem: Removing all new lines from a file

I'm becoming acquainted with python and am creating problems in order to help myself learn the ins and outs of the language. My next problem comes as follows:

I have copied and pasted a huge slew of text from the internet, but the copy and paste added several new lines that break up the huge string. I wish to programmatically remove all of these and return the string to one giant blob of characters. This is obviously a job for regex (I think); parsing through the file and removing every instance of the newline character sounds like it should work, but it isn't going over all that well for me.

Is there an easy way to go about this? It seems rather simple.

Upvotes: 24

Views: 77147

Answers (6)

Alix Axel

Reputation: 154513

import re
re.sub(r"\n", "", file_contents_here)
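For a self-contained, runnable version of this substitution (the sample string below is made up, standing in for the pasted file contents), note that pattern `[\r\n]+` also catches Windows-style carriage returns:

```python
import re

# Sample text standing in for pasted file contents (hypothetical).
pasted = "first chunk\nsecond chunk\r\nthird chunk\n"

# Strip newlines (and Windows carriage returns) in one pass.
blob = re.sub(r"[\r\n]+", "", pasted)
print(blob)  # first chunksecond chunkthird chunk
```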

Upvotes: 4

Timothy C. Quinn

Reputation: 4475

The other answers using <string>.replace('\n', '') show the correct way to remove all newlines (note that '\n' is a line feed; a carriage return is '\r').

If instead you want to collapse redundant blank lines, for debugging etc., here is how:

import re
re.sub(r"\n{2,}", "\n", _your_string).strip()
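A quick check of the collapsing behaviour (the sample string is made up); runs of two or more newlines are squeezed down to a single one:

```python
import re

# Made-up string with redundant blank lines between paragraphs.
messy = "alpha\n\n\n\nbeta\n\ngamma"

# Replace every run of 2+ newlines with a single newline.
collapsed = re.sub(r"\n{2,}", "\n", messy).strip()
print(collapsed)  # alpha, beta, gamma on three separate lines
```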

Upvotes: 0

bgibson

Reputation: 19105

Old question, but since it was in my search results for a similar query, and no one has mentioned the Python string methods strip() / lstrip() / rstrip(), I'll just add that for posterity (and for anyone who prefers not to use re when it isn't necessary):

with open('infile.txt') as old:
    stripped = [line.strip() for line in old]
with open('outfile.txt', 'w') as new:
    new.write("".join(stripped))
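One caveat worth knowing: strip() trims leading and trailing spaces and tabs as well as newlines; if you only want the line endings gone, rstrip('\n') is the narrower tool (sample string is made up):

```python
line = "  indented text\n"

# strip() removes ALL surrounding whitespace, including the indentation.
print(repr(line.strip()))       # 'indented text'

# rstrip('\n') removes only the trailing newline, keeping the indentation.
print(repr(line.rstrip("\n")))  # '  indented text'
```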

Upvotes: 0

Cascabel

Reputation: 496772

I know this is a python learning problem, but if you're ever trying to do this from the command-line, there's no need to write a python script. Here are a couple of other ways:

cat $FILE | tr -d '\n'

awk '{printf("%s", $0)}' $FILE

Neither of these has to read the entire file into memory, so if you've got an enormous file to process, they might be better than the python solutions provided.
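If you want to stay in Python but keep that same streaming behaviour, here is a sketch (the function name and file paths are my own, not from the answers above) that never holds more than one line in memory:

```python
def strip_newlines(src_path, dst_path):
    # Stream line by line so the whole file never sits in memory at once.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.rstrip("\n"))
```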

Upvotes: 3

Alex Martelli

Reputation: 881567

The two main alternatives: read everything in as a single string and remove newlines:

clean = open('thefile.txt').read().replace('\n', '')

or, read line by line, removing the newline that ends each line, and join it up again:

clean = ''.join(l[:-1] for l in open('thefile.txt'))

The former alternative is probably faster, but, as always, I strongly recommend you MEASURE speed (e.g., use python -mtimeit) in cases of your specific interest, rather than just assuming you know how performance will be. REs are probably slower, but, again: don't guess, MEASURE!

So here are some numbers for a specific text file on my laptop:

$ python -mtimeit -s"import re" "re.sub('\n','',open('AV1611Bible.txt').read())"
10 loops, best of 3: 53.9 msec per loop
$ python -mtimeit "''.join(l[:-1] for l in open('AV1611Bible.txt'))"
10 loops, best of 3: 51.3 msec per loop
$ python -mtimeit "open('AV1611Bible.txt').read().replace('\n', '')"
10 loops, best of 3: 35.1 msec per loop

The file is a version of the KJ Bible, downloaded and unzipped from here (I do think it's important to run such measurements on one easily fetched file, so others can easily reproduce them!).

Of course, a few milliseconds more or less on a file of 4.3 MB, 34,000 lines, may not matter much to you one way or another; but as the fastest approach is also the simplest one (far from an unusual occurrence, especially in Python;-), I think that's a pretty good recommendation.

Upvotes: 37

RichieHindle

Reputation: 281385

I wouldn't use a regex for simply replacing newlines - I'd use str.replace(). Here's a complete script:

f = open('input.txt')
contents = f.read()
f.close()
new_contents = contents.replace('\n', '')
f = open('output.txt', 'w')
f.write(new_contents)
f.close()
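On Python 2.5+ the same script reads a little more safely with `with` blocks, which close the files even if an exception occurs; the small input file below is created purely for the demo:

```python
# Create a small sample input file for the demo (stands in for real data).
with open('input.txt', 'w') as f:
    f.write('one\ntwo\nthree\n')

# Same replace-based logic; the context managers close the files automatically.
with open('input.txt') as f:
    new_contents = f.read().replace('\n', '')

with open('output.txt', 'w') as f:
    f.write(new_contents)
```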

Upvotes: 10
