Reputation: 3744
I tried this code:
import re
re.sub('\r\n\r\n','','Summary_csv.csv')
It did not do anything. As in, it did not even touch the file (there is no modification to the date and time of the file after running this code). Could anyone please explain why?
Then I tried this:
import re
output = open("Summary.csv","w", encoding="utf8")
input = open("Summary_csv.csv", encoding="utf8")
for line in input:
    output.write(re.sub('\r\n\r\n','', line))
input.close()
output.close()
This one does something to the file, as in the modified date and time of the file change after I run this code, but it does not remove the consecutive newlines, and the output is the same as the original file.
EDIT: This is a small sample from the original csv file:
"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted. Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary.
Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1. These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick. (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)
"
"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.
The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.
Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)
"
I want the output to be the following:
"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted. Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary. Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1. These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick. (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)"
"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)"
Upvotes: 0
Views: 985
Reputation: 4855
The answer to your first question is that re.sub is being applied to the string 'Summary_csv.csv', not to the file. It expects a string as the third argument, performs the substitution on that string, and returns the result as a new string; it never opens or modifies a file.
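As a minimal sketch of the difference: the first call below substitutes inside the file name itself, while the helper reads the file's contents, substitutes on them, and writes them back. The name `collapse_blank_lines` and the throwaway demo file are illustrative assumptions, not part of the question's code.

```python
import re
import tempfile
from pathlib import Path

# re.sub works on the string it is given; here that string is just the file name.
unchanged = re.sub('\r\n\r\n', '', 'Summary_csv.csv')

def collapse_blank_lines(path):
    """Hypothetical helper: substitute on the file's contents and write them back."""
    p = Path(path)
    text = p.read_text(encoding="utf8")
    # In text mode line endings arrive as '\n', so match runs of '\n'.
    text = re.sub(r'\n{2,}', '\n', text)
    p.write_text(text, encoding="utf8")
    return text

# Demonstration on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_text("a\n\nb\n\n\nc\n", encoding="utf8")
    result = collapse_blank_lines(demo)
```

Reading the whole file at once is fine for small files; for very large ones a line-by-line filter is the safer choice.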
In the second piece of code, you open the file and read it one line at a time. This means that no line will ever contain two newlines. Two newlines will result in two consecutive lines being returned from the input file with the second line being empty.
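To illustrate why a per-line search cannot find two newlines, here is a small sketch that uses io.StringIO as a stand-in for an open text file (the sample content is made up for the demonstration):

```python
import io

# Stand-in for an open text file: two records separated by a blank line
fake_file = io.StringIO("first\n\nsecond\n")

# Iteration yields one line at a time, each ending in at most one '\n'
lines = list(fake_file)

# The blank line arrives as its own item, so '\n\n' never appears inside a line
has_double = any('\n\n' in line for line in lines)
```

Here `lines` holds three items, the middle one being just the newline of the blank line, and `has_double` is False.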
To get rid of the extra newlines, just test for a blank line and don't write it to the output. Calling line.strip() on a blank line (one containing only whitespace characters) will return an empty string, which evaluates to False in an if statement. If line.strip() isn't empty, then write the line to your output file.
output = open("Summary.csv","w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")
for line in infile:
    if line.strip():
        output.write(line)
infile.close()
output.close()
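The same filter can also be written with `with` blocks, which close both files even if an error occurs partway through. This is only a sketch: the function name `drop_blank_lines` and the temporary demo files are assumptions made for illustration.

```python
import os
import tempfile

def drop_blank_lines(src, dst):
    """Hypothetical helper: copy src to dst, skipping whitespace-only lines."""
    with open(src, encoding="utf8") as infile, \
         open(dst, "w", encoding="utf8") as outfile:
        for line in infile:
            if line.strip():
                outfile.write(line)

# Demonstration with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "w", encoding="utf8") as f:
        f.write("a\n\n   \nb\n")
    drop_blank_lines(src, dst)
    with open(dst, encoding="utf8") as f:
        cleaned = f.read()
```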
Note: Python treats text files in a platform-independent way and, by default, converts line endings to '\n' when reading, so testing for '\r\n' wouldn't work even without the other problems. If you really want the lines to keep their '\r\n' endings, you must specify newline='\r\n' when you call open() for the input file. See the documentation at https://docs.python.org/3/library/functions.html#open for a full explanation.
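A short sketch of that behaviour, assuming a throwaway file that is written in binary mode so it really contains '\r\n' on every platform:

```python
import os
import tempfile

# Write raw bytes so the file really contains Windows-style line endings
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"a\r\n\r\nb\r\n")

# Default text mode: '\r\n' is translated to '\n' on read
with open(path, encoding="utf8") as f:
    translated = f.read()

# newline='\r\n': lines end only at '\r\n' and are returned untranslated
with open(path, encoding="utf8", newline="\r\n") as f:
    untranslated = f.read()

os.remove(path)
```

With the default settings a pattern containing '\r\n' has nothing to match, while the second read still contains the original endings.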
With the example input and output files posted by the OP, it appears that the problem is more complex than stripping extra newlines. The following code reads the input file, finds the text between pairs of " characters, and combines all of the lines of each such paragraph onto a single line in the output file. Newlines that are not inside a pair of " are sent to the output file unaltered.
import re

outfile = open("Summary.csv", "w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")
text = infile.read()
text = re.sub('\n\n', '\n', text)  # remove double newlines
# with one capturing group, re.split() returns the quoted matches at odd indexes
for i, p in enumerate(re.split(r'(".+?")', text, flags=re.DOTALL)):
    if not p:  # skip empty pieces
        continue
    if i % 2:  # this is a quoted paragraph and should be a single line
        p = p[1:-1]  # get everything between the quotes
        p = p.strip()  # remove leading and trailing whitespace
        p = re.sub('\n+', ' ', p)  # replace any remaining \n with a space
        p = '"' + p + '"\n'  # put the " back around the paragraph and add a newline
    outfile.write(p)
infile.close()
outfile.close()
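An alternative sketch of the same idea uses re.sub with a replacement function instead of re.split, which makes it easy to try on a string before touching any file. The helper name `join_quoted_paragraphs` and the sample text are made up for illustration, and the sketch assumes the fields contain no embedded " characters (for example CSV-style doubled quotes).

```python
import re

def join_quoted_paragraphs(text):
    """Hypothetical helper: collapse each "..." block onto a single line."""
    def one_line(match):
        inner = match.group(1).strip()
        # Replace each run of newlines (plus surrounding whitespace) with one space
        inner = re.sub(r'\s*\n\s*', ' ', inner)
        return '"' + inner + '"'
    # DOTALL lets '.' match newlines, so a quoted block may span several lines
    return re.sub(r'"(.+?)"', one_line, text, flags=re.DOTALL)

sample = '"First line.\nSecond line. (....)\n"\n"Another\n\nparagraph.\n"\n'
joined = join_quoted_paragraphs(sample)
```

Text outside the quotes, including the newline between the two paragraphs, is left as it is.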
Upvotes: 2