Reputation: 15
I have a robot that brings me an html code like this:
<div class="std">
<p>CAR:
<span>Onix</span>
</p>
<p>MODEL: LTZ</p>
<p>
<span>COLOR:
<span>Black</span>
</p>
<p>ACESSORIES:
<span>ABS</span>
</p>
<p>
<span>DESCRIPTION:</span>
<span>The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.</span>
</p>
<p>TECHNICAL DETAIL:
<span>The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..</span>
</p>
</div>
I applied the code below to remove the HTML tags:
cleanr = re.compile('<.*?>')
cleantext = re.sub(cleanr,'\n', html_code).strip()
It returns to me:
CAR: Onix
MODEL: LTZ
COLOR:
Black
ACESSORIES:
ABS
DESCRIPTION:
The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL:
The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
Now I need to remove the line breaks to have something like this:
CAR: Onix
MODEL: LTZ
COLOR: Black
ACESSORIES: ABS
DESCRIPTION: The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL: The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
I tried this code below, but it doesn't match the line breaks correctly:
cleantext = re.sub(r':\s*[\r\n]*', ': ', cleantext)
I tried this another code also:
cleantext = cleantext.replace(': \n', ': ')
It doesn't work also. How can I manage this?
Upvotes: 0
Views: 236
Reputation: 1877
I think there are two parts to your problem,
First is to Join the string in two lines like below
COLOR:
Black
to
COLOR: black
and then remove all empty lines
For first part you could use replace your re.sub
with following
cleantext = re.sub(r'(.*):\s*[\r\n](.*)', '\g<1>: \g<2>', cleantext)
And for removing empty lines it would be tricky to do it via re.sub so I would suggest to use
cleantext = "\n".join([line for line in cleantext.split('\n') if line.strip() != ''])
This would give you answer as expected
Upvotes: 1
Reputation: 905
I think this should work for you
>>> string = """
CAR: Onix
MODEL: LTZ
COLOR:
Black
ACESSORIES:
ABS
DESCRIPTION:
The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL:
The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
"""
>>> list_string = string.split("\n\n\n")
>>> for each in list_string:
print each.replace("\n","").strip()
CAR: Onix
MODEL: LTZ
COLOR:Black
ACESSORIES:ABS
DESCRIPTION:
The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL:The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..
Upvotes: 0