Alin

Reputation: 389

Replace several words in a text with Python

I use the code below to remove all HTML tags from a file and convert it to plain text. I also have to convert XML/HTML character entities to ASCII ones. To do that, I have 21 lines that each read through the whole text, so converting a huge file takes a lot of resources.

Do you have any ideas for making the code faster and more efficient while reducing its resource usage?

# -*- coding: utf-8 -*-
import re

# This file contains HTML.
file = open('input-file.html', 'r')
temp = file.read()

# Replace some XML/HTML character entities with ASCII ones.
temp = temp.replace('&lsquo;',"""'""")
temp = temp.replace('&rsquo;',"""'""")
temp = temp.replace('&ldquo;',"""\"""")
temp = temp.replace('&rdquo;',"""\"""")
temp = temp.replace('&sbquo;',""",""")
temp = temp.replace('&prime;',"""'""")
temp = temp.replace('&Prime;',"""\"""")
temp = temp.replace('&laquo;',"""«""")
temp = temp.replace('&raquo;',"""»""")
temp = temp.replace('&lsaquo;',"""‹""")
temp = temp.replace('&rsaquo;',"""›""")
temp = temp.replace('&amp;',"""&""")
temp = temp.replace('&ndash;',""" – """)
temp = temp.replace('&mdash;',""" — """)
temp = temp.replace('&reg;',"""®""")
temp = temp.replace('&copy;',"""©""")
temp = temp.replace('&trade;',"""™""")
temp = temp.replace('&para;',"""¶""")
temp = temp.replace('&bull;',"""•""")
temp = temp.replace('&middot;',"""·""")

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()

Upvotes: 3

Views: 1358

Answers (3)

Alin

Reputation: 389

The problem with string.translate() and string.maketrans() is that they can only map one character to another single character, e.g.

print string.maketrans("abc","123")

But I need to map an HTML/XML entity such as &lsquo; to the ASCII single quotation mark ('), which means I would have to write:

print string.maketrans("&lsquo;","'")

This raises the following error:

ValueError: maketrans arguments must have same length

If I use HTMLParser instead, it converts all HTML/XML entities to the characters they stand for without the above problem. I have also added an encode('utf-8') call to avoid the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)

# -*- coding: utf-8 -*-
import re
from HTMLParser import HTMLParser

# This file contains HTML.
file = open('input-file.txt', 'r')
temp = file.read()

# Replace all XML/HTML entities with the characters they stand for.
# unescape() does not use any parser state, so a throwaway instance is enough.
temp = HTMLParser().unescape(temp)

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Encode the text as UTF-8 to avoid the UnicodeEncodeError shown above.
result = result.encode('utf-8')
print(result)

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()
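
For readers on Python 3, where the HTMLParser module no longer exists under that name, a minimal sketch of the same idea could use html.unescape from the standard library (available since Python 3.4). The function name strip_html and the sample string are made up for illustration; the tag pattern is the one from the code above.

```python
import html
import re

# Same non-greedy tag pattern as above, compiled once for reuse.
TAG_RE = re.compile(r"<.*?>")


def strip_html(text):
    """Drop anything that looks like a tag, then decode HTML entities."""
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)


sample = "<p>&lsquo;Hello&rsquo; &amp; &ldquo;bye&rdquo;</p>"
print(strip_html(sample))
```

Tags are stripped before unescaping here so that escaped text such as &lt;b&gt; is not mistaken for a real tag; this differs from the order used in the code above. Because Python 3 strings are Unicode, the explicit encode('utf-8') is not needed if the output file is opened with encoding='utf-8'.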

Upvotes: 1

Shashank

Reputation: 13869

My first instinct is string.translate() in combination with string.maketrans(). This makes only one pass instead of several. Each call to str.replace() makes its own pass over the entire string, and you want to avoid that.

An example:

from string import ascii_lowercase, maketrans, translate

from_str = ascii_lowercase
to_str = from_str[-1]+from_str[0:-1]
foo = 'the quick brown fox jumps over the lazy dog.'
bar = translate(foo, maketrans(from_str, to_str))
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
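
Since maketrans() only maps single characters, multi-character entities such as &lsquo; need a different tool. One way to keep the single-pass idea is a regular expression that alternates over all the keys, with a dict lookup as the replacement. This is only a sketch, written for Python 3; the ENTITIES table is a shortened, illustrative subset of the question's list, and replace_entities is a hypothetical helper name.

```python
import re

# Illustrative subset of the replacements listed in the question.
ENTITIES = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
}

# One alternation over all keys, escaped so they match literally.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITIES))


def replace_entities(text):
    """Replace every known entity in a single scan of the text."""
    return ENTITY_RE.sub(lambda match: ENTITIES[match.group(0)], text)


print(replace_entities("&ldquo;It&rsquo;s fine&rdquo; &amp; done"))
```

A side effect of scanning once is that replaced text is not scanned again, so the result does not depend on the order of the entries the way a chain of replace() calls does.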

Upvotes: 1

avinash pandey

Reputation: 1381

You can use string.translate():

from string import maketrans  # Python 2 only; required to call maketrans().

# Example values; both strings must have the same length.
intab = "aeiou"    # original characters that need to be replaced
outtab = "12345"   # new characters, matched to intab position by position
trantab = maketrans(intab, outtab)  # helper in the string module that builds a translation table

text = "this is string example....wow!!!"  # your string
print text.translate(trantab)

Note that in Python 3, str.translate() can be significantly slower than in Python 2, especially if you translate only a few characters. This is because it must handle Unicode characters and therefore uses a dict to perform the translations instead of indexing into a string.
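
In Python 3, str.maketrans() also accepts a dict whose keys are single characters and whose values may be strings of any length. That fits the case where the text already contains the Unicode punctuation itself rather than named entities. A minimal sketch, with an illustrative subset of characters and a made-up table name:

```python
# Keys must be single characters; values may be longer strings.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2014": " - ",  # em dash, expanded to a spaced hyphen
})

text = "\u201cIt\u2019s done\u201d\u2014really"
print(text.translate(PUNCT_TABLE))
```

The table is built once and can be reused for every chunk of a large file, so each chunk is scanned a single time.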

Upvotes: 1
