Reputation: 73
I'm trying to write a script to automate a simple task: removing characters from txt files and saving each file under the same name but without those chars. I have multiple txt files, e.g. 1.txt, 2.txt ... 200.txt, stored in a directory (Documents). I also have a txt file with the characters I want to remove. At first I thought of comparing my chars_to_remove.txt against all my different files (1.txt, 2.txt...), but I couldn't find a way to do so. Instead, I created a string with all the chars I want to remove.
Let's say I have the following string in the file 1.txt:
Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).
I want to remove the α, β, and Δ chars from the string. This is my code so far.
import glob
import os
chars_to_remove = '‘’“”|n.d.…•∈αβδΔεθϑφΣμτσχ€$∞http:www.←→≥≤<>▷×°±*⁃'
file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)
for f in file_names:
    outfile = open(f,'r',encoding='latin-1')
    data = outfile.read()
    if chars_to_remove in data:
        data.replace(chars_to_remove, '')
    outfile.close()
The variable data stores, in each iteration, all the content of the current txt file. I want to check whether any of the chars_to_remove are in the string and remove them with the replace() function. I tried different approaches suggested here and here without success.
Also, I tried to compare it as a list:
chars_to_remove = ['‘','’','“','”','|','n.d.','…','•','∈','α','β','δ','Δ','ε','θ','ϑ','φ','Σ','μ','τ','σ','χ','€','$','∞','http:','www.','←','→','≥','≤','<','>','▷','×','°','±','*','⁃']
but got datatype errors when comparing.
Any further idea will be appreciated!
Upvotes: 1
Views: 572
Reputation: 318
It may not be as fast, but why not use Regex to remove the characters/phrases?
import re
pattern = re.compile(r"(‘|’|“|”|\||n\.d\.|…|•|∈|α|β|δ|Δ|ε|θ|ϑ|φ|Σ|μ|τ|σ|χ|€|\$|∞|http:|www\.|←|→|≥|≤|<|>|▷|×|°|±|\*|⁃)")
result = pattern.sub("", 'Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).')
print(result)
Mean concentrations , maximum value ratio  and reductions in NO2 due to the lockdown , March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).
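The alternation can also be built from the list in the question instead of being typed by hand, which avoids forgetting to escape metacharacters such as `.`, `$`, `|` and `*`. A minimal sketch, assuming Python 3; `build_pattern` and `strip_tokens` are hypothetical helper names, not something from the answer above:

```python
import re

# Tokens copied from the list in the question.
tokens_to_remove = ['‘', '’', '“', '”', '|', 'n.d.', '…', '•', '∈', 'α', 'β',
                    'δ', 'Δ', 'ε', 'θ', 'ϑ', 'φ', 'Σ', 'μ', 'τ', 'σ', 'χ', '€',
                    '$', '∞', 'http:', 'www.', '←', '→', '≥', '≤', '<', '>',
                    '▷', '×', '°', '±', '*', '⁃']


def build_pattern(tokens):
    """Compile one regex that matches any of the given literal tokens."""
    # Longest tokens first, so multi-character entries are tried before
    # any shorter entry that might share a prefix with them.
    ordered = sorted(tokens, key=len, reverse=True)
    # re.escape turns every token into a literal, so '.', '$', '|' and '*'
    # lose their special regex meaning.
    return re.compile('|'.join(re.escape(token) for token in ordered))


def strip_tokens(text, pattern):
    """Return text with every match of pattern removed."""
    return pattern.sub('', text)


pattern = build_pattern(tokens_to_remove)
sample = 'Mean concentrations α, maximum value ratio β and reductions in NO2.'
print(strip_tokens(sample, pattern))
```

Because the tokens are escaped, `n.d.` only matches the literal abbreviation and not, say, any `n` followed by an arbitrary character and a `d`.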
Upvotes: 1
Reputation: 1465
The most efficient way is the string translate() method, which avoids looping over each character to remove. The output file name must be defined in some manner.
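import glob
import os

# translate() works one character at a time, so only single characters go here.
chars_to_remove = '‘’“”|…•∈αβδΔεθϑφΣμτσχ€$∞←→≥≤<>▷×°±*⁃'
# Multi-character tokens are removed with replace() instead; passing them to
# translate() would delete every n, d, h, t, p, w, '.' and ':' in the text.
tokens_to_remove = ['n.d.', 'http:', 'www.']
# Python 3: the third argument of str.maketrans lists the characters to delete.
translator = str.maketrans('', '', chars_to_remove)
file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)
for f in file_names:
    with open(f, 'r', encoding='latin-1') as infile:
        data = infile.read()
    # translate() and replace() return new strings, so keep the result.
    data = data.translate(translator)
    for token in tokens_to_remove:
        data = data.replace(token, '')
    # Now data is translated
    # You must write it in a new file
    with open('...', 'wt') as outfile:
        outfile.write(data)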
This code works, but it is inefficient: each file is fully loaded into memory. A better way is to iterate over infile line by line and write to outfile as you go.
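The line-by-line idea could be sketched as follows. This assumes Python 3 and UTF-8 files (the question opens them as latin-1, so pass a different `encoding` if that is really what the files use). `clean_line` and `clean_file` are hypothetical helpers. To keep the same file name, the sketch writes to a temporary file in the same directory and then swaps it in for the original:

```python
import os
import tempfile

# Single characters are deleted with translate(); multi-character tokens
# need replace(), since translate() works one character at a time.
chars_to_remove = '‘’“”|…•∈αβδΔεθϑφΣμτσχ€$∞←→≥≤<>▷×°±*⁃'
tokens_to_remove = ['n.d.', 'http:', 'www.']
translator = str.maketrans('', '', chars_to_remove)


def clean_line(line):
    """Return line without the unwanted tokens and characters."""
    for token in tokens_to_remove:
        line = line.replace(token, '')
    return line.translate(translator)


def clean_file(path, encoding='utf-8'):
    """Rewrite path in place, holding only one line in memory at a time."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the original so os.replace stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding=encoding) as outfile, \
                open(path, 'r', encoding=encoding) as infile:
            for line in infile:
                outfile.write(clean_line(line))
        # Both files are closed here; swap the cleaned copy in.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and drop the partial copy.
        os.remove(tmp_path)
        raise
```

With this in place, the loop from the question reduces to calling `clean_file(f)` for each `f` in `file_names`. Since none of the tokens contains a newline, processing one line at a time gives the same result as processing the whole file.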
Upvotes: 0