BlackAndWhite

Reputation: 73

How to remove multiple characters from multiple txt files

I'm trying to write a script to automate a simple task: removing characters from txt files and saving each file under the same name, but without those characters. I have multiple txt files (e.g. 1.txt, 2.txt ... 200.txt) stored in a directory (Documents), and a txt file with the characters I want to remove. At first I thought of comparing my chars_to_remove.txt against all the other files (1.txt, 2.txt...), but I couldn't find a way to do so. Instead, I created a string with all the characters I want to remove.

Let's say I have the following string in 1.txt file:

Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).

I want to remove the α, β, and Δ characters from the string. This is my code so far.

import glob 
import os 

chars_to_remove = '‘’“”|n.d.…•∈αβδΔεθϑφΣμτσχ€$∞http:www.←→≥≤<>▷×°±*⁃'

file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)

for f in file_names:
    outfile = open(f,'r',encoding='latin-1')
    data = outfile.read()
    if chars_to_remove in data:
        data.replace(chars_to_remove, '')
    outfile.close()

On each iteration, the variable data holds the full content of one txt file. I want to check whether any of the chars_to_remove occur in that string and remove them with the replace() function. I tried different approaches suggested here and here without success.

Also, I tried to compare it as a list:

chars_to_remove = ['‘','’','“','”','|','n.d.','…','•','∈','α','β','δ','Δ','ε','θ','ϑ','φ','Σ','μ','τ','σ','χ','€','$','∞','http:','www.','←','→','≥','≤','<','>','▷','×','°','±','*','⁃']

but got datatype errors when comparing.

Any further idea will be appreciated!

Upvotes: 1

Views: 572

Answers (2)

Keivan Ipchi Hagh

Reputation: 318

It may not be as fast, but why not use Regex to remove the characters/phrases?

import re

# Escape regex metacharacters so '.', '$', '|' and '*' are matched literally
pattern = re.compile(r"(‘|’|“|”|\||n\.d\.|…|•|∈|α|β|δ|Δ|ε|θ|ϑ|φ|Σ|μ|τ|σ|χ|€|\$|∞|http:|www\.|←|→|≥|≤|<|>|▷|×|°|±|\*|⁃)")
result = pattern.sub("", 'Mean concentrations α, maximum value ratio β and reductions in NO2 due to the lockdown Δ, March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).')
print(result)

Output

Mean concentrations , maximum value ratio  and reductions in NO2 due to the lockdown , March 2020, 2019 and 2018 in Madrid and Barcelona (Spain).
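The alternation above can also be built from the question's list instead of being typed by hand. The following is only a minimal sketch: the helper name build_pattern, the shortened token list, and the sample text are made up for illustration. re.escape takes care of characters such as '.', '$', '*' and '|', which otherwise have a special meaning in a pattern.

```python
import re

# A shortened, illustrative subset of the tokens from the question
tokens = ['n.d.', 'http:', 'www.', 'α', 'β', 'Δ', '$', '*', '|']

def build_pattern(tokens):
    # Hypothetical helper: longest tokens first, so a multi-character token
    # is tried before any shorter token that could overlap with it.
    ordered = sorted(tokens, key=len, reverse=True)
    # re.escape turns regex metacharacters into literal characters
    return re.compile('|'.join(re.escape(t) for t in ordered))

pattern = build_pattern(tokens)
text = 'Mean α, ratio β, lockdown Δ (n.d.), cost $5 * 2, see www.example.org'
cleaned = pattern.sub('', text)
print(cleaned)
```

Because the pattern is compiled once, the same pattern object can be reused for every file in the loop.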

Upvotes: 1

Glauco

Reputation: 1465

The most efficient way is str.translate, which avoids looping over each invalid character. The output file path must be defined in some manner.

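import glob
import os

# Single characters only: translate() works per character, so multi-character
# tokens such as 'n.d.', 'http:' or 'www.' need str.replace() or a regex instead.
chars_to_remove = '‘’“”|…•∈αβδΔεθϑφΣμτσχ€$∞←→≥≤<>▷×°±*⁃'
# Python 3: the third argument of str.maketrans lists the characters to delete
translator = str.maketrans('', '', chars_to_remove)

file_location = os.path.join('Desktop', 'Documents', '*.txt')
file_names = glob.glob(file_location)
print(file_names)

for f in file_names:
    # Encoding kept from the question; latin-1 cannot represent α, β or Δ,
    # so use the files' real encoding (e.g. 'utf-8') if it differs.
    with open(f, 'r', encoding='latin-1') as infile:
        data = infile.read()
    # Strings are immutable: keep the copy returned by translate()
    data = data.translate(translator)

    # Now data is translated
    # You must write it in a new file
    with open('...', 'wt') as outfile:
        outfile.write(data)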

Hint

This code works, but it is inefficient: each file is fully loaded into memory. A better way is to iterate over infile line by line and write to outfile as you go.
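A minimal sketch of that line-by-line approach, under a few assumptions: the function name strip_chars_in_place is made up, the files are assumed to be UTF-8, and only a subset of single characters from the question is listed. The cleaned text goes to a temporary file in the same directory, which then replaces the original so the file keeps its name.

```python
import os
import tempfile

# Single characters to delete (illustrative subset of the question's list)
CHARS_TO_REMOVE = 'αβΔ‘’“”…•'
TRANSLATOR = str.maketrans('', '', CHARS_TO_REMOVE)

def strip_chars_in_place(path, encoding='utf-8'):
    """Rewrite `path` line by line without loading the whole file.

    Hypothetical helper; the utf-8 default is an assumption.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the final swap stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding=encoding) as outfile, \
             open(path, 'r', encoding=encoding) as infile:
            for line in infile:
                outfile.write(line.translate(TRANSLATOR))
        # Swap the cleaned copy in under the original file name
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

It could be called from the existing loop as strip_chars_in_place(f) for each f in file_names. Multi-character tokens such as 'n.d.' are not handled by translate and would need a separate replace or regex step per line.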

Upvotes: 0
