Reputation: 32226
I need to compare two files and remove from the text file the words that are found in the second (exclude list) file.
# cat remove.txt
test
junk
trash
unwanted
bad
worse
# cat corpus.txt
this is a test message to check if bad words are removed correctly. The second line may or may not have unwanted words. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for trash.
This Python code is working as expected.
import re
stop_words = list()
with open("remove.txt", "r") as f:
    for i in f.readlines():
        stop_words.append(i.replace("\n", ""))
# !> filteredtext.txt
file1 = open("corpus.txt")
line = file1.read()
words = line.split()
for r in words:
    r = re.sub(r"[^\w\s]", "", r)
    if not r in stop_words:
        appendFile = open("filteredtext.txt", "a")
        appendFile.write(" " + r)
        appendFile.close()
I would like to know if there is some Linux command-line magic possible in this case. The regular expression mentioned in the Python code is optional. The cleaned text need not be 100% clean; more than 90% accuracy is OK.
Expected output:
this is a message to check if words are removed correctly The second line may or may not have words The third line also need not be as clean as first and second line There can be paragraphs in the text corpus and the entire file should be checked for
Upvotes: 1
Views: 243
Reputation: 786359
You may use this gnu awk command:
awk -v RS='[[:space:]]+' 'FNR == NR {seen[$1]; next} !($1 in seen) {ORS=RT; print}' remove.txt corpus.txt
On a 450MB remove.txt file, the above awk command took 1 min 16 sec to complete.
To make it more readable:
awk -v RS='[[:space:]]+' 'FNR == NR {
    seen[$1]
    next
}
!($1 in seen) {
    ORS = RT
    print
}' remove.txt corpus.txt
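For readers who prefer to stay in Python, here is a rough sketch of the same idea (whole whitespace-delimited tokens looked up in a set, with each token keeping its own trailing whitespace, roughly what ORS=RT achieves); the file names are just the ones from the question:

import re

# build the set of words to drop (assumes remove.txt as shown in the question)
with open("remove.txt") as f:
    seen = set(f.read().split())

with open("corpus.txt") as f:
    text = f.read()

out = []
# iterate over (word, trailing-whitespace) pairs so spacing and newlines are preserved
for m in re.finditer(r"(\S+)(\s*)", text):
    word, space = m.groups()
    if word not in seen:
        out.append(word + space)

with open("filteredtext.txt", "w") as f:
    f.write("".join(out))

Like the awk version, this compares whole tokens, so a word with punctuation attached (e.g. trash.) is not removed, which fits the stated 90% accuracy requirement.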
Earlier Solution: Using a single gnu sed script:
sed -f <(sed 's~.*~s/ *\\<&\\> *//~' remove.txt) corpus.txt
this is amessage to check ifwords are removed correctly. The second line may or may not havewords. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for.
Upvotes: 2
Reputation: 36873
You are doing
for r in words:
    r = re.sub(r"[^\w\s]", "", r)
    if not r in stop_words:
        appendFile = open("filteredtext.txt", "a")
        appendFile.write(" " + r)
        appendFile.close()
but by doing that you are opening the output file for every word which is not in stop_words, which takes time. It is more efficient to open once, do what you want, and then close, i.e.
appendFile = open("filteredtext.txt", "w")
for r in words:
    r = re.sub(r"[^\w\s]", "", r)
    if r not in stop_words:
        appendFile.write(" " + r)
appendFile.close()
Note that I also used write ("w") mode, as we do not need to append ("a") in this case.
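As a further optional tweak (not part of the snippet above), a with block closes the file automatically, so the explicit close() is not needed; words and stop_words are the variables already defined in the question's code:

import re

with open("filteredtext.txt", "w") as appendFile:
    for r in words:
        r = re.sub(r"[^\w\s]", "", r)
        if r not in stop_words:
            appendFile.write(" " + r)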
Upvotes: 0
Reputation: 133780
Could you please try the following, written and tested with the shown samples in GNU awk. Will add a detailed explanation in some time too.
awk '
FNR==NR{                   # first file (remove.txt): store each line (word) as a key in arr
    arr[$0]
    next
}
{
    for(i=1;i<=NF;i++){
        if(count=="" && !($i in arr)){ count++ }   # remember that this line produced some output
        printf("%s%s",!($i in arr)?$i:"",(i==NF?"":(!($i in arr)?OFS:"")))
    }
    if(count){printf ORS;count=""}                 # newline only for lines that printed something
}' remove.txt corpus.txt
EDIT: As per OP's comment, to get only the words which match the other file, try the following.
awk '
FNR==NR{
    arr[$0]
    next
}
{
    for(i=1;i<=NF;i++){
        if(count=="" && ($i in arr)){ count++ }
        printf("%s%s",($i in arr)?$i:"",(i==NF?"":(($i in arr)?OFS:"")))
    }
    if(count){printf ORS;count=""}
}' remove.txt corpus.txt
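For completeness, a rough Python equivalent of this inverted filter (keeping only the listed words), again assuming the file names from the question:

with open("remove.txt") as f:
    keep = set(f.read().split())

with open("corpus.txt") as f:
    for line in f:
        # keep only the words that also appear in remove.txt; skip lines with no matches
        matched = [w for w in line.split() if w in keep]
        if matched:
            print(" ".join(matched))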
Upvotes: 2