Reputation: 305

How to delete subsets of lines based on keywords from a list?

I have the following file:

This
is
a
testfile 
wj5j keyword 1
WFEWF
O%LWJZ keyword 2
which
should
lpokpij keyword 3
123123das
kpmnvf keyword 4
just
contain
the 
following
lines.

from which I need to delete the subsets of lines between keyword 1 and keyword 2, as well as between keyword 3 and keyword 4, so that the result looks like this:

This
is
a
testfile 
which
should
just
contain
the 
following
lines.

I tried the following, which prints only the lines containing the keywords, but not the lines in between. My idea was that once I had all of those lines printed, I could delete them from the file.

with open ("newfile_TEST1.txt", mode = "r") as file:
    keywords = ['keyword 1', 'keyword 2','keyword 3','keyword 4']
    lines = file.readlines()
    for lineno, line in enumerate(file,1):
        matches = [k for k in keywords if k  in line]
        if matches:
            print(line)

What can I do to improve my code?

Upvotes: 1

Answers (4)

Shilpi Mishra

Reputation: 1

I used the split function of the re module.

With it, I split the whole string into chunks, using the keyword lines as separators. I then kept only the chunks at even positions: the chunks at odd positions hold the data between two keywords, e.g. pair("keyword 1", "keyword 2") and pair("keyword 3", "keyword 4"), which is what we want to drop. Skipping the odd chunks leaves a few empty lines behind, so I call rstrip() on each kept chunk to remove them.

import re

Lmatches = []
Loutput = []
# Matches from the first word boundary on a line through " keyword N"
patt = re.compile(r'\b.* keyword [1-4]')
with open("f1.txt", "r") as f:
    data = f.read()
# Split on the keyword lines: odd-indexed chunks lie between a keyword pair
matches = patt.split(data)
for i in range(len(matches)):
    if i % 2 == 0:
        Lmatches.append(matches[i])
for elem in Lmatches:
    Loutput.append(elem.rstrip())  # to remove empty lines
with open("output.txt", "w") as wfile:
    wfile.writelines(Loutput)
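
As a possible variation on the regex idea, the blocks could also be removed directly with re.sub instead of splitting and picking chunks. This is only a sketch: the function name drop_keyword_blocks and the explicit list of (start, end) pairs are assumptions made for illustration, and it matches a keyword anywhere in a line.

```python
import re


def drop_keyword_blocks(text, pairs):
    """Remove each block running from a line containing `start`
    through the next line containing `end` (both lines included)."""
    for start, end in pairs:
        # [^\n]* keeps each keyword match on one line; the lazy .*? with
        # DOTALL spans the lines in between and stops at the first end line.
        pattern = re.compile(
            r"^[^\n]*" + re.escape(start) + r"[^\n]*\n"
            r".*?"
            r"^[^\n]*" + re.escape(end) + r"[^\n]*(?:\n|\Z)",
            flags=re.DOTALL | re.MULTILINE,
        )
        text = pattern.sub("", text)
    return text
```

Unlike the split version, this keeps the trailing whitespace of the surviving lines untouched, since nothing is stripped.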

Upvotes: 0

qkzk

Reputation: 297

It's not really elegant, but you could do something like this:

with open("file.txt", mode="r") as file:
    lines = file.readlines()

keywords = ["keyword 1", "keyword 2", "keyword 3", "keyword 4"]
line = 0
to_keep = True
kept = []

while line < len(lines):
    has_keyword = any((keyword in lines[line] for keyword in keywords))
    if to_keep and not has_keyword:
        kept.append(lines[line])
    if has_keyword:
        to_keep = not to_keep
    line += 1

for line in kept:
    print(line, end="")


with open("newfile.txt", mode="w") as file:
    file.writelines(kept)

Output:

This
is
a
testfile
which
should
just
contain
the
following
lines.

Upvotes: 1

S.B

Reputation: 16556

This solution is meant for huge text files, where you don't want to load all the lines into memory with readlines() or similar.

keywords = ['keyword 1', 'keyword 2', 'keyword 3', 'keyword 4']

keywords_it = iter(keywords)
pair = (next(keywords_it), next(keywords_it))
write = True

with open("newfile_TEST1.txt") as f:
    for line in f:
        if not line.rstrip().endswith(pair[0]) and write:
            print(line, end='')

        elif line.rstrip().endswith(pair[1]):
            write = True
            try:
                pair = (next(keywords_it), next(keywords_it))
            except StopIteration:
                pass
        else:
            write = False

Output:

This
is
a
testfile 
which
should
just
contain
the 
following
lines.

The idea is that we take one pair of keywords from the keywords list at a time, like ('keyword 1', 'keyword 2'). While iterating over the lines in the file, if a line does not end with the first item in the pair (and we are currently writing), it is a normal line and should be printed. If it ends with the first item in the pair, the write flag is set to False, which means we stop writing.
If a line ends with the second item in the pair, we can start writing again after that line, so we fetch the next pair and set the write flag back to True.
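
The same pairing logic could be wrapped in a generator so that the kept lines can be written to a file instead of printed. This is a rough sketch under a few assumptions: the name filter_blocks is hypothetical, the keywords list has an even length, and keywords are matched at the end of a line as in the code above.

```python
def filter_blocks(lines, keywords):
    """Yield the lines that fall outside each (start, end) keyword pair."""
    # Pair up the keywords: (k[0], k[1]), (k[2], k[3]), ...
    pairs = iter(zip(keywords[::2], keywords[1::2]))
    pair = next(pairs, None)
    skipping = False
    for line in lines:
        stripped = line.rstrip()
        if skipping:
            # Inside a block: drop lines until the closing keyword is seen
            if stripped.endswith(pair[1]):
                skipping = False
                pair = next(pairs, None)  # None once all pairs are used up
        elif pair is not None and stripped.endswith(pair[0]):
            skipping = True  # opening keyword: drop this line and start skipping
        else:
            yield line
```

Because it consumes any iterable of lines lazily, an open file object can be passed as lines and the result handed straight to writelines() on a second file, so the whole file is never held in memory.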

Upvotes: 1

EllipticBike38

Reputation: 111

I would use a flag (glitch_flair below) that is True from the first match until the next one, and then False again:

with open ("./txt.txt", mode = "r") as file:
    keywords = ['keyword 1', 'keyword 2','keyword 3','keyword 4']
    lines = file.readlines()
    glitch_flair=False
    for lineno, line in enumerate(lines,1):
        matches = [k for k in keywords if k  in line]
        if not matches and not glitch_flair:
            print(line, end='')
        elif matches:
            glitch_flair=not glitch_flair
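
Since the question asks to delete the lines from the file rather than print the rest, the same flag could drive a rewrite of the file itself. The sketch below is one way to do that; the function name delete_blocks_in_place is made up for illustration, and it assumes the keyword lines strictly alternate between opening and closing a block.

```python
def delete_blocks_in_place(path, keywords):
    """Rewrite `path` without the keyword lines and the lines between them."""
    with open(path, mode="r") as file:
        lines = file.readlines()

    kept = []
    inside_block = False  # same role as the flag above
    for line in lines:
        if any(k in line for k in keywords):
            # A keyword line toggles the flag and is itself dropped
            inside_block = not inside_block
        elif not inside_block:
            kept.append(line)

    # Overwrite the original file with the surviving lines
    with open(path, mode="w") as file:
        file.writelines(kept)
    return kept
```

Overwriting the source is destructive, so it is probably worth trying this on a copy of the file first.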

Upvotes: 1
