Reputation: 49
I am fairly new to Python.
I have a .txt file with about 500k lines of text. The general structure is like this:
WARC-TREC-ID:
hello
my
name
is
WARC-TREC-ID:
example
text
WARC-TREC-ID:
I would like to extract all contents in between the "WARC-TREC-ID:" keywords.
This is what I have already tried:
content_list = []
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors='ignore') as openfile2:
    for line in openfile2:
        for item in line.split("WARC-TREC-ID:"):
            if "WARC-TREC-ID:" in item:
                content = item[item.find("WARC-TREC-ID:") + len("WARC-TREC-ID:"):]
                content_list.append(content)
This returns an empty list.
I have also tried:
import re
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r') as openfile3:
    m = re.search('WARC-TREC-ID:(.+?)WARC-TREC-ID:', openfile3)
    if m:
        found = m.group(1)
This raises TypeError: expected string or bytes-like object
Upvotes: 0
Views: 106
Reputation: 5740
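For a file that contains your data:
with open('data.txt', 'r') as infile:
    raw_data = infile.read()
# split() breaks on any whitespace, so the result is one flat list of words
result = [x for x in raw_data.split() if x != 'WARC-TREC-ID:']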
Output:
['hello', 'my', 'name', 'is', 'example', 'text']
Upvotes: -1
Reputation: 82785
Try:
content_list = []
with open(filename) as infile:
    for line in infile:                      # Iterate over each line
        if 'WARC-TREC-ID:' in line:          # Check if the line contains 'WARC-TREC-ID:'
            content_list.append([])          # Start a new, empty group
        else:
            content_list[-1].append(line)    # Append content to the current group
print(content_list)
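A variant of the same idea, offered only as a sketch: it assumes you want the trailing newlines stripped, any lines before the first marker ignored (otherwise content_list[-1] would raise IndexError on an empty list), and the empty group after a trailing marker dropped. The names MARKER and group_lines are made up for this illustration.

```python
MARKER = "WARC-TREC-ID:"

def group_lines(lines):
    """Return one list of stripped lines per marker-delimited group."""
    groups = []
    for line in lines:
        if MARKER in line:
            groups.append([])            # a marker line starts a new group
        elif groups:                     # skip anything before the first marker
            text = line.strip()
            if text:                     # skip blank lines
                groups[-1].append(text)
    return [g for g in groups if g]      # drop empty groups, e.g. after a trailing marker

sample = """WARC-TREC-ID:
hello
my
name
is
WARC-TREC-ID:
example
text
WARC-TREC-ID:
"""
print(group_lines(sample.splitlines()))
# [['hello', 'my', 'name', 'is'], ['example', 'text']]
```

Since a file object yields its lines when iterated, it should also work as group_lines(infile) inside a with open(filename) as infile: block.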
Upvotes: 2
Reputation: 1684
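In your second approach, pass the file's contents as a string (for example, openfile3.read()), because re.search expects a string argument, not a file object. Even then, re.search returns only the first match, so you might want to use re.findall instead. Also note that . does not match newlines unless you pass re.DOTALL, which a multi-line block needs.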
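A minimal sketch of that approach, assuming the whole file fits in memory and that the marker lines look exactly like the sample in the question. The pattern here is changed from the one in the question: with 'WARC-TREC-ID:(.+?)WARC-TREC-ID:' each match consumes its closing marker, so findall would skip every other block. A lookahead avoids that. BLOCK_RE and extract_blocks are names invented for this example.

```python
import re

# Lazily capture everything after a marker, up to the next marker or the end of the text.
# The lookahead (?=...) does not consume the next marker, so it can open the next match.
# re.DOTALL lets "." match newlines, which a multi-line block requires.
BLOCK_RE = re.compile(r"WARC-TREC-ID:(.*?)(?=WARC-TREC-ID:|\Z)", re.DOTALL)

def extract_blocks(text):
    """Return the stripped text of every non-empty block between markers."""
    return [block.strip() for block in BLOCK_RE.findall(text) if block.strip()]

sample = "WARC-TREC-ID:\nhello\nmy\nname\nis\nWARC-TREC-ID:\nexample\ntext\nWARC-TREC-ID:\n"
print(extract_blocks(sample))
# ['hello\nmy\nname\nis', 'example\ntext']
```

For a real file, you would pass the result of read() on the opened file to extract_blocks. If a marker line also carries an ID after the colon, that ID would end up at the start of the captured block.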
Upvotes: 0