Reputation: 1610
Hi everyone, I have a big file in the format given below. The data is in a "block" format. One block contains three rows: the time T, the user U, and the content W. For example, this is a block:
T 2009-06-11 21:57:23
U tracygazzard
W David Letterman is good man
Since I will only be using the blocks that contain a specific keyword, I want to slice the original massive file block by block rather than load the whole thing into memory: read in one block at a time, and if the content row contains the word "bike", write that block to disk.
You can use the following two blocks to test your script.
T 2009-06-11 21:57:23
U tracygazzard
W David Letterman is good man
T 2009-06-11 21:57:23
U charilie
W i want a bike
I have tried to do the work line by line:
data = open("OWS.txt", 'r')
output = open("result.txt", 'w')
for line in data:
    if line.find("bike") != -1:
        output.write(line)
Upvotes: 0
Views: 421
Reputation: 35269
As the format of your blocks is constant, you can use a list to hold a block, then check whether bike is in that block:
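data = open("OWS.txt", 'r')
output = open("result.txt", 'w')
chunk = []
for line in data:
    chunk.append(line)
    # A line starting with 'W' is the last row of a block
    if line[0] == 'W':
        # str(chunk) covers all three rows of the block
        if 'bike' in str(chunk):
            for row in chunk:
                output.write(row)
        # Start collecting the next block
        chunk = []
data.close()
output.close()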
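Note that str(chunk) looks at all three rows, so a user name containing "bike" would also match. If only the content row should count, a minimal sketch could look like the following. The function name filter_blocks and its parameters are made up for illustration, and it assumes every block ends with a row that starts with W:

```python
def filter_blocks(in_path, out_path, keyword="bike"):
    """Copy every T/U/W block whose W row contains `keyword`.

    Hypothetical helper: returns the number of blocks written.
    """
    written = 0
    # `with` closes both files even if an error occurs part-way through
    with open(in_path, "r") as data, open(out_path, "w") as output:
        chunk = []
        for line in data:
            chunk.append(line)
            # Assumption: a row starting with "W" closes the current block
            if line.startswith("W"):
                # Look only at the text after the "W" marker, so a user
                # name such as "bikefan" does not trigger a match
                if keyword in line[1:]:
                    output.writelines(chunk)
                    written += 1
                chunk = []
    return written
```

Calling filter_blocks("OWS.txt", "result.txt") still holds only one block in memory at a time, which fits the requirement in the question.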
Upvotes: 1
Reputation: 336158
You can use regular expressions:
import re
data = open("OWS.txt", 'r').read()  # Read the entire file into a string
output = open("result.txt", 'w')
for match in re.finditer(
    r"""(?mx)              # Verbose regex, ^ matches start of line
    ^T\s+(?P<T>.*)\s*      # Match first line
    ^U\s+(?P<U>.*)\s*      # Match second line
    ^W\s+(?P<W>.*)\s*      # Match third line""",
    data):
    if "bike" in match.group("W"):
        output.write(match.group())  # outputs entire match
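Since the question mentions a massive file, reading it all into one string may not be practical. As a rough sketch, the same pattern can be applied to one block at a time instead. This assumes every block is exactly three consecutive lines with no blank lines in between; the names BLOCK_RE and filter_blocks_regex are invented for this example:

```python
import re

# Same three-row pattern as above, compiled once and applied per block
BLOCK_RE = re.compile(
    r"""(?mx)              # Verbose regex, ^ matches start of line
    ^T\s+(?P<T>.*)\s*      # Time row
    ^U\s+(?P<U>.*)\s*      # User row
    ^W\s+(?P<W>.*)\s*      # Content row
    """)


def filter_blocks_regex(in_path, out_path, keyword="bike"):
    """Hypothetical helper: write blocks whose W row contains `keyword`."""
    written = 0
    with open(in_path, "r") as data, open(out_path, "w") as output:
        # zip over the same file iterator three times yields three
        # consecutive lines per step; an incomplete final block is dropped
        for lines in zip(data, data, data):
            block = "".join(lines)
            match = BLOCK_RE.match(block)
            if match and keyword in match.group("W"):
                output.write(block)
                written += 1
    return written
```

Blocks that do not fit the T/U/W layout are skipped rather than raising an error, which may or may not be what you want for malformed input.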
Upvotes: 1