Reputation: 507
Hello, I'm trying to grab the data that follows a keyword in a text document as a project, and I am able to do this using the code below. I am very new to Python and I'm not sure where to start troubleshooting this issue.
Keyword = raw_input("Please enter the keyword: ")
go = False
start = Keyword
end = "[+][+]"
with open("test.txt") as infile:
    for line in infile:
        line = line.strip()
        if start in line:
            # Start printing from the line that contains the keyword
            go = True
        elif end in line:
            # Stop printing at the end marker and skip the marker line itself
            go = False
            continue
        if go:
            print(line)
This code works great for a sample text document like this:
Something Something Something Something
Something Something Something Something
Something Keyword:
Data
Data
Data
Data
End
Something
However, I run into an issue when trying to read from a file that contains strange characters, for example:
2015/08/14 15:48:30 OUT:
2015/08/14 15:48:30 OUT:
PQ=
(3< ’’aÈ©ÿY˜ü â [+][+]52
2015/08/14 15:48:31:IN[+]53[+][+]101[+]-1[+] **Keyword** ,SHOWALL
**data**
**data**
**data**
**data**
**data**
**data**
**data**
end
The goal is to read from this text document and print only what is between the Keyword and End, but the code will not execute properly if the file has these characters in it. For the project I cannot remove these characters; the script just has to read through the document, find the keyword, and print what is in between.
Any ideas on how I can read a text document that contains these strange characters and have it processed correctly rather than just crashing?
Upvotes: 1
Views: 1371
Reputation: 46779
First you need to open the file in binary mode. You could then use a regular expression to extract all the text between your entered keyword and "end". Whole words could then be extracted using another regular expression:
import re

with open("input.txt", "rb") as f_input:
    start_token = raw_input("Please enter the start keyword: ")
    end_token = raw_input("Please enter the end keyword: ")
    reText = re.search("%s(.*?)%s" % (re.escape(start_token), re.escape(end_token)), f_input.read(), re.S)
    if reText:
        for word in re.findall(r"\b\w+\b", reText.group(1)):
            print word
    else:
        print "not found"
For your example text this would display:
SHOWALL
data
data
data
data
data
data
data
Or, if you just want all of the text between the two points, print reText.group(1) instead of using the for loop.
Updated: added support for a variable end token.
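For anyone running this under Python 3, here is a minimal sketch of the same binary-mode regex idea. It assumes Python 3, where data read in "rb" mode is bytes, so the tokens and patterns have to be bytes as well. The function names extract_between and words_in, and the inline sample, are made up for illustration and stand in for the contents of the real file.

```python
import re


def extract_between(data, start_token, end_token):
    """Return the bytes between start_token and end_token, or None if not found."""
    # A pattern applied to bytes must itself be bytes; re.escape accepts bytes too.
    pattern = re.escape(start_token) + b"(.*?)" + re.escape(end_token)
    match = re.search(pattern, data, re.S)  # re.S lets "." span newlines
    return match.group(1) if match else None


def words_in(chunk):
    """Return the whole words in a bytes chunk as str values."""
    # With a bytes pattern, \w only matches ASCII letters, digits and "_",
    # so decoding each match as ASCII cannot fail.
    return [word.decode("ascii") for word in re.findall(rb"\b\w+\b", chunk)]


# Hypothetical stand-in for: open("input.txt", "rb").read()
sample = b"\xc8\xa9\xff [+][+]52\nIN[+]53 **Keyword** ,SHOWALL\n**data**\n**data**\nend\n"

chunk = extract_between(sample, b"Keyword", b"end")
if chunk is None:
    print("not found")
else:
    for word in words_in(chunk):
        print(word)
```

If the tokens come from input(), they would be str and would need encoding first, for example input(...).encode("ascii"), assuming the keywords are plain ASCII.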
Upvotes: 2
Reputation: 3660
The file contains binary content, so it should be opened in binary mode.
You can do this with:
data_file = open("test.txt", "rb")
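As a rough sketch of how the line-by-line loop from the question might look once the file is read in binary mode, assuming Python 3 (where each line is then a bytes object): the helper name lines_between and the in-memory sample are hypothetical, and io.BytesIO only stands in for open("test.txt", "rb").

```python
import io


def lines_between(stream, start, end):
    """Collect lines from the one containing `start` up to one containing `end`."""
    collected = []
    active = False
    for raw in stream:  # iterating a binary stream yields bytes lines
        line = raw.strip()
        if start in line:
            active = True  # the keyword line itself is kept, as in the question
        elif end in line:
            active = False
            continue
        if active:
            # Decode only for display; undecodable bytes become U+FFFD
            collected.append(line.decode("utf-8", errors="replace"))
    return collected


# Hypothetical stand-in for: open("test.txt", "rb")
sample = io.BytesIO(
    b"PQ=\n\xc8\xa9\xff [+][+]52\nIN[+]53 Keyword ,SHOWALL\ndata\ndata\n[+][+]\nafter\n"
)

result = lines_between(sample, b"Keyword", b"[+][+]")
for line in result:
    print(line)
```

The start and end markers are bytes here because a str cannot be searched for inside a bytes line in Python 3.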
Upvotes: 1