Trying to read certain text from a file that has strange characters. (Python)

Question

Hello, I'm trying to extract the data that follows a keyword in a text document as a project, and I am able to do this using the code below. I am very new to Python and I'm not sure where to start troubleshooting this issue.

Keyword = raw_input("Please enter the keyword: ")

go = False

start = Keyword
end = "[+][+]"

with open("test.txt") as infile:
    for line in infile:
        line = line.strip()
        if start in line:
            go = True
        elif end in line:
            go = False
            continue
        if go:
            print(line)

This code works great for a sample text document like this:

Something Something Something Something   
Something Something Something Something  
Something Keyword:  
 Data  
 Data  
 Data  
 Data  
End  
 Something

However, I run into an issue when trying to read from a file that contains strange characters, for example:

2015/08/14 15:48:30 OUT:
2015/08/14 15:48:30 OUT:
 PQ=
(3<   ’’aÈ©ÿY˜ü   â     [+][+]52

2015/08/14 15:48:31:IN[+]53[+][+]101[+]-1[+] **Keyword** ,SHOWALL
**data**
**data**
**data**
**data**
**data**
**data**
**data**
end

The goal is to read this text document and print only what lies between the keyword and "end". The code fails when the file contains these characters, and for the project I cannot remove them: the script has to read through the document as it is, find the keyword, and print out what is in between.

Any ideas on how I can read a text document that contains these strange characters and process it correctly, rather than just crashing?

Martin Evans · Accepted Answer

First, you need to open the file in binary mode. You can then use a regular expression to extract all of the text between your entered keyword and "end". Whole words can then be extracted using another regular expression:

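# Python 2 (raw_input, print statement)
import re

start_token = raw_input("Please enter the start keyword: ")
end_token = raw_input("Please enter the end keyword: ")

# Binary mode returns the raw bytes, so the odd characters are read as-is
with open("input.txt", "rb") as f_input:
    data = f_input.read()

# Non-greedy match of everything between the two tokens; re.S lets "." span newlines
reText = re.search("%s(.*?)%s" % (re.escape(start_token), re.escape(end_token)), data, re.S)

if reText:
    # Print each whole word found between the tokens
    for word in re.findall(r"\b\w+\b", reText.group(1)):
        print word
else:
    print "not found"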

For your example text this would display:

SHOWALL
data
data
data
data
data
data
data

Or if you just want all of the text between the two points, print reText.group(1) instead of the for loop.
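
The code above is Python 2. In Python 3 the same idea still applies, but a file opened in binary mode yields bytes, so the tokens and patterns have to be bytes as well. Here is a minimal sketch of that variant; the helper name extract_between and the sample bytes are made up for illustration and are not part of the original answer.

```python
import re


def extract_between(data, start_token, end_token):
    """Return the bytes between the first start_token and the next end_token, or None."""
    # re.escape accepts bytes and returns bytes, so the whole pattern stays a bytes pattern
    pattern = re.escape(start_token) + b"(.*?)" + re.escape(end_token)
    match = re.search(pattern, data, re.S)  # re.S lets "." match newlines too
    return match.group(1) if match else None


# Hypothetical sample standing in for the raw contents of the log file
sample = (
    b"(3<\x92\x92a\xc8\xa9\xffY [+][+]52\n\n"
    b"IN[+]53[+] **Keyword** ,SHOWALL\n**data**\n**data**\nend\n"
)

body = extract_between(sample, b"Keyword", b"end")

# With a bytes pattern, \w only matches ASCII letters, digits and underscore
words = re.findall(rb"\b\w+\b", body) if body is not None else []
for word in words:
    print(word.decode("ascii"))
```

With a real file, data would come from open("input.txt", "rb").read(), and a token typed at a prompt would need encoding first, for example input("Please enter the start keyword: ").encode("ascii").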

Updated: added support for a variable end token.
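
If you would rather keep the line-by-line loop from the question, binary mode should work there as well. The following is a rough Python 3 sketch, assuming the block of interest stops at the first line containing the end token; lines_between and the sample bytes are hypothetical, and an in-memory io.BytesIO stands in for open("test.txt", "rb").

```python
import io


def lines_between(stream, start, end):
    """Collect stripped lines from the first one containing start
    up to (not including) the next one containing end."""
    collected = []
    go = False
    for raw in stream:  # iterating a binary stream yields bytes lines
        line = raw.strip()
        if not go and start in line:
            go = True
        elif go and end in line:
            break  # assumption: stop at the first end marker
        if go:
            collected.append(line)
    return collected


# Hypothetical sample; a real script would use open("test.txt", "rb") instead
sample = io.BytesIO(
    b"PQ=\n(3<\x92\xc8\xa9\xff [+][+]52\n\n"
    b"IN[+]53[+][+]101 **Keyword** ,SHOWALL\n**data**\n**data**\nend\ntrailing\n"
)

result = lines_between(sample, b"Keyword", b"end")
for line in result:
    # latin-1 maps every byte to a character, so decoding cannot fail
    print(line.decode("latin-1"))
```

Note that the matching line itself is included in the output, as in the original loop, and that the comparisons are between bytes, so the keyword must be bytes too.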
