Charon

Reputation: 2364

Python Regex Alphabet and spaces

I have a file that contains random junk ASCII characters.

However, the file also contains a message written in English.

Like this:

...˜ÃÕ=òaãNÜ ß§#üxwáã MESSAGE HIDDEN IN HERE ŸÎ=N‰çÈ^XvU…”vN˜...

I'm trying to write a Python regex that looks for a pattern beginning with 6 letters or spaces and ending with 6 letters or spaces.

That way, as long as the message is at least 12 letters or spaces long, it should be output.

This is what I've come up with, but it doesn't seem to work.

regex = re.compile('''
([A-Z ]){6,}                                        
([A-Z ]){6,}              
''', re.I | re.X )

Upvotes: 1

Views: 10352

Answers (3)

Deelaka

Reputation: 13693

Your Regex:

([A-Z ]){6,}                                        
([A-Z ]){6,}

doesn't work because, as you can see, it expects quite a lot of spaces between the two groups:

[Image: regular expression visualization]

Is this what you were looking for?

import re

reg = re.compile( "[A-Z ]{6,}[A-Z ]{6,}")
string = "...˜ÃÕ=òaãNÜ ß§#üxwáã MESSAGE HIDDEN IN HERE ŸÎ=N‰çÈ^XvU…”vN˜..."

print reg.findall(string)

Output:

[' MESSAGE HIDDEN IN HERE ']
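One more difference between the two patterns may be worth checking: the question's version wraps each character class in a capturing group, and `re.findall` returns the captured groups rather than the whole match when a pattern contains groups. The sketch below is written for Python 3 (an assumption; the thread itself targets Python 2) and compares the question's pattern, compiled with its original `re.I | re.X` flags, against the group-free version. The variable names are only illustrative.

```python
import re

text = "...˜ÃÕ=òaãNÜ ß§#üxwáã MESSAGE HIDDEN IN HERE ŸÎ=N‰çÈ^XvU…”vN˜..."

# The question's pattern: under re.X the layout whitespace is ignored,
# but each character class sits inside a capturing group.
grouped = re.compile(r'''
([A-Z ]){6,}
([A-Z ]){6,}
''', re.I | re.X)

# The same classes without capturing groups.
ungrouped = re.compile(r"[A-Z ]{6,}[A-Z ]{6,}")

# Two adjacent {6,} runs of one class accept the same text as one {12,} run.
single_run = re.compile(r"[A-Z ]{12,}")

group_results = grouped.findall(text)    # tuples holding the last capture of each group
whole_matches = ungrouped.findall(text)  # full matched strings
single_matches = single_run.findall(text)

print(group_results)
print(whole_matches)
print(single_matches)

# The grouped pattern still matches the message; only findall hides it.
print(grouped.search(text).group(0))
```

If the groups are needed for some other reason, `(?:...)` non-capturing groups or `finditer` with `match.group(0)` should give the full text back.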

Upvotes: 4

Jon

Reputation: 12874

Try the following regular expression. With your example, I only needed to check for a single group:

import re
pattern_obj = re.compile('[a-zA-Z ]{6,}', re.I)
extracted_patterns = pattern_obj.findall(ur'your_string')
print extracted_patterns

Judging from your Stack Overflow tag, I assume you are using Python 2. In that case, you have to make sure that the string you read in is unicode.

Output

[u' MESSAGE HIDDEN IN HERE ']
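For readers on Python 3, where `str` is already unicode, a rough equivalent is to decode the file while reading it. The sketch below assumes a `latin-1` decoding, chosen only because it maps every possible byte to a character and so never fails on junk; the sample bytes and the temporary file are hypothetical stand-ins for the real file.

```python
import os
import re
import tempfile

# Hypothetical junk bytes standing in for the real file contents.
sample_bytes = (b"\x98\xc3\xd5=\xf2a\xe3N\xdc \xdf\xa7#\xfcxw\xe1\xe3"
                b" MESSAGE HIDDEN IN HERE "
                b"\x9f\xce=N\x89\xe7\xc8^XvU")

with tempfile.NamedTemporaryFile("wb", delete=False) as handle:
    handle.write(sample_bytes)
    junk_path = handle.name

# Assumption: latin-1 decodes any byte sequence without raising an error.
with open(junk_path, encoding="latin-1") as source:
    text = source.read()
os.remove(junk_path)

pattern_obj = re.compile(r"[a-zA-Z ]{6,}")
extracted_patterns = pattern_obj.findall(text)
print(extracted_patterns)
```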

General recommendation: It can sometimes be difficult to find a good regular expression. The little-known flag re.DEBUG can be very useful in this case.

pattern_obj = re.compile('[a-zA-Z ]{6,}', re.DEBUG)
max_repeat 6 4294967295
  in
    range (97, 122)
    range (65, 90)
    literal 32
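The dump above is printed at compile time. Here is a small sketch of the same call under Python 3 (an assumption, since the answer targets Python 2) that captures the dump into a string instead of letting it go straight to the console. The exact layout of the dump differs between Python versions, so no output is reproduced here.

```python
import contextlib
import io
import re

# re.DEBUG writes the parsed pattern to standard output while compiling,
# so redirect stdout for the duration of the compile call.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    pattern_obj = re.compile(r"[a-zA-Z ]{6,}", re.DEBUG)

debug_dump = buffer.getvalue()
print(debug_dump)
```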

Upvotes: 2

PepperoniPizza

Reputation: 9102

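import re

# Two runs of at least 6 letters or whitespace characters, with anything in between
word = re.compile(r'[a-zA-Z\s]{6,}.+[a-zA-Z\s]{6,}')

# filename is the path to the file containing the junk data
filein = open(filename, 'rb').read()
print re.findall(word, filein)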
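Under Python 3, a file opened in 'rb' mode yields bytes, and a bytes subject needs a bytes pattern. The sketch below is a variation under that assumption, not the answer's exact approach: it searches for a single run of letters and spaces instead of two runs joined by `.+`, since a greedy `.+` would also swallow any junk sitting between two runs. The sample bytes are a hypothetical stand-in for the file contents.

```python
import re

# Hypothetical stand-in for open(filename, 'rb').read().
filein = (b"\x98\xc3\xd5=\xf2a\xe3N\xdc \xdf\xa7#\xfcxw\xe1\xe3"
          b" MESSAGE HIDDEN IN HERE "
          b"\x9f\xce=N\x89\xe7\xc8^XvU")

# A bytes pattern (rb prefix) is required to search a bytes object.
word = re.compile(rb"[a-zA-Z ]{6,}")

# Each match contains only ASCII letters and spaces, so decoding is safe.
found = [match.decode("ascii") for match in word.findall(filein)]
print(found)
```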

Upvotes: 0
