Wunter

Reputation: 59

Finding word context with regular expressions

I have written a function that finds the contexts of a given word (w) in a text; the left and right parameters control how many words to record on each side.

import re
def get_context (text, w, left, right):
    text.insert (0, "*START*")
    text.append ("*END*")

    all_contexts = []

    for i in range(len(text)):

        if re.match(w,text[i], 0):

            if i < left:
                context_left = text[:i]

            else:
                context_left = text[i-left:i]

            if len(text) < (i+right):
                context_right = text[i:]

            else: 
                context_right = text[i:(i+right+1)]

            context = context_left + context_right

            all_contexts.append(context)
    return all_contexts

So, for example, if I have a text in the form of a list like this:

text = ['Python', 'is', 'dynamically', 'typed', 'language', 'Python', 'functions', 'really', 'care', 'about', 'what', 'you', 'pass', 'to', 'them', 'but', 'you', 'got', 'it', 'the', 'wrong', 'way', 'if', 'you', 'want', 'to', 'pass', 'one', 'thousand', 'arguments', 'to', 'your', 'function', 'then', 'you', 'can', 'explicitly', 'define', 'every', 'parameter', 'in', 'your', 'function', 'definition', 'and', 'your', 'function', 'will', 'be', 'automagically', 'able', 'to', 'handle', 'all', 'the', 'arguments', 'you', 'pass', 'to', 'them', 'for', 'you']

The function works fine, for example:

get_context(text, "function",2,2)
[['language', 'Python', 'functions', 'really', 'care'], ['to', 'your', 'function', 'then', 'you'], ['in', 'your', 'function', 'definition', 'and'], ['and', 'your', 'function', 'will', 'be']]

Now I am trying to build a dictionary with the contexts of every word in the text, as follows:

d = {}
for w in set(text):
    d[w] = get_context(text,w,2,2)

But I am getting this error.

Traceback (most recent call last):
  File "<pyshell#32>", line 2, in <module>
    d[w] = get_context(text,w,2,2)
  File "<pyshell#20>", line 9, in get_context
    if re.match(w,text[i], 0):
  File "/usr/lib/python3.4/re.py", line 160, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.4/re.py", line 294, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.4/sre_compile.py", line 568, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.4/sre_parse.py", line 760, in parse
    p = _parse_sub(source, pattern, 0)
  File "/usr/lib/python3.4/sre_parse.py", line 370, in _parse_sub
    itemsappend(_parse(source, state))
  File "/usr/lib/python3.4/sre_parse.py", line 579, in _parse
    raise error("nothing to repeat")
sre_constants.error: nothing to repeat

I don't understand this error. Can anyone help me with this?

Upvotes: 0

Views: 1295

Answers (3)

C Panda

Reputation: 3405

The whole thing can be rewritten very succinctly, as follows:

text = 'Python is dynamically typed language Python functions really care about what you pass to them but you got it the wrong way if you want to pass one thousand arguments to your function then you can explicitly define every parameter in your function definition and your function will be automagically able to handle all the arguments you pass to them for you'

Keeping it a str, and assuming the context word is 'function':

import re

pat = re.compile(r'(\w+\s\w+\s)functions?(?=(\s\w+\s\w+))')
pat.findall(text)
[('language Python ', ' really care'),
 ('to your ', ' then you'),
 ('in your ', ' definition and'),
 ('and your ', ' will be')]

Now, a minor customization of the regex will be needed to allow for words such as functional or functioning, not only function or functions. But the important idea is to do away with indexing and go more functional.
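One possible customization, as a rough sketch: build the pattern from a word stem and allow any word-character suffix after it. The helper name `contexts` and the sample sentence are made up for illustration; `re.escape` is used so that a stem containing regex metacharacters is still matched literally.

```python
import re

def contexts(text, stem):
    # Hypothetical helper: match any word that starts with `stem`
    # (function, functions, functional, functioning, ...).
    pat = re.compile(
        r'(\w+\s\w+\s)' + re.escape(stem) + r'\w*(?=(\s\w+\s\w+))'
    )
    # Each result is a (two words before, two words after) tuple.
    return pat.findall(text)

sample = "we call your functional code and your functions will run"
result = contexts(sample, "function")
print(result)
```

Like the pattern above, this only finds occurrences that have two words on each side, so a match at the very start or end of the text is skipped.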

Please comment if this doesn't work out for you when you apply it in bulk.

Upvotes: 1

malbarbo

Reputation: 11177

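The problem is that "*START*" and "*END*" are being interpreted as regular expressions: both end up in set(text), and a pattern that begins with * has nothing for the quantifier to repeat. Also, note that inserting "*START*" and "*END*" into text at the beginning of the function causes a problem: every call adds another pair of markers to the list. You should do it just once, outside the function.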
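The regex failure can be reproduced in isolation. This is a minimal sketch, assuming Python 3; the variable names are illustrative. It also shows that `re.escape` makes the marker match literally, in case the asterisks need to stay.

```python
import re

token = "*START*"

# A leading "*" is a quantifier with nothing before it,
# so compiling the pattern fails.
try:
    re.match(token, token)
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)

# re.escape backslash-escapes the metacharacters,
# so the token is matched as a literal string.
literal_match = re.match(re.escape(token), token)

print(compiled, literal_match is not None)
```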

Here is a complete version of the working code:

import re

def get_context(text, w, left, right):
    all_contexts = []
    for i in range(len(text)):
        if re.match(w,text[i], 0):
            if i < left:
                context_left = text[:i]
            else:
                context_left = text[i-left:i]
            if len(text) < (i+right):
                context_right = text[i:]
            else:
                context_right = text[i:(i+right+1)]
            context = context_left + context_right
            all_contexts.append(context)
    return all_contexts

text = ['Python', 'is', 'dynamically', 'typed', 'language',
        'Python', 'functions', 'really', 'care', 'about', 'what',
        'you', 'pass', 'to', 'them', 'but', 'you', 'got', 'it', 'the',
        'wrong', 'way', 'if', 'you', 'want', 'to', 'pass', 'one',
        'thousand', 'arguments', 'to', 'your', 'function', 'then',
        'you', 'can', 'explicitly', 'define', 'every', 'parameter',
        'in', 'your', 'function', 'definition', 'and', 'your',
        'function', 'will', 'be', 'automagically', 'able', 'to', 'handle',
        'all', 'the', 'arguments', 'you', 'pass', 'to', 'them', 'for', 'you']

text.insert(0, "START")
text.append("END")

d = {}
for w in set(text):
    d[w] = get_context(text,w,2,2)

Maybe you can replace re.match(w,text[i], 0) with w == text[i].
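If equality is enough, the dictionary can be built in a single pass without any regex. The following is only a sketch under that assumption: `build_contexts` is a hypothetical helper, it does not add the START/END markers, and it does not modify the input list. Unlike re.match, equality means "function" no longer picks up "functions".

```python
from collections import defaultdict

def build_contexts(tokens, left, right):
    # Hypothetical helper: collect the context of every word in one pass,
    # comparing by equality (each word is simply its own dictionary key).
    contexts = defaultdict(list)
    for i, word in enumerate(tokens):
        start = max(0, i - left)  # clamp so the slice never wraps around
        # Slicing past the end of a list is safe, so no upper-bound check.
        contexts[word].append(tokens[start:i + right + 1])
    return dict(contexts)

tokens = ['your', 'function', 'then', 'you',
          'define', 'your', 'function', 'definition']
d = build_contexts(tokens, 2, 2)
print(d['function'])
```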

Upvotes: 2

L3viathan

Reputation: 27283

At least one of the elements in text contains characters that are special in a regular expression. If you're just trying to check whether the string starts with the word, use str.startswith, i.e.:

if text[i].startswith(w):  # instead of re.match(w,text[i], 0):

But I don't understand why you are checking for that anyway, and not for equality.
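To make the difference concrete, here is a small comparison on made-up words. For a pattern with no special characters, re.match behaves like a prefix test, because it only anchors at the start of the string.

```python
import re

words = ["function", "functions", "dysfunction"]
w = "function"

# re.match anchors at the start only, so it acts as a prefix test here.
by_regex = [s for s in words if re.match(w, s)]
by_prefix = [s for s in words if s.startswith(w)]
# Equality accepts the exact word and nothing else.
by_equality = [s for s in words if s == w]

print(by_regex, by_prefix, by_equality)
```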

Upvotes: 0
