Dhruv Ghulati

Reputation: 3026

Creating bigrams from a string using regex

I have a string like:

"[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

It is taken from an Excel file. It looks like a list, but because it is read from a file it is just a string.

What I need to do is:

a) Remove the [ ]

b) Split the string by , and thus actually create a new list

c) Take the first string only i.e. u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'

d) Create bigrams of the resulting string and output them as a single string split by spaces (not as a list of bigram tuples):

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to *extend*to~prepc_according_to+expectancy~-nsubj expectancy~-nsubj+is~parataxis is~parataxis+NUMBER~nsubj NUMBER~nsubj+NUMBER_SLOT

Here is the code I have been playing around with:

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub('^\[(.*)\]',"\1",text)
text = [text.split(",")[0]]
bigrams = [b for l in text for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
bigrams = ('').join(bigrams)

My regex, though, seems to return nothing.

Upvotes: 0

Views: 1042

Answers (2)

Laurent LAPORTE

Reputation: 22952

Your string looks like a Python list of unicode strings, right?

You can evaluate it to get a list of unicode strings. A good way to do that is to use the ast.literal_eval function from the ast module.

Simply write:

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'," \
       " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

import ast

lines = ast.literal_eval(text)

The result is the list of unicode strings:

for line in lines:
    print(line)

You'll get:

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT

To compute the bigrams:

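# Pair each '+'-separated token with its successor, for every line.
bigrams = [b for l in lines for b in zip(l.split("+")[:-1], l.split("+")[1:])]
# Rejoin each pair with '+', then join all bigrams with spaces.
bigrams = ["+".join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = ' '.join(map(str, bigrams))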
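
If only the first string is needed, as the question asks, the same idea can be sketched for Python 3 roughly as follows. The helper name first_string_bigrams is hypothetical, and the sketch assumes the cell always contains a valid, non-empty Python list literal:

```python
import ast


def first_string_bigrams(cell_text):
    """Return space-separated bigrams of the '+'-delimited tokens in the first list item."""
    # Parse the list literal safely; literal_eval does not execute arbitrary code.
    items = ast.literal_eval(cell_text)
    tokens = items[0].split("+")
    # Pair each token with its successor, then rejoin each pair with '+'.
    pairs = zip(tokens, tokens[1:])
    return " ".join("+".join(pair) for pair in pairs)


text = (
    "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT',"
    " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
)
result = first_string_bigrams(text)
print(result)
```

Using zip(tokens, tokens[1:]) avoids index arithmetic, so the last pair cannot be dropped by an off-by-one range.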

Upvotes: 1

Dhruv Ghulati

Reputation: 3026

I have solved this. It takes two regex passes: the first strips the brackets, then I take the first string, and the second pass removes the quotes:

import re

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub(r'\[u|\]', "", text)   # strip "[u" and "]"
text = text.split(",")[0]            # keep only the first string
text = re.sub(r'\'', "", text)       # remove the quotes
text = text.split("+")
# len(text) - 1 start positions, so the final pair is included (Python 2: xrange)
bigrams = [text[i:i+2] for i in xrange(len(text) - 1)]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
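
A side note on why the original substitution appeared to return nothing: the replacement "\1" was written without the r prefix, so Python reads it as the control character \x01 rather than as a backreference to group 1. A minimal sketch, using a shortened, made-up sample string in the same shape as the Excel cell:

```python
import re

# Shortened, hypothetical sample in the same shape as the Excel cell.
sample = "[u'a+b+c', u'd+e+f']"

# Non-raw "\1" is the single character chr(1), so the whole match is
# replaced by one unprintable character.
broken = re.sub(r'^\[(.*)\]', "\1", sample)

# Raw r"\1" is a backreference to group 1, the text between the brackets.
fixed = re.sub(r'^\[(.*)\]', r"\1", sample)

print(repr(broken))  # '\x01'
print(fixed)         # u'a+b+c', u'd+e+f'
```

With the raw-string replacement, a single pass is enough to drop the brackets; the split on "," and the quote removal are still needed afterwards.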

Upvotes: 0
