Reputation: 3026
I have a string like:
"[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
It is taken from an Excel file. It looks like a list, but because it was read from a file it is just a string.
What I need to do is:
a) Remove the [ ]
b) Split the string by ,
and thus actually create a new list
c) Take the first string only i.e. u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'
d) Create bigrams from the resulting string and join them into a single string separated by spaces (not a list of bigrams):
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to *extend*to~prepc_according_to+expectancy~-nsubj expectancy~-nsubj+is~parataxis is~parataxis+NUMBER~nsubj NUMBER~nsubj+NUMBER_SLOT
Here is the code I have been experimenting with:
text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub('^\[(.*)\]',"\1",text)
text = [text.split(",")[0]]
bigrams = [b for l in text for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
bigrams = ('').join(bigrams)
My regex, though, seems to return nothing.
Upvotes: 0
Views: 1042
Reputation: 22952
Your string looks like a Python list of unicode strings, right?
You can evaluate it to get a list of unicode strings. A good way to do that is to use the ast.literal_eval function from the ast module.
Simply write:
text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'," \
" u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
import ast
lines = ast.literal_eval(text)
The result is the list of unicode strings:
for line in lines:
    print(line)
You'll get:
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
To compute the bigrams:
bigrams = [b for l in lines for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = ["+".join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = ' '.join(map(str, bigrams))  # already one space-separated string, no further join needed
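On Python 3, `.encode('utf-8')` returns bytes, so passing the result through `str` would leave `b'...'` wrappers in the output. Here is a minimal sketch of the whole pipeline that avoids the encode step and keeps only the first list element, as the question asks. The function name `first_item_bigrams` and its `sep` parameter are made up for illustration.

```python
import ast


def first_item_bigrams(text, sep="+"):
    """Parse a list literal, take its first string, and return
    its bigrams as one space-separated string."""
    items = ast.literal_eval(text)        # safely turns the string into a real list
    tokens = items[0].split(sep)          # first string only, split into tokens
    pairs = zip(tokens, tokens[1:])       # consecutive token pairs
    return " ".join(sep.join(pair) for pair in pairs)


item = ("LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to"
        "+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT")
text = "[u'{0}', u'{0}']".format(item)    # same shape as the string in the question

result = first_item_bigrams(text)
print(result)
```

Six tokens give five bigrams, which matches the output requested in the question.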
Upvotes: 1
Reputation: 3026
I have solved this. The regex needs two passes: the first removes the brackets (and the leading u), then I take the first string, and the second pass removes the quotes:
import re
text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub(r'\[u|\]',"",text)
text = text.split(",")[0]
text = re.sub(r'\'',"",text)
text = text.split("+")
bigrams = [text[i:i+2] for i in xrange(len(text)-1)]  # len-1 windows, so the last pair is included
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
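As for why the regex in the question seemed to return nothing: in a normal (non-raw) string literal, "\1" is the control character \x01, not a backreference, so the whole match is replaced by one invisible character. Writing the replacement as a raw string makes it refer to group 1. A short sketch on a shortened, made-up stand-in for the real string:

```python
import re

# Shortened, hypothetical stand-in for the string from the Excel file.
text = "[u'a+b', u'a+b']"

# "\1" without the r prefix is chr(1), so the match is replaced by \x01.
broken = re.sub(r'^\[(.*)\]', "\1", text)

# r"\1" is a backreference to the first capture group.
fixed = re.sub(r'^\[(.*)\]', r"\1", text)

print(repr(broken))  # '\x01'
print(repr(fixed))   # "u'a+b', u'a+b'"
```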
Upvotes: 0