Reputation: 111
I have a list of candidate bilingual terms extracted from a parallel corpus, in this format:
Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression
What I want is to pair the items from both languages as bigrams: every multi-word term in the source language (Italian) should be paired with every multi-word term in the target language (English). For the example above, the bigrams would look like this:
('Difensori dei diritti umani','Human rights defenders')
('Difensori dei diritti umani','freedom of expression')
('libertà di espressione','Human rights defenders')
('libertà di espressione','freedom of expression')
Can someone help?
Upvotes: 0
Views: 156
Reputation: 1097
This takes a bit of wrangling. If you only want translation-based tuples, as in your example, you could use the following function:
# -*- coding: utf-8 -*-
def zipping(string):
    string = string.replace(', ', ',')  # remove the stray space after each comma
    string = string.split(" >>> ")
    # pair the i-th source term with the i-th target term
    trans_tuples = list(zip(string[0].split(','), string[1].split(',')))
    return trans_tuples

line = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
for bigram in zipping(line):
    print(bigram)
Output will be:
('Difensori dei diritti umani', 'Human rights defenders')
('libertà di espressione', 'freedom of expression')
If you need to associate both terms on one side with both terms on the other side (for context purposes, I suppose), just adjust the zipping function as follows:
# -*- coding: utf-8 -*-
def zipping(string):
    string = string.replace(', ', ',')
    string = string.split(" >>> ")
    # list() is needed in Python 3, where zip() returns an iterator without append()
    trans_tuples = list(zip(string[0].split(','), string[1].split(',')))
    trans_tuples.append((trans_tuples[0][0], trans_tuples[1][1]))  # new line 1
    trans_tuples.append((trans_tuples[1][0], trans_tuples[0][1]))  # new line 2
    return trans_tuples

line = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
for bigram in zipping(line):
    print(bigram)
In that case, the output would be as follows:
('Difensori dei diritti umani', 'Human rights defenders')
('libertà di espressione', 'freedom of expression')
('Difensori dei diritti umani', 'freedom of expression')
('libertà di espressione', 'Human rights defenders')
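The two appended lines above are hard-coded for exactly two terms per side. If a line can hold more terms, one option is itertools.product, which builds every source/target combination. This is only a sketch: the name cross_pairs and the sep parameter are my own, and it assumes terms are comma-separated and never contain a comma themselves.

```python
from itertools import product


def cross_pairs(line, sep=" >>> "):
    """Pair every source term with every target term (hypothetical helper)."""
    source, target = line.split(sep)
    # assumption: commas only ever separate terms
    source_terms = [term.strip() for term in source.split(",")]
    target_terms = [term.strip() for term in target.split(",")]
    # product() walks the source terms in order, pairing each with all targets
    return list(product(source_terms, target_terms))


line = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
for pair in cross_pairs(line):
    print(pair)
```

The pairs come out grouped by source term, which is the order shown in the question.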
Upvotes: 1
Reputation: 36
Is this what you are looking for?
s = "Difensori dei diritti umani, liberta di espressione >>> Human rights defenders, freedom of expression"
bigrams = []
trans = s.split(' >>> ')
for it in trans[0].split(', '):
    for en in trans[1].split(', '):
        bigrams.append((it, en))
        print((it, en))
It produces this output:
('Difensori dei diritti umani', 'Human rights defenders')
('Difensori dei diritti umani', 'freedom of expression')
('liberta di espressione', 'Human rights defenders')
('liberta di espressione', 'freedom of expression')
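Since the question mentions a whole list of candidates, the same nested loop could be wrapped in a generator that runs over many lines. This is a rough sketch, not tested against a real corpus: pairs_from_lines and the second sample line are made up for illustration, and skipping lines that lack the separator is my assumption about how to treat malformed input.

```python
def pairs_from_lines(lines, sep=" >>> "):
    """Yield (italian, english) pairs for each candidate line (hypothetical helper)."""
    for line in lines:
        left, found, right = line.strip().partition(sep)
        if not found:
            continue  # assumption: lines without the separator are skipped
        for it in left.split(", "):
            for en in right.split(", "):
                yield (it, en)


candidates = [
    "Difensori dei diritti umani, liberta di espressione >>> Human rights defenders, freedom of expression",
    "a line without the separator",  # made-up malformed line
]
bigrams = list(pairs_from_lines(candidates))
for bigram in bigrams:
    print(bigram)
```

Reading the candidates from a file would just mean passing the open file object instead of the list.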
Upvotes: 1
Reputation: 1319
My solution:
line = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
# b[0] holds the Italian terms, b[1] the English terms
b = [elem.split(", ") for elem in line.split(" >>> ")]
bigrams = list(zip(b[0], b[1]))
# reversing one side gives the crossed pairs (covers the two-term case)
bigrams_ = list(zip(reversed(b[0]), b[1]))
bigrams = bigrams + bigrams_
for bigram in bigrams:
    print(bigram)
Output:
('Difensori dei diritti umani', 'Human rights defenders')
('libertà di espressione', 'freedom of expression')
('libertà di espressione', 'Human rights defenders')
('Difensori dei diritti umani', 'freedom of expression')
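One caveat worth checking: zip plus reversed covers every combination only when each side has exactly two terms. The quick comparison below uses made-up placeholder terms and two helper names of my own (zip_and_reversed, all_pairs) to count the distinct pairs each approach finds on a three-term line.

```python
def zip_and_reversed(line):
    """The zip + reversed approach, wrapped in a function for comparison."""
    b = [elem.split(", ") for elem in line.split(" >>> ")]
    return list(zip(b[0], b[1])) + list(zip(reversed(b[0]), b[1]))


def all_pairs(line):
    """Nested comprehension: every source term with every target term."""
    src, tgt = [elem.split(", ") for elem in line.split(" >>> ")]
    return [(s, t) for s in src for t in tgt]


three = "a, b, c >>> x, y, z"  # placeholder terms, for illustration only
print(len(set(zip_and_reversed(three))))  # 5 distinct pairs
print(len(all_pairs(three)))              # 9 pairs in the full combination
```

With three terms per side the middle pair is produced twice and four combinations are missed, so the nested version is the safer choice if term counts vary.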
Upvotes: 0