fff

Reputation: 111

Creation of bigrams in Python

I have a list of candidate bilingual terms extracted from a parallel corpus, in this format:

Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression

I want to pair the terms across the two languages: every multi-word term in the source language (Italian) should be paired, as a bigram, with every multi-word term in the target language (English). For the example above, the bigrams would look like this:

('Difensori dei diritti umani','Human rights defenders')
('Difensori dei diritti umani','freedom of expression')
('libertà di espressione','Human rights defenders')
('libertà di espressione','freedom of expression')

Can someone help?

Upvotes: 0

Views: 156

Answers (3)

Philippe Oger

Reputation: 1097

You need a bit of wrangling to get what you need. If you only want translation-based tuples, as in your example, you could use the following function:

# -*- coding: utf-8 -*-

def zipping(string):
    string = string.replace(', ', ',')   # remove stray spaces after commas
    source, target = string.split(" >>> ")
    # pair the terms position by position (1st with 1st, 2nd with 2nd)
    return list(zip(source.split(','), target.split(',')))

text = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
for bigram in zipping(text):
    print(bigram)

Output will be:

('Difensori dei diritti umani', 'Human rights defenders')
('libertà di espressione', 'freedom of expression')

If you need to associate each term on one side with both terms on the other side (for context purposes, I suppose), just adjust the zipping function as follows:

# -*- coding: utf-8 -*-

def zipping(string):
    string = string.replace(', ', ',')
    source, target = string.split(" >>> ")
    # list() is needed on Python 3, where zip() returns an iterator
    trans_tuples = list(zip(source.split(','), target.split(',')))
    # add the crossed pairs; this assumes exactly two terms on each side
    trans_tuples.append((trans_tuples[0][0], trans_tuples[1][1]))  # new line 1
    trans_tuples.append((trans_tuples[1][0], trans_tuples[0][1]))  # new line 2
    return trans_tuples

text = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
for bigram in zipping(text):
    print(bigram)

In that case, the output would be as follows:

('Difensori dei diritti umani', 'Human rights defenders')
('libertà di espressione', 'freedom of expression')
('Difensori dei diritti umani', 'freedom of expression')
('libertà di espressione', 'Human rights defenders')

Upvotes: 1

Erwan Vasseure

Reputation: 36

Is this what you are looking for?

s = "Difensori dei diritti umani, liberta di espressione >>> Human rights defenders, freedom of expression"

bigrams = []
trans = s.split(' >>> ')
for it in trans[0].split(', '):
    for en in trans[1].split(', '):
        bigrams.append((it, en))
        print((it, en))

It produces this output:

('Difensori dei diritti umani', 'Human rights defenders')
('Difensori dei diritti umani', 'freedom of expression')
('liberta di espressione', 'Human rights defenders')
('liberta di espressione', 'freedom of expression')
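The nested loops compute a Cartesian product of the two term lists, so the same result can probably be obtained with `itertools.product` from the standard library. This is only a sketch: the function name `cross_pairs` and its `sep` parameter are made up for illustration, and it assumes each line contains the `>>>` separator exactly once.

```python
from itertools import product

def cross_pairs(line, sep=" >>> "):
    """Pair every source term with every target term in one line."""
    # Assumes the separator occurs exactly once in the line.
    source, target = line.split(sep)
    # Split on commas and strip stray whitespace around each term.
    source_terms = [term.strip() for term in source.split(",")]
    target_terms = [term.strip() for term in target.split(",")]
    return list(product(source_terms, target_terms))

s = "Difensori dei diritti umani, liberta di espressione >>> Human rights defenders, freedom of expression"
for pair in cross_pairs(s):
    print(pair)
```

Unlike index-based approaches, this does not depend on how many terms appear on either side.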

Upvotes: 1

ilyakhov

Reputation: 1319

My solution:

text = "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression"
sides = [side.split(", ") for side in text.split(" >>> ")]
# straight pairs (1st with 1st, 2nd with 2nd)
bigrams = list(zip(sides[0], sides[1]))
# crossed pairs, obtained by reversing the source side
bigrams += list(zip(reversed(sides[0]), sides[1]))
for bigram in bigrams:
    print(bigram)

Output:

('Difensori dei diritti umani', 'Human rights defenders')
('libertà di espressione', 'freedom of expression')
('libertà di espressione', 'Human rights defenders')
('Difensori dei diritti umani', 'freedom of expression')
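Note that zipping the straight and the reversed source list gives a complete, duplicate-free set of pairs only when each side has exactly two terms. For longer term lists, or for a whole list of such lines as mentioned in the question, something along these lines may be more robust. It is a rough sketch: the function name `pairs_from_lines` and the second sample line are invented for illustration, and skipping lines without `>>>` is an assumption about how malformed input should be handled.

```python
def pairs_from_lines(lines, sep=">>>"):
    """Collect every (source, target) pair from an iterable of lines."""
    pairs = []
    for line in lines:
        if sep not in line:
            continue  # assumption: silently skip lines without the separator
        source, target = line.split(sep, 1)
        # Strip whitespace and drop empty terms (e.g. from a trailing comma).
        source_terms = [t.strip() for t in source.split(",") if t.strip()]
        target_terms = [t.strip() for t in target.split(",") if t.strip()]
        # Pair each source term with each target term.
        pairs.extend((s, t) for s in source_terms for t in target_terms)
    return pairs

lines = [
    "Difensori dei diritti umani, libertà di espressione >>> Human rights defenders, freedom of expression",
    "diritti umani >>> human rights",  # hypothetical extra line
]
for pair in pairs_from_lines(lines):
    print(pair)
```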

Upvotes: 0
