Reputation: 39
I have a nested list:
output= [('the', 'B', 'NNP'), ('wall', 'I', 'NNP'), ('street', 'I', 'NNP'), ('journal', 'I', 'NNP'), ('reported', 'O', 'VB'), ('today', 'O', 'NNP'), ('that', 'O', 'NNP'), ('apple', 'B', 'NNP'), ('corporation', 'I', 'NNP'), ('made', 'O', 'VB'), ('money', 'O', 'NNP'), ('.', 'O', '.'), ('georgia', 'B', 'NNP'), ('tech', 'I', 'NNP'), ('is', 'O', 'NNP'), ('a', 'O', '.'), ('university', 'O', 'NNP'), ('in', 'O', 'NNP'), ('georgia', 'B', 'NNP'),('.', 'O', '.')]
I want to re-format it to the following expected format:
new_output= [(['the', 'wall', 'street', 'journal', 'reported', 'today', 'that', 'apple', 'corporation', 'made', 'money'], ['B', 'I', 'I', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O']), (['georgia', 'tech', 'is', 'a', 'university', 'in', 'georgia'], ['B', 'I', 'O', 'O', 'O', 'O', 'B'])]
My attempt is:
import string
word = []
token = []
result_word = []
result_token = []
result = []
for i in output[0]:
    for every_word in i:
        word.append(every_word)
        result_word = " ".join(" ".join(word).split()[::3])
How can I get my expected format?
Upvotes: 0
Views: 91
Reputation: 4375
output = [('the', 'B', 'NNP'), ('wall', 'I', 'NNP'), ('street', 'I', 'NNP'), ('journal', 'I', 'NNP'), ('reported', 'O', 'VB'), ('today', 'O', 'NNP'), ('that', 'O', 'NNP'), ('apple', 'B', 'NNP'), ('corporation', 'I', 'NNP'), ('made', 'O', 'VB'), ('money', 'O', 'NNP'), ('.', 'O', '.'), ('georgia', 'B', 'NNP'), ('tech', 'I', 'NNP'), ('is', 'O', 'NNP'), ('a', 'O', '.'), ('university', 'O', 'NNP'), ('in', 'O', 'NNP'), ('georgia', 'B', 'NNP'),('.', 'O', '.')]
result, words, tokens = [], [], []
for word, token, _ in output:  # each item is a tuple like ('the', 'B', 'NNP')
    if word == '.':  # end of sentence: save the current one and start a new one
        result.append((words, tokens))
        words, tokens = [], []
    else:  # add the word and its tag to the current sentence
        words.append(word)
        tokens.append(token)
print(result)
Output:
[(['the', 'wall', 'street', 'journal', 'reported', 'today', 'that', 'apple', 'corporation', 'made', 'money'], ['B', 'I', 'I', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O']), (['georgia', 'tech', 'is', 'a', 'university', 'in', 'georgia'], ['B', 'I', 'O', 'O', 'O', 'O', 'B'])]
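One caveat with the loop above: a sentence is only saved when a '.' is seen, so if the input does not end with a period, the last sentence is silently dropped. Below is a minimal sketch that flushes the remainder after the loop. The function name split_sentences, the end_token parameter and the shortened sample data are assumptions for illustration, not part of the original answer. Unlike the loop above, it also skips empty sentences (for example, two periods in a row).

```python
def split_sentences(tagged, end_token="."):
    """Group (word, tag, pos) triples into one (words, tags) pair per sentence."""
    result, words, tags = [], [], []
    for word, tag, _ in tagged:
        if word == end_token:
            if words:  # skip empty sentences, e.g. two periods in a row
                result.append((words, tags))
            words, tags = [], []
        else:
            words.append(word)
            tags.append(tag)
    if words:  # flush a trailing sentence that has no final period
        result.append((words, tags))
    return result


# Shortened sample; the second sentence deliberately lacks a final period.
sample = [("apple", "B", "NNP"), ("made", "O", "VB"), (".", "O", "."),
          ("georgia", "B", "NNP"), ("tech", "I", "NNP")]
print(split_sentences(sample))
```

With the original loop, the second sentence of this sample would be missing from the result.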
Upvotes: 1
Reputation: 61910
You could do something like this:
from itertools import groupby
from operator import itemgetter
output = [('the', 'B', 'NNP'), ('wall', 'I', 'NNP'), ('street', 'I', 'NNP'), ('journal', 'I', 'NNP'),
          ('reported', 'O', 'VB'), ('today', 'O', 'NNP'), ('that', 'O', 'NNP'), ('apple', 'B', 'NNP'),
          ('corporation', 'I', 'NNP'), ('made', 'O', 'VB'), ('money', 'O', 'NNP'), ('.', 'O', '.'),
          ('georgia', 'B', 'NNP'), ('tech', 'I', 'NNP'), ('is', 'O', 'NNP'), ('a', 'O', '.'),
          ('university', 'O', 'NNP'), ('in', 'O', 'NNP'), ('georgia', 'B', 'NNP'), ('.', 'O', '.')]
sentences = [list(group) for k, group in groupby(output, lambda x: x[0] == ".") if not k]
result = [tuple(map(list, zip(*map(itemgetter(0, 1), sentence)))) for sentence in sentences]
print(result)
Output
[(['the', 'wall', 'street', 'journal', 'reported', 'today', 'that', 'apple', 'corporation', 'made', 'money'], ['B', 'I', 'I', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O']), (['georgia', 'tech', 'is', 'a', 'university', 'in', 'georgia'], ['B', 'I', 'O', 'O', 'O', 'O', 'B'])]
Explanation
As far as I understand, you want to extract the first two elements (the word and its tag) from each tuple, grouped by sentence.
The line:
sentences = [list(group) for k, group in groupby(output, lambda x: x[0] == ".") if not k]
splits output into sentences at each '.' item. The second line then unpacks each sentence:
result = [tuple(map(list, zip(*map(itemgetter(0, 1), sentence)))) for sentence in sentences]
Since you want a list of tuples of lists, and zip yields tuples, each of those tuples has to be mapped through list, and the result of map is then converted to a tuple.
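The same two steps can be written out more verbosely, which may make the nesting easier to follow. This is only a sketch on shortened sample data; the names sample, is_period, pairs, words and tags are chosen here for illustration.

```python
from itertools import groupby
from operator import itemgetter

sample = [("apple", "B", "NNP"), ("made", "O", "VB"), (".", "O", "."),
          ("georgia", "B", "NNP"), ("tech", "I", "NNP"), (".", "O", ".")]

# Step 1: groupby yields runs of consecutive items that share a key.
# Keep only the runs whose key is False, i.e. the non-period items.
sentences = [list(group)
             for is_period, group in groupby(sample, lambda x: x[0] == ".")
             if not is_period]

# Step 2: for each sentence, keep (word, tag) and transpose into two sequences.
result = []
for sentence in sentences:
    pairs = map(itemgetter(0, 1), sentence)   # drop the third field
    words, tags = zip(*pairs)                 # two tuples: all words, all tags
    result.append((list(words), list(tags)))  # convert the tuples to lists

print(result)
```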
Upvotes: 2
Reputation: 61014
You can use groupby to group the non-period items into sentences, then use zip to split each sentence into its words and its B/I/O tags:
from itertools import groupby
l = output= [('the', 'B', 'NNP'), ('wall', 'I', 'NNP'), ('street', 'I', 'NNP'), ('journal', 'I', 'NNP'), ('reported', 'O', 'VB'), ('today', 'O', 'NNP'), ('that', 'O', 'NNP'), ('apple', 'B', 'NNP'), ('corporation', 'I', 'NNP'), ('made', 'O', 'VB'), ('money', 'O', 'NNP'), ('.', 'O', '.'), ('georgia', 'B', 'NNP'), ('tech', 'I', 'NNP'), ('is', 'O', 'NNP'), ('a', 'O', '.'), ('university', 'O', 'NNP'), ('in', 'O', 'NNP'), ('georgia', 'B', 'NNP'),('.', 'O', '.')]
groups = (g for k, g in groupby(l, lambda x: x[0] != '.') if k)
zs = (zip(*g) for g in groups)
res = [(next(z), next(z)) for z in zs]
res is then
[(('the', 'wall', 'street', 'journal', 'reported', 'today', 'that', 'apple', 'corporation', 'made', 'money'),
('B', 'I', 'I', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O')),
(('georgia', 'tech', 'is', 'a', 'university', 'in', 'georgia'),
('B', 'I', 'O', 'O', 'O', 'O', 'B'))
]
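Note that res holds tuples of words and tags, not the lists shown in the question. If lists are required, a small variation converts each one. This is a sketch on shortened sample data, not the answerer's code; the names sample, words, tags and _pos are chosen here for illustration.

```python
from itertools import groupby

sample = [("apple", "B", "NNP"), ("made", "O", "VB"), (".", "O", "."),
          ("georgia", "B", "NNP"), ("tech", "I", "NNP"), (".", "O", ".")]

# Keep only the runs of non-period items, one run per sentence.
groups = (g for k, g in groupby(sample, lambda x: x[0] != ".") if k)

res = []
for g in groups:
    # zip(*g) transposes the triples into three tuples: words, B/I/O tags, POS tags.
    words, tags, _pos = zip(*g)
    res.append((list(words), list(tags)))

print(res)
```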
Upvotes: 2