simitar
simitar

Reputation: 3

matching results in a list of lists

I have a list with the following structure;

[('0','927','928'),('2','693','694'),('2','742','743'),('2','776','777'),('2','804','805'),
('2','987','988'),('2','997','998'),('2','1019','1020'),
('2','1038','1039'),('2','1047','1048'),('2','1083','1084'),('2','659','660'),
('2','677','678'),('2','743','744'),('2','777','778'),('2','805','806'),('2','830','831')

the 1st number is an id, the second a position of a word and the third number is the position of a second word. What I need to do and am struggling with is finding sets of words next to each other.

These results are given for searches of 3 words, so there is the positions of word 1 with word 2 and positions of word 2 with word 3. For example ;

I run the phrase query "women in science" I then get the values given in the list above, so ('2','776','777') is the results for 'women in' and ('2','777','778') is the results for 'in science'.

I need to find a way to match these results up, so for every document it groups the words together depending on amounts of word in the query. (so if there is 4 words in the query there will be 3 results that need to be matched together).

Is this possible?

Upvotes: 0

Views: 100

Answers (2)

Blackbeard
Blackbeard

Reputation: 355

The following will do what you're asking, as I understand it - it's not the prettiest output in the world, and I think that if possible you should be using numbers if numbers are what you're trying to work with. There are probably more elegant solutions, and simplifications that could be made to this:

positions = [('0','927','928'),('2','693','694'),('2','742','743'),('2','776','777'),('2','804','805'),
('2','987','988'),('2','997','998'),('2','1019','1020'),
('2','1038','1039'),('2','1047','1048'),('2','1083','1084'),('2','659','660'),
('2','677','678'),('2','743','744'),('2','777','778'),('2','805','806'),('2','830','831')]

sorted_dict = {}
sorted_list = []
grouped_list = []
doc_ids = []

def sort_func(positions):
    for item in positions:
        if item[0] not in doc_ids:
            doc_ids.append(item[0])
    for doc_id in doc_ids:
        sorted_set = []
        for item in positions:
            if item[0] != doc_id:
                continue
            else:
                if item[1] not in sorted_set:
                    sorted_set.append(item[1])
                if item[2] not in sorted_set:
                    sorted_set.append(item[2])
        sorted_list = sorted(sorted_set)
        sorted_dict[doc_id] = sorted_list
    for k in sorted_dict:
        group = []
        grouped_list = []
        for i in sorted_dict[k]:
            try:
                if int(i)-1 == int(sorted_dict[k][sorted_dict[k].index(i)-1]):
                    group.append(i)
                else:
                    if group != []:
                        grouped_list.append(group)
                    group = [i]
            except IndexError:
                group.append(i)
                continue
        if grouped_list != []:
            sorted_dict[k] = grouped_list
        else:
            sorted_dict[k] = group
    return sorted_dict

My output for the above was:

{'0': ['927', '928'], '2': [['1019', '1020'], ['1038', '1039'], ['1047', '1048'], ['1083', '1084'], ['659', '660'], ['677', '678'], ['693', '694'], ['742', '743', '744'], ['776', '777', '778'], ['804', '805', '806'], ['830', '831'], ['987', '988']]}

Upvotes: 1

9000
9000

Reputation: 40904

You need to quickly find word info by its position. Create a dictionary keyed by word position:

# from your example; I wonder why you use strings and not numbers.
positions = [('0','927','928'),('2','693','694'),('2','742','743'),('2','776','777'),('2','804','805'),
('2','987','988'),('2','997','998'),('2','1019','1020'),
('2','1038','1039'),('2','1047','1048'),('2','1083','1084'),('2','659','660'),
('2','677','678'),('2','743','744'),('2','777','778'),('2','805','806'),('2','830','831')]

# create the dictionary
dict_by_position = {w_pos:(w_id, w_next) for (w_id, w_pos, w_next) in positions}

Now it's a piece of cake to follow chains:

>>> dict_by_position['776']
('2', '777')
>>> dict_by_position['777']
('2', '778')

Or programmatically:

def followChain(start, position_dict):
  result = []
  scanner = start
  while scanner in position_dict:
    next_item = position_dict[scanner]
    result.append(next_item)
    unused_id, scanner = next_item  # unpack the (id, next_position)
  return result

>>> followChain('776', dict_by_position)
[('2', '777'), ('2', '778')]

Finding all chains that are not subchains of each other:

seen_items = set()
for start in dict_by_position:
  if start not in seen_items:
    chain = followChain(start, dict_by_position)
    seen_items.update(set(chain))  # mark all pieces of chain as seen
    print chain  # or do something reasonable instead

Upvotes: 1

Related Questions