Edward Coelho

Reputation: 235

Python algorithm to obtain a randomized negative dataset from a positive dataset

I have a file containing unique pairs of proteins, the positive dataset. Let's call it infile. Below there's an example of the infile content:

Q9VRA8  A1ZBB4
Q03043  Q9VX24
B6VQA0  Q7KML2

The entries are tab separated. The randomized dataset, let's call it outfile, must contain combinations of the individual proteins, in a way that they cannot match the content of the infile in any order. As an example, for the first line above, the randomized outfile cannot contain the following pairs:

Q9VRA8  A1ZBB4
A1ZBB4  Q9VRA8

Also, the generated negative dataset must contain exactly the same number of protein pairs as the positive dataset. To address this, I tried the following:

import itertools
import random

# Read the original file: count its lines and collect the unordered pairs
with open(infilename, 'rt') as infile:
    data = infile.readlines()
ltotal = len(data)
lwritten = 0
pairs = set(frozenset(line.split()) for line in data)

# Split the pairs into individual words and shuffle them
words = list(itertools.chain.from_iterable(pairs))
random.shuffle(words)

# Obtain pairs of words (zip replaces Python 2's itertools.izip)
with open(outfilename, 'wt') as outfile:
    for pair in zip(*[iter(words)] * 2):
        if frozenset(pair) not in pairs and lwritten != ltotal:
            outfile.write("%s\t%s\n" % pair)
            lwritten += 1

This works. However, the infile has a total of 856471 lines, while the number of protein pairs written to the outfile varies between runs, with a minimum of 713000.

How can I work around this so that the number of pairs generated is exactly the same as in the infile? Also, I couldn't address the reverse pair order issue. Any thoughts on both questions?

Thanks in advance.

Upvotes: 4

Views: 428

Answers (1)

Dave

Reputation: 8090

To veto pairs independent of order, I'd just put both orders into my list of pairs, i.e. I'd add both line.split() and line.split()[::-1] to the set of pairs.
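A minimal sketch of that idea follows. The function name build_veto_set is hypothetical, and the fields are converted to tuples because lists are not hashable and so cannot be stored in a set directly. Skipping lines that do not have exactly two fields is an assumption about how malformed input should be handled.

```python
def build_veto_set(lines):
    """Return a set holding every input pair in both orders."""
    veto = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: ignore blank or malformed lines
        veto.add(tuple(fields))        # original order
        veto.add(tuple(fields[::-1]))  # reversed order
    return veto


veto = build_veto_set(["Q9VRA8\tA1ZBB4\n", "Q03043\tQ9VX24\n"])
print(("A1ZBB4", "Q9VRA8") in veto)  # the reversed pair is vetoed too
```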

To generate more pairs, instead of iterating through the list of words, just pick random pairs (using random.choice maybe?) and then veto them based on the list of invalid pairs (you may also need to consider the case where you generate the pair "A1ZBB4 A1ZBB4" and act appropriately). You can keep doing this until you have as many pairs as you need. Since you need to ensure that the output contains only unique elements, the output items can be added to the veto list (or maintained as a separate veto list) as they are generated.
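The loop described above could look roughly like this. It is only a sketch: random_negative_pairs and its parameters are hypothetical names, the positive pairs are assumed to be already loaded as 2-tuples, and the up-front check is an addition that keeps the loop from running forever when fewer than n valid pairs exist.

```python
import random


def random_negative_pairs(positive_pairs, n, rng=random):
    """Draw n unique pairs that match no positive pair in either order."""
    veto = set()
    for a, b in positive_pairs:
        veto.add((a, b))
        veto.add((b, a))
    proteins = sorted({p for pair in positive_pairs for p in pair})

    # Guard (an addition, not part of the answer): make sure n pairs exist.
    k = len(proteins)
    used = len({frozenset(p) for p in positive_pairs if p[0] != p[1]})
    if n > k * (k - 1) // 2 - used:
        raise ValueError("not enough distinct negative pairs available")

    negatives = []
    while len(negatives) < n:
        a = rng.choice(proteins)
        b = rng.choice(proteins)
        if a == b or (a, b) in veto:
            continue  # self-pair, positive pair, or already generated
        negatives.append((a, b))
        veto.add((a, b))  # keep the output unique in both orders
        veto.add((b, a))
    return negatives


positives = [("Q9VRA8", "A1ZBB4"), ("Q03043", "Q9VX24"), ("B6VQA0", "Q7KML2")]
for a, b in random_negative_pairs(positives, len(positives)):
    print("%s\t%s" % (a, b))
```

When the set of proteins is large relative to the number of pairs, rejections should be rare, so the loop is expected to finish quickly.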

If you want to reduce the memory footprint, you could instead do the following:

  • pairs is the set of pairs to veto, but each pair is stored internally sorted, i.e. if you read "Q9VRA8 A1ZBB4" you store it as the pair "A1ZBB4, Q9VRA8".
  • generate random pairs as above and check whether the sorted version of each pair is in your veto list; if so, ignore it.
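As a sketch of this variant, each pair can be stored once under a canonical (sorted) key, which roughly halves the size of the veto set. The names canonical and sample_negatives_sorted are hypothetical, the optional seed parameter is an addition for reproducibility, and random.sample is used here in place of two random.choice calls so that a self-pair can never be drawn.

```python
import random


def canonical(a, b):
    """Return the pair as a tuple in sorted order."""
    return (a, b) if a <= b else (b, a)


def sample_negatives_sorted(positive_pairs, n, seed=None):
    """Draw n unique negative pairs, storing one sorted key per pair."""
    rng = random.Random(seed)
    veto = {canonical(a, b) for a, b in positive_pairs if a != b}
    proteins = sorted({p for pair in positive_pairs for p in pair})

    # Guard (an addition): stop early if n valid pairs cannot exist.
    k = len(proteins)
    if n > k * (k - 1) // 2 - len(veto):
        raise ValueError("not enough distinct negative pairs available")

    negatives = []
    while len(negatives) < n:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        key = canonical(a, b)
        if key in veto:
            continue  # positive pair or already generated
        veto.add(key)
        negatives.append((a, b))
    return negatives


positives = [("Q9VRA8", "A1ZBB4"), ("Q03043", "Q9VX24"), ("B6VQA0", "Q7KML2")]
print(sample_negatives_sorted(positives, 3, seed=0))
```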

Upvotes: 3
