BioInformatician
BioInformatician

Reputation: 97

Merge fields according to some key using Python

I have a table looking like this:

 ID   sam1   sam2  sam3  sam4  sam5
 11   Yes 
 11                 Yes
 11
 22                      Yes
 22
 22                             Yes

I want the result to be something like this:

 ID   sam1   sam2  sam3  sam4  sam5
 11   Yes    NA     Yes   NA    NA
 22   NA     NA     NA   Yes    Yes

How can I do that using python?

Upvotes: 0

Views: 56

Answers (1)

BlackJack
BlackJack

Reputation: 4679

Parsing that free text table resulted in the longest function in my attempt to process the data:

#!/usr/bin/env python
from __future__ import absolute_import, division, print_function
import csv
import re
import sys
from itertools import chain, groupby, islice, izip_longest
from operator import itemgetter

NOT_WHITESPACE_RE = re.compile(r'\S+')


def parse_lines(lines):
    lines = iter(lines)
    first_line = next(lines)
    column_start_indices = [
        match.start() for match in NOT_WHITESPACE_RE.finditer(first_line)
    ]
    slices = [
        slice(i, j)
        for i, j in izip_longest(
            column_start_indices, islice(column_start_indices, 1, None)
        )
    ]
    return (
        [line[s].strip() for s in slices] for line in chain([first_line], lines)
    )


def merge_rows(row_a, row_b):
    return [a or b for a, b in izip_longest(row_a, row_b, fillvalue='')]


def aggregate_rows(rows):
    return (
        reduce(merge_rows, group, [])
        for _, group in groupby(rows, itemgetter(0))
    )


def replace_empty(rows):
    return ([x or 'NA' for x in row] for row in rows)


def process_table(lines):
    return replace_empty(aggregate_rows(parse_lines(lines)))


def main():
    with open('test.txt') as lines:
        writer = csv.writer(sys.stdout, delimiter='\t')
        writer.writerows(process_table(lines))


if __name__ == '__main__':
    main()

With the test input this writes the following to the output file:

ID      sam1    sam2    sam3    sam4    sam5
11      Yes     NA      Yes     NA      NA
22      NA      NA      NA      Yes     Yes

Upvotes: 1

Related Questions