Reputation: 97
I have a table looking like this:
ID sam1 sam2 sam3 sam4 sam5
11 Yes
11 Yes
11
22 Yes
22
22 Yes
I want the result to be something like this:
ID sam1 sam2 sam3 sam4 sam5
11 Yes NA Yes NA NA
22 NA NA NA Yes Yes
How can I do that using python?
Upvotes: 0
Views: 56
Reputation: 4679
Parsing that free text table resulted in the longest function in my attempt to process the data:
#!/usr/bin/env python
from __future__ import absolute_import, division, print_function
import csv
import re
import sys
from itertools import chain, groupby, islice, izip_longest
from operator import itemgetter
NOT_WHITESPACE_RE = re.compile(r'\S+')
def parse_lines(lines):
lines = iter(lines)
first_line = next(lines)
column_start_indices = [
match.start() for match in NOT_WHITESPACE_RE.finditer(first_line)
]
slices = [
slice(i, j)
for i, j in izip_longest(
column_start_indices, islice(column_start_indices, 1, None)
)
]
return (
[line[s].strip() for s in slices] for line in chain([first_line], lines)
)
def merge_rows(row_a, row_b):
return [a or b for a, b in izip_longest(row_a, row_b, fillvalue='')]
def aggregate_rows(rows):
return (
reduce(merge_rows, group, [])
for _, group in groupby(rows, itemgetter(0))
)
def replace_empty(rows):
return ([x or 'NA' for x in row] for row in rows)
def process_table(lines):
return replace_empty(aggregate_rows(parse_lines(lines)))
def main():
with open('test.txt') as lines:
writer = csv.writer(sys.stdout, delimiter='\t')
writer.writerows(process_table(lines))
if __name__ == '__main__':
main()
With the test input this writes the following to the output file:
ID sam1 sam2 sam3 sam4 sam5
11 Yes NA Yes NA NA
22 NA NA NA Yes Yes
Upvotes: 1