Reputation: 1521
I have an input file like this:
structureId chainId resolution uniprotAcc structureMolecularWeight
101M A 2.07 P02185 18112.8
102L A 1.74 P00720 18926.61
103D A 7502.93
103D B 7502.93
103L A 1.9 P00720 19092.72
103M A 2.07 P02185 18093.78
104L A 2.8 P00720 37541.04
104L B 2.8 P00720 37541.04
104M A 1.71 P02185 18030.63
104M A 3.1 P09323 2312.2
I want the output to look like this:
structureId chainId resolution uniprotAcc structureMolecularWeight
101M A 2.07 P02185 18112.8
102L A 1.74 P00720 18926.61
103D A 7502.93
103D B 7502.93
103L A 1.9 P00720 19092.72
103M A 2.07 P02185 18093.78
104L A,B 2.8 P00720 37541.04
104M A 1.71 P02185 18030.63
104M A 3.1 P09323 2312.2
i.e. if the 'uniprotAcc' value is the same for a given 'structureId', combine the rows.
I wrote this code:
import sys

set_of_ids = list(set([line.strip().split('\t')[0] for line in open(sys.argv[1])]))
master_dict = {}
for line in open(sys.argv[1]):
    split_line = line.strip().split('\t')
    if split_line[0] not in master_dict:
        master_dict[split_line[0]] = [split_line[1:]]
    else:
        master_dict[split_line[0]].append(split_line[1:])
print(master_dict)
which combines the data, so the key is the structureID and the values are a list of rows the structureId is involved in:
{'structureId': [['chainId', 'resolution', 'uniprotAcc', 'structureMolecularWeight']], '101M': [['A', '2.07', 'P02185', '18112.8']], '102L': [['A', '1.74', 'P00720', '18926.61']], '103D': [['A', '', '', '7502.93'], ['B', '', '', '7502.93']], '103L': [['A', '1.9', 'P00720', '19092.72']], '103M': [['A', '2.07', 'P02185', '18093.78']], '104L': [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']], '104M': [['A', '1.71', 'P02185', '18030.63'], ['A', '3.1', 'P09323', '2312.2']]}
I'm just stuck on one small thing, I know how to iterate through the dict:
for k in master_dict:
    for each_list in master_dict[k]:
I'm just stuck on the very next line: how do I say "combine the lists that are identical except for their first (index 0) item"?
i.e. so turn:
104L A 2.8 P00720 37541.04
104L B 2.8 P00720 37541.04
into:
104L A,B 2.8 P00720 37541.04
I'm probably making it sound more complicated than it is. Basically, for the rows in my table, if the only difference per structureId and per uniprotAcc is the chainId column, combine the chainId values.
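For reference, here is a sketch of the kind of merge I'm after (hypothetical code, not from any answer): group the rows under each structureId by every column except chainId, then join the chain IDs of each group.

```python
# Hypothetical sketch: merge rows that are identical except for chainId.
# `master_dict` has the shape shown above: structureId -> list of rows.
master_dict = {
    '104L': [['A', '2.8', 'P00720', '37541.04'],
             ['B', '2.8', 'P00720', '37541.04']],
    '104M': [['A', '1.71', 'P02185', '18030.63'],
             ['A', '3.1', 'P09323', '2312.2']],
}

merged = {}
for struct_id, rows in master_dict.items():
    groups = {}  # key: every column except chainId -> list of chain IDs
    for row in rows:
        key = tuple(row[1:])  # resolution, uniprotAcc, weight
        groups.setdefault(key, []).append(row[0])
    merged[struct_id] = [[','.join(chains), *key] for key, chains in groups.items()]

print(merged['104L'])  # [['A,B', '2.8', 'P00720', '37541.04']]
```

This keeps 104M as two rows (the uniprotAcc values differ) while collapsing 104L into one.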
Edit 1: Issue with answer below?
Say for example, this was the data:
structureId chainId resolution uniprotAcc structureMolecularWeight
6YC3 A 2.0 N0DKS8 181807.39
6YC3 B 2.0 N0DKS8 181807.39
6YC3 C 2.0 N0DKS8 181807.39
6YC3 D 2.0 N0DKS8 181807.39
6YC3 E 2.0 N0DKS8 181807.39
6YC4 A 2.6 N0DKS8 174142.86
6YC4 B 2.6 N0DKS8 174142.86
6YC4 C 2.6 N0DKS8 174142.86
6YC4 D 2.6 N0DKS8 174142.86
6YC4 E 2.6 N0DKS8 174142.86
So then the output should be:
6YC3 A,B,C,D,E 2.0 N0DKS8 181807.29
6YC4 A,B,C,D,E 2.6 N0DKS8 174142.86
Whereas the output from the code below is:
['6YC3', 'B,B,C,D,E,A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']
Edit 2: To avoid issue above, I made a column that combined the UniProt accession and structureID:
structureId chainId resolution uniprotAcc structureMolecularWeight newcode
6YC3 A 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 B 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 C 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 D 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC3 E 2.0 N0DKS8 181807.39 N0DKS8_6YC3
6YC4 A 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 B 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 C 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 D 2.6 N0DKS8 174142.86 N0DKS8_6YC4
6YC4 E 2.6 N0DKS8 174142.86 N0DKS8_6YC4
and then I just replaced the line in the code:
idx_uniprotAcc = headers.index("uniprotAcc") #to...
idx_uniprotAcc = headers.index("newcode")
When I run the exact same code as below, with just that one line changed, the output is:
['6YC3', 'B,B,C,D,E', '2.0', 'N0DKS8', '181807.39', 'N0DKS8_6YC3']
['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86', 'N0DKS8_6YC4']
Why is the first row returning 'B,B,C,D,E' and not 'A,B,C,D,E'? I think it's something to do with iterating over data[1:]?
Upvotes: 1
Views: 72
Reputation: 5500
Let's try the following approach:

1. Open the file and read all the lines. To do that, we can use readlines(), which returns all the lines as a list. Use strip to clean each string, then split it with the re module: re.split lets you split according to a regex. The pattern used is \s+, where \s stands for whitespace and + means "one or more". This first step can be summed up in the following two lines:

with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]

2. Select the headers: headers = data[0], as it's the first line.
3. Iterate over all the lines. We use enumerate to have the current index (and deduce the previous line).
4. When the current line matches the previous one on uniprotAcc: we update the last output line by adding the current chainId.
Full code
import re

# Read file
with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]
print(data)

# Select headers
headers = data[0]

# Get index of columns if not known
idx_uniprotAcc = headers.index("uniprotAcc")
idx_structureId = headers.index("structureId")
idx_chainId = headers.index("chainId")

# Remove header line
data = data[1:]

# In any case, we can add the header and first line to the output
out = [headers, data[0]]
print(out)

# Iterate over the lines starting at the second one
for i, line in enumerate(data[1:]):
    # Get previous line (i starts at 0 but data starts at the first line)
    prev_line = data[i]
    # print("prev: ", prev_line)
    # print("current: ", line)
    # Check the lines are the same and that they both have all the values.
    # Here you can add as many column checks as you want
    # (here I just added one on "structureId" as this seems to match the output,
    # but to be safe, it may be better to check all the columns)
    if len(line) == len(headers) and \
       len(prev_line) == len(headers) and \
       line[idx_uniprotAcc] == prev_line[idx_uniprotAcc] and \
       line[idx_structureId] == prev_line[idx_structureId]:
        # Merge current chainId into the previous output line
        out[-1][idx_chainId] += ",{}".format(line[idx_chainId])
    else:
        # Line is added as-is
        out.append(line)

[print(x) for x in out]
# ['structureId', 'chainId', 'resolution', 'uniprotAcc', 'structureMolecularWeight']
# ['101M', 'A', '2.07', 'P02185', '18112.8']
# ['102L', 'A', '1.74', 'P00720', '18926.61']
# ['103D', 'A', '7502.93']
# ['103D', 'B', '7502.93']
# ['103L', 'A', '1.9', 'P00720', '19092.72']
# ['103M', 'A', '2.07', 'P02185', '18093.78']
# ['104L', 'A,B', '2.8', 'P00720', '37541.04']
# ['104M', 'A', '1.71', 'P02185', '18030.63']
# ['104M', 'A', '3.1', 'P09323', '2312.2']
# ['6YC3', 'A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']
# ['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86']
# Export in text file
# with open('output.txt', 'w') as f:
# f.writelines("%s\n" % " ".join(x) for x in out)
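As a side note, the adjacent-line comparison above relies on matching rows being consecutive in the file (which holds for the sample data). An alternative sketch, not the code above, would use itertools.groupby on a (structureId, uniprotAcc) key; the column indices here assume full-length rows:

```python
from itertools import groupby

# `rows` stands in for the parsed lines without the header.
rows = [
    ['6YC3', 'A', '2.0', 'N0DKS8', '181807.39'],
    ['6YC3', 'B', '2.0', 'N0DKS8', '181807.39'],
    ['6YC4', 'A', '2.6', 'N0DKS8', '174142.86'],
    ['6YC4', 'B', '2.6', 'N0DKS8', '174142.86'],
]

out = []
for _, group in groupby(rows, key=lambda r: (r[0], r[3])):
    group = list(group)
    first = group[0][:]                       # copy so the source rows stay untouched
    first[1] = ','.join(r[1] for r in group)  # merge the chainId column
    out.append(first)

print(out)
# [['6YC3', 'A,B', '2.0', 'N0DKS8', '181807.39'],
#  ['6YC4', 'A,B', '2.6', 'N0DKS8', '174142.86']]
```

Copying the first row of each group also avoids mutating the parsed data in place.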
Hope that helps!
Upvotes: 1
Reputation: 4607
You can use the builtin zip
to perform item-wise concatenation. map
can be used for further processing.
For the given input -
item = [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']]
output = list(map(lambda t: t[0] if t[0] == t[1] else t[0] + "," + t[1], zip(*item)))
The result is -
['A,B', '2.8', 'P00720', '37541.04']
Note: The lambda in map
assumes that at most 2 rows are being concatenated. You could easily change that for n rows as well.
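For more than two rows, the same zip idea generalizes; a sketch (with a hypothetical `items` list) that keeps a value when all rows agree and comma-joins the differing ones:

```python
items = [
    ['A', '2.0', 'N0DKS8', '181807.39'],
    ['B', '2.0', 'N0DKS8', '181807.39'],
    ['C', '2.0', 'N0DKS8', '181807.39'],
]

# For each column, keep the single value if all rows agree,
# otherwise join the values with commas.
output = [vals[0] if len(set(vals)) == 1 else ','.join(vals)
          for vals in zip(*items)]

print(output)  # ['A,B,C', '2.0', 'N0DKS8', '181807.39']
```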
Upvotes: 1