Slowat_Kela

Reputation: 1521

Combine almost identical dictionary value lists together per key

I have an input file like this:

structureId chainId resolution  uniprotAcc  structureMolecularWeight
101M    A   2.07    P02185  18112.8
102L    A   1.74    P00720  18926.61
103D    A                   7502.93
103D    B                   7502.93
103L    A   1.9     P00720  19092.72
103M    A   2.07    P02185  18093.78
104L    A   2.8     P00720  37541.04
104L    B   2.8     P00720  37541.04
104M    A   1.71    P02185  18030.63
104M    A   3.1     P09323  2312.2

I want the output to look like this:

structureId chainId resolution  uniprotAcc  structureMolecularWeight

101M    A   2.07    P02185  18112.8
102L    A   1.74    P00720  18926.61
103D    A                   7502.93
103D    B                   7502.93
103L    A   1.9     P00720  19092.72
103M    A   2.07    P02185  18093.78
104L    A,B 2.8     P00720  37541.04
104M    A   1.71    P02185  18030.63
104M    A   3.1     P09323  2312.2

i.e. if the 'uniprotAcc' value is the same for a given 'structureId', combine those rows by joining their 'chainId' values.

I wrote this code:

import sys

# Group rows by structureId (column 0); each value is a list of the remaining columns
master_dict = {}
with open(sys.argv[1]) as f:
    for line in f:
        split_line = line.strip().split('\t')
        master_dict.setdefault(split_line[0], []).append(split_line[1:])

print(master_dict)

This groups the data so that each key is a structureId and each value is the list of rows that structureId appears in:

{'structureId': [['chainId', 'resolution', 'uniprotAcc', 'structureMolecularWeight']], '101M': [['A', '2.07', 'P02185', '18112.8']], '102L': [['A', '1.74', 'P00720', '18926.61']], '103D': [['A', '', '', '7502.93'], ['B', '', '', '7502.93']], '103L': [['A', '1.9', 'P00720', '19092.72']], '103M': [['A', '2.07', 'P02185', '18093.78']], '104L': [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']], '104M': [['A', '1.71', 'P02185', '18030.63'], ['A', '3.1', 'P09323', '2312.2']]}

I'm stuck on one small thing. I know how to iterate through the dict:

for k in master_dict:
    for each_list in master_dict[k]:

I'm stuck on the very next line: how to say 'combine the lists that are identical except for their chainId item' (index 0 in each value list).

i.e. turn:

104L    A   2.8     P00720  37541.04
104L    B   2.8     P00720  37541.04

into:

104L    A,B   2.8     P00720  37541.04

I'm probably making it sound more complicated than it is. Basically, for the rows in my table, if the only difference per structureId and per uniprotAcc is the chainId column, combine the chainId values.

Edit 1: Issue with answer below?

Say for example, this was the data:

structureId chainId resolution  uniprotAcc  structureMolecularWeight
6YC3    A   2.0 N0DKS8  181807.39
6YC3    B   2.0 N0DKS8  181807.39
6YC3    C   2.0 N0DKS8  181807.39
6YC3    D   2.0 N0DKS8  181807.39
6YC3    E   2.0 N0DKS8  181807.39
6YC4    A   2.6 N0DKS8  174142.86
6YC4    B   2.6 N0DKS8  174142.86
6YC4    C   2.6 N0DKS8  174142.86
6YC4    D   2.6 N0DKS8  174142.86
6YC4    E   2.6 N0DKS8  174142.86

So then the output should be:

6YC3 A,B,C,D,E 2.0 N0DKS8 181807.39
6YC4 A,B,C,D,E 2.6 N0DKS8 174142.86

Whereas the output from the code below is:

['6YC3', 'B,B,C,D,E,A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']

Edit 2: To avoid the issue above, I added a column that combines the UniProt accession and the structureId:

structureId chainId resolution  uniprotAcc  structureMolecularWeight    newcode
6YC3    A   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    B   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    C   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    D   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC3    E   2.0 N0DKS8  181807.39   N0DKS8_6YC3
6YC4    A   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    B   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    C   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    D   2.6 N0DKS8  174142.86   N0DKS8_6YC4
6YC4    E   2.6 N0DKS8  174142.86   N0DKS8_6YC4

and then I just replaced the line in the code:

idx_uniprotAcc = headers.index("uniprotAcc") #to...
idx_uniprotAcc = headers.index("newcode")

When I run the exact same code as below, with just that one line changed, the output is:

['6YC3', 'B,B,C,D,E', '2.0', 'N0DKS8', '181807.39', 'N0DKS8_6YC3']
['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86', 'N0DKS8_6YC4']

Why is the first row returning 'B,B,C,D,E' and not 'A,B,C,D,E'? I think it has something to do with iterating over data[1:].

Upvotes: 1

Views: 72

Answers (2)

Alexandre B.

Reputation: 5500

Let's try the following approach:

  1. Open the file and read all the lines. To do that, we can use readlines(), which returns all the lines as a list. (For more detail, this tutorial explains how to use it.)

    1. On each line, we apply strip to clean the string.
    2. Now that we have a line, we want to extract the value of each column. To do that, we split the string on whitespace. However, the number of whitespace characters between the values may vary, so we use a regex with the re module. The re.split method splits a string according to a regex. The pattern used is \s+, where \s matches any whitespace character and + means one or more.

    The first step can be summed up in the following two lines:
with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f.readlines()]
  2. Select the headers with headers = data[0], as it's the first line.
  3. Iterate over all the lines. We use enumerate to get the current index (and deduce the previous line from it).

    • If the current and previous lines have the same uniprotAcc and structureId: we update the last output line by appending the current chainId
    • Else: we add the current line to the output
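To illustrate the splitting step described above, here is a small sketch of how splitting on \s+ behaves on two rows copied from the question. It assumes the file is tab-separated (as the question's own split('\t') suggests). Note that rows with empty columns come out shorter than the header, which is why a length check is needed before comparing columns by index.

```python
import re

# Sample rows copied from the question; tab separation is an assumption
# based on the question's own split('\t') call.
full_row = "104L\tA\t2.8\tP00720\t37541.04"
sparse_row = "103D\tA\t\t\t7502.93"  # resolution and uniprotAcc are empty

# Splitting on runs of whitespace treats consecutive tabs as one separator,
# so the empty columns vanish and the later values shift left.
ws_full = re.split(r"\s+", full_row.strip())
ws_sparse = re.split(r"\s+", sparse_row.strip())

# Splitting on a single tab keeps the empty columns in place.
tab_sparse = sparse_row.strip().split("\t")

print(ws_full)     # ['104L', 'A', '2.8', 'P00720', '37541.04']
print(ws_sparse)   # ['103D', 'A', '7502.93']
print(tab_sparse)  # ['103D', 'A', '', '', '7502.93']
```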

Full code

import re

# Read the file and split each line on runs of whitespace
with open("data.txt") as f:
    data = [re.split(r'\s+', line.strip()) for line in f]

# Select headers
headers = data[0]
# Look up the column indices by name
idx_uniprotAcc = headers.index("uniprotAcc")
idx_structureId = headers.index("structureId")
idx_chainId = headers.index("chainId")
# Remove header line
data = data[1:]

# The header and the first data line always go to the output.
# Rows are copied so that merging does not modify `data`.
out = [headers, list(data[0])]
# Iterate over the lines starting at the second one
for i, line in enumerate(data[1:]):
    # Get previous line (i starts at 0 while the slice starts at data[1])
    prev_line = data[i]

    # Merge only if both lines have all the values and share the same
    # uniprotAcc and structureId. You can add as many column checks as you want
    # here (to be safe, it may be better to compare every column except chainId).
    if len(line) == len(headers) and \
            len(prev_line) == len(headers) and \
            line[idx_uniprotAcc] == prev_line[idx_uniprotAcc] and \
            line[idx_structureId] == prev_line[idx_structureId]:
        # Merge current with previous output line
        out[-1][idx_chainId] += ",{}".format(line[idx_chainId])
    else:
        # Line is added
        out.append(list(line))

for row in out:
    print(row)
# ['structureId', 'chainId', 'resolution', 'uniprotAcc', 'structureMolecularWeight']
# ['101M', 'A', '2.07', 'P02185', '18112.8']
# ['102L', 'A', '1.74', 'P00720', '18926.61']
# ['103D', 'A', '7502.93']
# ['103D', 'B', '7502.93']
# ['103L', 'A', '1.9', 'P00720', '19092.72']
# ['103M', 'A', '2.07', 'P02185', '18093.78']
# ['104L', 'A,B', '2.8', 'P00720', '37541.04']
# ['104M', 'A', '1.71', 'P02185', '18030.63']
# ['104M', 'A', '3.1', 'P09323', '2312.2']
# ['6YC3', 'A,B,C,D,E', '2.0', 'N0DKS8', '181807.39']
# ['6YC4', 'A,B,C,D,E', '2.6', 'N0DKS8', '174142.86']

# Export in text file
# with open('output.txt', 'w') as f:
#     f.writelines("%s\n" % "  ".join(x) for x in out)

Hope that helps!
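As an alternative sketch (not part of the code above), the rows could also be grouped with a dict keyed on every column except chainId, which does not rely on matching rows being adjacent. The function name merge_chains and its parameters are hypothetical, and the sketch assumes rows were split on tabs so that missing values are empty strings. Rows with an empty uniprotAcc are kept separate, matching the desired output in the question.

```python
def merge_chains(rows, chain_idx=1, acc_idx=3):
    """Merge rows that are identical except for the column at chain_idx.

    Hypothetical helper: assumes each row is a list of strings and that
    missing values are empty strings (tab-split input).
    """
    groups = {}  # dicts keep insertion order on Python 3.7+
    for row in rows:
        rest = tuple(row[:chain_idx] + row[chain_idx + 1:])
        # Rows without a uniprotAcc stay separate, as in the question's
        # desired output, so their chain becomes part of the key.
        key = (rest, row[chain_idx] if row[acc_idx] == "" else None)
        groups.setdefault(key, []).append(row[chain_idx])

    merged = []
    for (rest, _), chains in groups.items():
        row = list(rest)
        row.insert(chain_idx, ",".join(chains))
        merged.append(row)
    return merged


rows = [
    ["103D", "A", "", "", "7502.93"],
    ["103D", "B", "", "", "7502.93"],
    ["104L", "A", "2.8", "P00720", "37541.04"],
    ["104L", "B", "2.8", "P00720", "37541.04"],
    ["104M", "A", "1.71", "P02185", "18030.63"],
    ["104M", "A", "3.1", "P09323", "2312.2"],
]
merged = merge_chains(rows)
for row in merged:
    print("\t".join(row))
```

Only the two 104L rows are combined here (chainId becomes A,B); the 103D rows stay separate because their uniprotAcc is empty, and the 104M rows differ in other columns.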

Upvotes: 1

mahoriR

Reputation: 4607

You can use the built-in zip to perform item-wise concatenation; map can be used for further processing.

For the given input -

item = [['A', '2.8', 'P00720', '37541.04'], ['B', '2.8', 'P00720', '37541.04']]

output = list(map(lambda t: t[0] if t[0] == t[1] else t[0] + "," + t[1], zip(*item)))

The result is -

['A,B', '2.8', 'P00720', '37541.04']

Note: The lambda in map assumes that at most 2 rows are being concatenated. You could easily change that for n rows as well.
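One possible way to extend this to n rows is sketched below. The function name combine_columns is hypothetical, and it assumes the rows passed in already belong to the same group (same structureId and uniprotAcc), since every column that differs gets joined.

```python
def combine_columns(rows):
    """Column-wise merge: keep a value if all rows agree, else join the values.

    Hypothetical helper; assumes all rows have the same length and already
    belong to the same group.
    """
    combined = []
    for column in zip(*rows):
        if len(set(column)) == 1:
            # Every row has the same value in this column
            combined.append(column[0])
        else:
            # Values differ, so join them in row order
            combined.append(",".join(column))
    return combined


rows = [
    ['A', '2.0', 'N0DKS8', '181807.39'],
    ['B', '2.0', 'N0DKS8', '181807.39'],
    ['C', '2.0', 'N0DKS8', '181807.39'],
]
result = combine_columns(rows)
print(result)  # ['A,B,C', '2.0', 'N0DKS8', '181807.39']
```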

Upvotes: 1
