john

Reputation: 263

Fixing the code in Python to change a text file

I have a big text file like the small example:

small example:

chr1    37091   37122   D00645:305:CCVLRANXX:1:1104:21074:48301 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1104:4580:50451  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1106:13064:5974  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1106:16735:48726 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:2210:5043:83540  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:2204:15744:24410 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:2204:19627:73060 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:2206:8497:68295  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1312:11371:24672 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1312:17050:42431 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1312:12969:62696 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1312:6478:73521  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1312:8402:80222  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1309:19837:15007 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1309:20126:89687 0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1310:2838:27860  0   -
chr1    37091   37122   D00645:305:CCVLRANXX:1:1310:7280:85906  0   -
chr1    54832   54863   D00645:305:CCVLRANXX:1:2102:19886:3949  0   -
chr1    74307   74338   D00645:305:CCVLRANXX:1:2203:13233:29983 0   -
chr1    74325   74356   D00645:305:CCVLRANXX:1:1310:7266:92995  0   -
chr1    93529   93560   D00645:305:CCVLRANXX:1:1103:1743:29602  0   +
chr1    93529   93560   D00645:305:CCVLRANXX:1:1101:16098:97354 0   +

I am trying to count the lines that share the same 1st, 2nd and 3rd columns and write a new file with 4 columns: the first 3 columns are the same as in the original file, and the 4th column is the number of times each row is repeated. For example, there are 17 rows with chr1 37091 37122. Here is the expected output for the small example above:

expected output:

chr1    37091   37122   17
chr1    54832   54863   1
chr1    74307   74338   1
chr1    74325   74356   1
chr1    93529   93560   2

I wrote this code in Python, but it does not return what I want. Do you know how to fix it?

infile = open('infile.txt', 'rb')
content = []
for i in infile:
    content.append(i.split())

final = []
for j in range(len(content)):
    if content[j] == content[j-1]:
        final.append(content[j])

with open('outfile.txt','w') as f:
    for sublist in final:
        for item in sublist:
            f.write(item + '\t')
        f.write('\n')

Upvotes: 0

Views: 82

Answers (4)

r.ook

Reputation: 13898

Here's one way to do it:

with open('infile.txt', 'r') as file:
    data = [i.split() for i in file]

results = {}
for i in data:
    # key on the first three columns; .setdefault starts each counter at 0
    key = '\t'.join(i[:3])
    results.setdefault(key, 0)
    results[key] += 1

# results

# {'chr1\t37091\t37122': 17, 
#  'chr1\t54832\t54863': 1, 
#  'chr1\t74307\t74338': 1,
#  'chr1\t74325\t74356': 1, 
#  'chr1\t93529\t93560': 2}

# Output the results with a generator expression, one line per key
with open('outfile.txt', 'w') as file:
    file.writelines('{}\t{}\n'.format(k, v) for k, v in results.items())

Or, just use Counter:

from collections import Counter
with open('infile.txt', 'r') as file:
    data = ['\t'.join(i.split()[:3]) for i in file]

with open('outfile.txt', 'w') as file:
    file.writelines('{}\t{}\n'.format(k, v) for k, v in Counter(data).items())

# Counter(data).items()

# dict_items([('chr1\t37091\t37122', 17),
#             ('chr1\t54832\t54863', 1), 
#             ('chr1\t74307\t74338', 1), 
#             ('chr1\t74325\t74356', 1),
#             ('chr1\t93529\t93560', 2)])

In either case we group the first three "columns" into a key, then use that key to count the number of times it occurs in your data.
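If the file is really big, the same grouping can be done while streaming the lines instead of building a list first. This is only a minimal sketch: the function name count_regions and the sample read names are made up for illustration, and it keys on a tuple of the first three columns rather than a joined string.

```python
from collections import Counter

def count_regions(lines):
    """Count how often each (col1, col2, col3) triple occurs (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        counts[tuple(fields[:3])] += 1
    return counts

# Made-up sample rows in the same layout as the question's file
sample = [
    "chr1 37091 37122 read_a 0 -",
    "chr1 37091 37122 read_b 0 -",
    "chr1 54832 54863 read_c 0 -",
]
counts = count_regions(sample)
for key, n in counts.items():
    print("\t".join(key) + "\t" + str(n))
```

Because count_regions accepts any iterable of lines, an open file object can be passed in directly, so only the counts are held in memory.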

Upvotes: 0

atru

Reputation: 4744

You can use a regular dictionary with your target comparison lines as keys:

infile = 'infile.txt'
content = {}

with open(infile, 'r') as fin:
    for line in fin:
        temp = line.split()
        if not temp[1]+temp[2] in content:
            content[temp[1]+temp[2]] = [1, temp[0:3]]
        else:
            content[temp[1]+temp[2]][0]+=1

with open('outfile.txt','w') as fout:
    for key, value in content.items():
        for entry in value[1]:
            fout.write(entry + ' ')
        fout.write(str(value[0]) + '\n')

The key is the concatenation of the second and third columns. The value is a list: its first element is the counter, and its second element is the list of values from your input file that you want to save to the output. The if statement checks whether there is already an entry with the given key: if not, it creates a new list with the counter set to 1 and the appropriate values as the second element; otherwise it increments the counter.

Note that, for consistency, the program uses the recommended with open for both files. It also reads the text file in text mode rather than binary mode.
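One caveat about string-concatenated keys: different rows can produce the same key, for example 12 + 345 and 123 + 45, or the same coordinates on two chromosomes. A possible variation, sketched here with made-up rows and a hypothetical function name, is to use a tuple of the first three columns as the key:

```python
def count_by_columns(lines):
    """Count rows per (col1, col2, col3) tuple (hypothetical variation)."""
    content = {}
    for line in lines:
        temp = line.split()
        key = tuple(temp[0:3])  # the tuple keeps the column boundaries
        content[key] = content.get(key, 0) + 1
    return content

# Made-up rows chosen so that column 2 + column 3 is always "12345"
rows = [
    "chr1 12 345 r1 0 +",
    "chr1 123 45 r2 0 +",
    "chr2 12 345 r3 0 +",
]
by_tuple = count_by_columns(rows)
concatenated = {line.split()[1] + line.split()[2] for line in rows}

print(len(by_tuple))      # distinct tuple keys
print(len(concatenated))  # distinct concatenated keys
```

With these rows the tuple keys stay distinct, while the concatenated keys collapse into a single entry.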

Upvotes: 1

Mayank Porwal

Reputation: 34086

You can also use pandas, which makes the solution really easy:

Just read the big text file into a pandas DataFrame like this:

import pandas as pd
# the file has no header row and is whitespace-separated
df = pd.read_csv('infile.txt', sep=r'\s+', header=None)
df.groupby([0, 1, 2]).size()

This should give you:

chr1 37091 37122     17
     54832 54863      1
     74307 74338      1
     74325 74356      1
     93529 93560      2

Let me know if this helps.

Upvotes: 1

Novak

Reputation: 2171

You can use Counter like this:

from collections import Counter

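content = []
# open in text mode so the columns are str, not bytes
with open('infile.txt', 'r') as infile:
    for i in infile:
        # keep only the first 3 columns, joined into one string
        content.append('\t'.join(i.split()[:3]))

# Counter is a dict subclass mapping each string to its count
c = Counter(content)

# all entries, sorted from most to least frequent
elements = c.most_common()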

with open('outfile.txt','w') as f:
    for item, freq in elements:
        f.write('{}\t{}\n'.format(item, freq))
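One thing to keep in mind with most_common: it sorts by count, so the output order can differ from the order of the input file. A small sketch of the difference, using made-up keys and assuming Python 3.7+ (where a Counter keeps insertion order):

```python
from collections import Counter

# Made-up region keys, listed in the order they would appear in a file
keys = [
    "chr1\t100\t131",
    "chr1\t200\t231",
    "chr1\t200\t231",
    "chr1\t300\t331",
]
c = Counter(keys)

by_frequency = c.most_common()   # highest count first
by_first_seen = list(c.items())  # order of first appearance

print(by_frequency[0])
print(by_first_seen[0])
```

If the output should follow the input order, as in the expected output of the question, iterating over c.items() is probably the better fit.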

Upvotes: 1
