john

Reputation: 263

How to get statistics from a text file in Python

I have a big text file like the small example:

small example:

chr1    10385347    10385379    11  chr1    10000000    11000000
chr1    10385348    10385379    40  chr1    10000000    11000000
chr1    10385348    10385379    40  chr1    10000000    11000000
chr1    10385348    10385379    381 chr1    10000000    11000000
chr1    10561009    10561040    12  chr1    10000000    11000000
chr1    10561009    10561040    24  chr1    10000000    11000000
chr1    10647768    10647799    68  chr1    10000000    11000000
chr1    10958095    10958126    17  chr1    10000000    11000000
chr1    11196862    11196893    39  chr1    11000000    12000000
chr1    11921548    11921579    56  chr1    11000000    12000000
chr1    13967589    13967620    111 chr1    13000000    14000000
chr1    15290638    15290669    11  chr1    15000000    16000000
chr1    15587268    15587299    32  chr1    15000000    16000000
chr1    15587268    15587299    13  chr1    15000000    16000000

I want to count the number of rows in which columns 6 and 7 are the same, and then make a new file with 4 columns: 'column5', 'column6', 'column7', and 'count'. The output for the small example would look like this:

expected output:

chr1    10000000    11000000    8
chr1    11000000    12000000    2
chr1    13000000    14000000    1
chr1    15000000    16000000    3

I am trying to do that in Python and wrote this code:

file = open('infile.txt', 'rb')
line = []
for i in file:
    line.append(i.split())
    count = 0
    new_list = []
    for j in range(len(line)):
        if line[j] == line[j-i]:
            count+=1
            new_list.append(count)

with open(outfile, "w") as f:
    for item in new_list:
        f.write("%s\n" % item)

But it does not return the output I expect. Do you know how to fix it?

Upvotes: 0

Views: 2284

Answers (5)

Poonam Agrawal

Reputation: 51

Statistical Summary

import pandas

# Put any CSV file URL here, or load a local CSV file.
url = "put csv file here"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

data = pandas.read_csv(url, names=names)

description = data.describe()

print(description)

Upvotes: -1

evantkchong

Reputation: 2616

An approach you might want to consider is to load the text file as a Pandas dataframe and work from there using the library functions. One caveat is that this method will be slow with very large datasets.

You'll need the Pandas and NumPy libraries, which are imported as follows:

import numpy as np
import pandas as pd

Next you can import the data as a dataframe.

# Passing names=['column1', ...] as an argument lets us define the headers to be used for each column.
# As the name suggests, delim_whitespace=True makes any run of whitespace the delimiter.
df = pd.read_csv('filename.txt',
                 names=['column1', 'column2', 'column3', 'column4',
                        'column5', 'column6', 'column7'],
                 delim_whitespace=True)

Which generates the following dataframe:

   column1   column2   column3  column4 column5   column6   column7
0     chr1  10385347  10385379       11    chr1  10000000  11000000
1     chr1  10385348  10385379       40    chr1  10000000  11000000
2     chr1  10385348  10385379       40    chr1  10000000  11000000
3     chr1  10385348  10385379      381    chr1  10000000  11000000
4     chr1  10561009  10561040       12    chr1  10000000  11000000
5     chr1  10561009  10561040       24    chr1  10000000  11000000
6     chr1  10647768  10647799       68    chr1  10000000  11000000
7     chr1  10958095  10958126       17    chr1  10000000  11000000
8     chr1  11196862  11196893       39    chr1  11000000  12000000
9     chr1  11921548  11921579       56    chr1  11000000  12000000
10    chr1  13967589  13967620      111    chr1  13000000  14000000
11    chr1  15290638  15290669       11    chr1  15000000  16000000
12    chr1  15587268  15587299       32    chr1  15000000  16000000
13    chr1  15587268  15587299       13    chr1  15000000  16000000

With the dataframe set up, we can now use Pandas functions to help manipulate the dataframe.

To make a new dataframe containing the counts of rows duplicated across columns 5, 6 and 7, you may use the following (I'm assuming you have values other than 'chr1' in column 5, so we include that column when counting duplicates):

#groupby(['column5','column6','column7']) means that we're looking for duplicates across columns 5,6,7 and grouping them together
#.size() returns the number of duplicates aka the size of each group
#.reset_index().rename(columns={0:'count'}) is just giving the new column of duplicate numbers a header 'count'
dfnew = df.groupby(['column5','column6','column7']).size().reset_index().rename(columns={0:'count'})

This gives me dfnew:

  column5   column6   column7  count
0    chr1  10000000  11000000      8
1    chr1  11000000  12000000      2
2    chr1  13000000  14000000      1
3    chr1  15000000  16000000      3
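
As a side note, Series.reset_index also accepts a name argument, so an equivalent (and arguably tidier) spelling would be dfnew = df.groupby(['column5','column6','column7']).size().reset_index(name='count').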

With this new dataframe, it is a simple matter to generate a file with the data.

#The argument sep=' ' indicates that one space is used as a delimiter for the file
#The argument index=False omits the row indexes when writing to the file
dfnew.to_csv('newfile.txt', sep=' ', index=False)

The final file is as such:

column5 column6 column7 count
chr1 10000000 11000000 8
chr1 11000000 12000000 2
chr1 13000000 14000000 1
chr1 15000000 16000000 3
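
As mentioned above, this can be slow for very large files. If the file does not fit comfortably in memory, one option is read_csv's chunksize parameter. Here is a minimal sketch under the same column-name assumptions, summing the per-chunk group sizes:

cols = ['column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7']
chunks = pd.read_csv('filename.txt', names=cols, delim_whitespace=True, chunksize=100000)
# Compute group sizes per chunk, then combine and sum them across chunks.
partial = [chunk.groupby(['column5', 'column6', 'column7']).size() for chunk in chunks]
total = pd.concat(partial).groupby(level=[0, 1, 2]).sum()
dfnew = total.reset_index().rename(columns={0: 'count'})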

I hope this helps!

References used:

Pandas Docs

How to count duplicate rows in pandas dataframe?

Upvotes: -1

NutCracker

Reputation: 12273

Here is my working solution to your problem. Others can comment if there are more efficient ways to do this.

result = []

with open('infile.txt', 'r') as infile:
    for line in infile:
        fields = line.split()

        # If this (column 6, column 7) pair has been seen before, bump its counter.
        if any(x['6th'] == fields[5] and x['7th'] == fields[6] for x in result):
            for x in result:
                if x['6th'] == fields[5] and x['7th'] == fields[6]:
                    x['counter'] += 1
                    break
        # Otherwise, record the pair (plus column 5) with a counter of 1.
        else:
            result.append({
                '5th': fields[4],
                '6th': fields[5],
                '7th': fields[6],
                'counter': 1
            })

with open('outfile.txt', 'w') as outfile:
    for x in result:
        outfile.write('{} {} {} {}\n'.format(
            x['5th'],
            x['6th'],
            x['7th'],
            x['counter']
        ))

Since I don't know the context of your project and the meanings of the values, I just put dummy names like 5th, 6th and 7th.
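
On efficiency: a variant of the same idea that avoids re-scanning result for every line keys a dictionary on the (6th, 7th) pair instead. A minimal sketch, keeping the same dummy names:

result = {}

with open('infile.txt', 'r') as infile:
    for line in infile:
        fields = line.split()
        key = (fields[5], fields[6])  # columns 6 and 7
        if key in result:
            result[key]['counter'] += 1
        else:
            result[key] = {
                '5th': fields[4],
                '6th': fields[5],
                '7th': fields[6],
                'counter': 1
            }

with open('outfile.txt', 'w') as outfile:
    for x in result.values():
        outfile.write('{} {} {} {}\n'.format(x['5th'], x['6th'], x['7th'], x['counter']))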

Upvotes: 0

BarathVutukuri

Reputation: 1313

Not a complete answer, but it will help you get there. Read every line from the file and split it on the delimiter, which in your case is runs of spaces. Take the 6th and 7th fields (indexes 5 and 6 after splitting), combine them into a dictionary key, and set its value to 1. As you iterate over each line in the file, generate the key and increment the value by 1 each time the key is already present. I am assuming column 1 is the same for all the lines; otherwise, include column 1 in the key as well.

You can apply the operation below to each line in your file and read off the statistics once all the lines have been processed.

import re

s = "chr1    10385347    10385379    11  chr1    10000000    11000000"
res = {}
s = re.sub(' +', ' ', s)  # collapse runs of spaces into a single space
res[s.split(" ")[5] + "-" + s.split(" ")[6]] = 1  # key on columns 6 and 7
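
Putting that together over a whole file, here is a minimal sketch (the filename infile.txt is an assumption), using collections.defaultdict to accumulate the counts:

import re
from collections import defaultdict

counts = defaultdict(int)  # maps "col6-col7" keys to occurrence counts

with open('infile.txt') as f:
    for line in f:
        s = re.sub(' +', ' ', line.strip())
        parts = s.split(" ")
        counts[parts[5] + "-" + parts[6]] += 1

for key, count in counts.items():
    print(key, count)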

Upvotes: 1

user3142459

Reputation: 640

Python is famous for not needing monolithic code like this. If you split it up into multiple function calls, it will be easier to read and to debug.

Just hints, with a minimal sketch after the list:

  1. Read the lines from the file.
  2. Replace all runs of whitespace with a single space (use a regex for this).
  3. Group the lines by your criteria (matching columns 6 and 7).
  4. Write the grouped counts to a file.
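
A minimal sketch of these steps (filenames and function names are illustrative assumptions):

import re
from collections import Counter

def read_lines(path):
    # Step 1: read the lines from the file.
    with open(path) as f:
        return f.readlines()

def normalize(line):
    # Step 2: replace all runs of whitespace with a single space.
    return re.sub(r'\s+', ' ', line.strip())

def group_key(line):
    # Step 3: group on columns 5, 6 and 7.
    fields = normalize(line).split(' ')
    return (fields[4], fields[5], fields[6])

def write_counts(lines, path):
    # Step 4: write the grouped counts to a file.
    counts = Counter(group_key(line) for line in lines if line.strip())
    with open(path, 'w') as f:
        for (c5, c6, c7), n in counts.items():
            f.write('{} {} {} {}\n'.format(c5, c6, c7, n))

write_counts(read_lines('infile.txt'), 'outfile.txt')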

Upvotes: 2
