Reputation: 263
I have a big text file like this small example:
chr1 10385347 10385379 11 chr1 10000000 11000000
chr1 10385348 10385379 40 chr1 10000000 11000000
chr1 10385348 10385379 40 chr1 10000000 11000000
chr1 10385348 10385379 381 chr1 10000000 11000000
chr1 10561009 10561040 12 chr1 10000000 11000000
chr1 10561009 10561040 24 chr1 10000000 11000000
chr1 10647768 10647799 68 chr1 10000000 11000000
chr1 10958095 10958126 17 chr1 10000000 11000000
chr1 11196862 11196893 39 chr1 11000000 12000000
chr1 11921548 11921579 56 chr1 11000000 12000000
chr1 13967589 13967620 111 chr1 13000000 14000000
chr1 15290638 15290669 11 chr1 15000000 16000000
chr1 15587268 15587299 32 chr1 15000000 16000000
chr1 15587268 15587299 13 chr1 15000000 16000000
I want to count the number of rows in which columns 6 and 7 are the same, and then make a new file with 4 columns: 'column5', 'column6', 'column7', and 'count'.
The expected output for the small example would look like this:
chr1 10000000 11000000 8
chr1 11000000 12000000 2
chr1 13000000 14000000 1
chr1 15000000 16000000 3
I am trying to do that in Python and wrote this code:
file = open('infile.txt', 'rb')
line = []
for i in file:
    line.append(i.split())

count = 0
new_list = []
for j in range(len(line)):
    if line[j] == line[j-i]:
        count += 1
        new_list.append(count)

with open(outfile, "w") as f:
    for item in new_list:
        f.write("%s\n" % item)
But it does not return the output I expect. Do you know how to fix it?
Upvotes: 0
Views: 2284
Reputation: 51
import pandas

url = "put csv file here "  # path to the input file
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # column headers
data = pandas.read_csv(url, names=names)
description = data.describe()  # summary statistics for each column
print(description)
Upvotes: -1
Reputation: 2616
An approach you might want to consider is to load the text file as a Pandas dataframe and work from there using the library functions. One caveat is that this method will be slow with very large datasets.
You'll need the Pandas library, which is imported as follows:

import pandas as pd
Next you can import the data as a dataframe.
# Passing `names=['column1', ...]` as an argument lets us define the headers to be used for each column.
# As the name suggests, `delim_whitespace=True` makes whitespace the delimiter.
df = pd.read_csv('filename.txt',
                 names=['column1', 'column2', 'column3', 'column4',
                        'column5', 'column6', 'column7'],
                 delim_whitespace=True)
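Note that recent Pandas releases deprecate `delim_whitespace`; if your version warns about it, the regex separator below is equivalent (a minimal sketch, same assumed file name):

# Equivalent call using a regex separator instead of delim_whitespace
df = pd.read_csv('filename.txt',
                 names=['column1', 'column2', 'column3', 'column4',
                        'column5', 'column6', 'column7'],
                 sep=r'\s+')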
Either way, this generates the following dataframe:
column1 column2 column3 column4 column5 column6 column7
0 chr1 10385347 10385379 11 chr1 10000000 11000000
1 chr1 10385348 10385379 40 chr1 10000000 11000000
2 chr1 10385348 10385379 40 chr1 10000000 11000000
3 chr1 10385348 10385379 381 chr1 10000000 11000000
4 chr1 10561009 10561040 12 chr1 10000000 11000000
5 chr1 10561009 10561040 24 chr1 10000000 11000000
6 chr1 10647768 10647799 68 chr1 10000000 11000000
7 chr1 10958095 10958126 17 chr1 10000000 11000000
8 chr1 11196862 11196893 39 chr1 11000000 12000000
9 chr1 11921548 11921579 56 chr1 11000000 12000000
10 chr1 13967589 13967620 111 chr1 13000000 14000000
11 chr1 15290638 15290669 11 chr1 15000000 16000000
12 chr1 15587268 15587299 32 chr1 15000000 16000000
13 chr1 15587268 15587299 13 chr1 15000000 16000000
With the dataframe set up, we can now use Pandas functions to manipulate it.
To make a new dataframe holding the counts of rows duplicated across columns 5, 6 and 7, you may use the following (I'm assuming you have values other than 'chr1' in column 5, so we'd like to include that column when counting duplicates):
# groupby(['column5','column6','column7']) looks for duplicates across columns 5, 6 and 7 and groups them together
# .size() returns the number of duplicates, i.e. the size of each group
# .reset_index().rename(columns={0:'count'}) gives the new column of duplicate counts the header 'count'
dfnew = df.groupby(['column5','column6','column7']).size().reset_index().rename(columns={0:'count'})
This gives me dfnew:
column5 column6 column7 count
0 chr1 10000000 11000000 8
1 chr1 11000000 12000000 2
2 chr1 13000000 14000000 1
3 chr1 15000000 16000000 3
With this new dataframe, it is a simple matter to generate a file with the data.
# The argument sep=' ' indicates that a single space is used as the delimiter in the file
# The argument index=False omits the row indexes when writing to the file
dfnew.to_csv('newfile.txt', sep=' ', index=False)
The final file looks like this:
column5 column6 column7 count
chr1 10000000 11000000 8
chr1 11000000 12000000 2
chr1 13000000 14000000 1
chr1 15000000 16000000 3
I hope this helps!
References used:
How to count duplicate rows in pandas dataframe?
Upvotes: -1
Reputation: 12273
Here is my working solution to your problem. Others can comment if there are some more efficient ways to do this.
result = []

# Open in text mode ('r'); binary mode would yield bytes fields in Python 3
# and produce output like b'chr1' when formatting.
with open('infile.txt', 'r') as infile:
    text = infile.read()

for line in text.splitlines():
    fields = line.split()
    if any(x['6th'] == fields[5] and x['7th'] == fields[6] for x in result):
        for x in result:
            if x['6th'] == fields[5] and x['7th'] == fields[6]:
                x['counter'] += 1
                break
    else:
        result.append({
            '5th': fields[4],
            '6th': fields[5],
            '7th': fields[6],
            'counter': 1
        })

with open('outfile.txt', 'w') as outfile:
    for x in result:
        outfile.write('{} {} {} {}\n'.format(
            x['5th'],
            x['6th'],
            x['7th'],
            x['counter']
        ))
Since I don't know the context of your project and the meanings of the values, I just put dummy names like '5th', '6th' and '7th'.
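For comparison, the same bookkeeping can be written more compactly with collections.Counter (a sketch only; 'infile.txt' and 'outfile.txt' are the assumed file names):

from collections import Counter

# Count occurrences of each (column 5, column 6, column 7) triple
counts = Counter()
with open('infile.txt', 'r') as infile:
    for line in infile:
        fields = line.split()
        counts[(fields[4], fields[5], fields[6])] += 1

# Write one line per distinct triple, followed by its count
with open('outfile.txt', 'w') as outfile:
    for (c5, c6, c7), n in counts.items():
        outfile.write('{} {} {} {}\n'.format(c5, c6, c7, n))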
Upvotes: 0
Reputation: 1313
Not a full answer, but it should help you get there. Read every line from the file and normalize the delimiter (your sample has runs of spaces). Split each line, take the fields at indices 5 and 6 (columns 6 and 7), combine them into a dictionary key, and map that key to the value 1. Iterate over each line in the file, generate the key, and increment the value by 1 each time the key is found again. I am assuming column 1 is the same for all the lines; otherwise, include column 1 in the key as well.
You can do the below operation for each line in your file and get the statistics at the end when all the lines are read.
import re

res = {}
s = "chr1 10385347 10385379 11 chr1 10000000 11000000"
s = re.sub(' +', ' ', s)  # collapse runs of spaces into single spaces
key = s.split(" ")[5] + "-" + s.split(" ")[6]  # combine columns 6 and 7
res[key] = res.get(key, 0) + 1  # increment the count for this key
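Wrapped in a loop over the whole file, the same idea might look like this (a sketch; 'infile.txt' is an assumed file name):

import re

res = {}
with open('infile.txt') as infile:
    for line in infile:
        s = re.sub(' +', ' ', line.strip())  # normalize runs of spaces
        parts = s.split(" ")
        key = parts[5] + "-" + parts[6]  # columns 6 and 7 as the key
        res[key] = res.get(key, 0) + 1   # increment each time the key is found

print(res)  # the statistics once all the lines are read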
Upvotes: 1
Reputation: 640
Python is famous for not requiring monolithic code like this: if you split it up into multiple small functions, it will be much easier to read and to debug.
Just hints, with a minimal sketch below:
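For illustration only (the function names, file names, and column indices are assumptions based on the question, not a definitive implementation):

def read_rows(path):
    # Yield the whitespace-split fields of each line in the file
    with open(path) as f:
        for line in f:
            yield line.split()

def count_triples(rows):
    # Count rows per (column 5, column 6, column 7) triple
    counts = {}
    for fields in rows:
        key = (fields[4], fields[5], fields[6])
        counts[key] = counts.get(key, 0) + 1
    return counts

def write_counts(counts, path):
    # Write one 'col5 col6 col7 count' line per triple
    with open(path, 'w') as f:
        for (c5, c6, c7), n in counts.items():
            f.write('{} {} {} {}\n'.format(c5, c6, c7, n))

write_counts(count_triples(read_rows('infile.txt')), 'outfile.txt')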
Upvotes: 2