Reputation: 1135
Here is input.txt file
Jan_Feb 0.11
Jan_Mar -1.11
Jan_Apr 0.2
Feb_Jan 0.11
Feb_Mar -3.0
Mar_Jan -1.11
Mar_Feb -3.0
Mar_Apr 3.5
from this file, I am trying to create a dictionary from the input text file. 1) The keys are two values which is split with "_" from 1st column string of input file. 2) Moreover, if the name of column and row are same (such as Jan and Jan), write 0.0 as follows. 3) Lastly, if the keys are not found in the dictionary, write "NA". Output.txt
Jan Feb Mar Apr
Jan 0.0 0.11 -1.11 0.2
Feb 0.11 0.0 -3.0 NA
Mar -1.11 -3.0 0.0 3.5
Apr 0.2 NA 3.5 0.0
I would really appreciate if someone can help me figure out.
Actually, there are about 100,000,000 rows * 2 columns in real input.txt. The name of Thank you so much in advance.
Upvotes: 0
Views: 156
Reputation: 82600
What you first need to do is get the data in an understandable format. So, first, you need to create a row. I would get the data like so:
with open('test.txt') as f:
data = [(l.split()[0].split('_'), l.split()[1]) for l in f]
# Example:
# [(['Jan', 'Feb'], '0.11'), (['Jan', 'Mar'], '-1.11'), (['Jan', 'Apr'], '0.2'), (['Feb', 'Jan'], '0.11'), (['Feb', 'Mar'], '-3.0'), (['Mar', 'Jan'], '-1.11'), (['Mar', 'Feb'], '-3.0'), (['Mar', 'Apr'], '3.5')]
headers = set([var[0][0] for var in data] + [var[0][1] for var in data])
# Example:
# set(['Jan', 'Apr', 'Mar', 'Feb'])
What you then need to do is create a mapping from your headers
to your values, which are stored in data
. Ideally, you would need to create a table. Take a look at this answer to help you figure out how to do that (we can't write your code for you).
Secondly, in order to print things out properly, you will need to use the format
method. Ideally, it will help you deal with strings, and printing them in a specific fashion.
After that you can simply write like so with open('output.txt', 'w')
.
Upvotes: 0
Reputation: 16007
Others might disagree with this but one solution would be simply to read all 100 million lines into a relational database table (appropriately split
-ing out what you need, of course) using a module that interfaces with MySQL or SQLite:
Your_Table:
ID
Gene_Column
Gene_Row
Value
Once they're in there, you can query against the table in something that resembles English:
Get all of the column headings:
select distinct Gene_Column from Your_Table order by Gene_Column asc
Get all of the values for a particular row, and which columns they're in:
select Gene_Column, Value from Your_Table where Gene_Row = "Some_Name"
Get the value for a particular cell:
select Value from Your_Table where Gene_Row = "Some_Name" and Gene_Column = "Another_Name"
That, and you really don't want to shuffle around 100 million records any more than you have to. Reading all of them into memory may be problematic as well. Doing it this way, you can construct your matrix one row at a time, and output the row to your file.
It might not be the fastest, but it will probably be pretty clear and straightforward code.
Upvotes: 1
Reputation: 39403
Given the size of your input, I would split this in several passes on your file:
Read the file again to find the values to put in the matrix. There are several options.
Upvotes: 0
Reputation: 493
1) Determine all of the possible headers for resulting column/row. In your example, that is A-D. How you do this can vary. You can parse the file 2x (not ideal, but it might be necessary), or perhaps you have someplace you can refer to for the distinct columns.
2) Establish headers. In the example above, you would have headers=["A","B","C","D"]. You can build this up during #1 if you have to parse the first column. Use len(indexes) to determine N
3) Parse the data, this time consider both columns. You will get the two keys using .split("_") on the first column, then you get the index for your data by doing simple arithmetic:
x,y = [headers.index(a) for a in row[0].split("_")]
data[x+y*len(headers)] = row[1]
This should be relatively fast, except for the parsing of the file twice. If it can fit into memory, you could load your file into memory and then scan over it twice, or use command line tricks to establish those header entries.
-- I should say that you will need to determine N before you begin storing the actual data. (i.e. data=[0]*N). Also, you'll need to use x+y*len(headers) during the save as well. If you are using numpy, you can use reshape to get an actual row/col layout which will be a little easier to manipulate and print (i.e. data[x,y]=row[1])
If you do a lot of large data manipulation, especially if you might be performing calculations, you really should look into learning numpy (www.numpy.org).
Upvotes: 0
Reputation: 794
If you want a dictionary as a result, something like that:
dico = {}
keyset=set()
with open('input.txt','r') as file:
line = file.readline()
keys = line.split('\t')[0]
value = line.split('\t')[1]
key1 = keys.split('_')[0]
keyset.add(key1)
key2 = keys.split('_')[1]
keyset.add(key2)
if key1 not in dico:
dico[key1] = {}
dico[key1][key2] = value
for key in keyset:
dico[key][key] = 0.0
for secondkey in keyset:
if secondkey not in dico[key].keys():
dico[key][secondkey]="NA"
Upvotes: 0
Reputation: 344
matrix = dict()
with open('inpu.txt') as f:
content = f.read()
tmps = content.split('\n')
for tmp in tmps:
s = tmp.split(' ')
latter = s[0].split('_')
try:
if latter[0] in matrix:
matrix[latter[0]][latter[1]] = s[1]
else:
matrix[latter[0]] = dict()
matrix[latter[0]][latter[1]] = s[1]
except:
pass
print matrix
And now in matrix you have table what u want.
Upvotes: 0