user2913161
user2913161

Reputation: 1135

Python_How to parse data and make it into new file

Here is input.txt file

Jan_Feb    0.11
Jan_Mar   -1.11
Jan_Apr    0.2
Feb_Jan    0.11
Feb_Mar   -3.0
Mar_Jan   -1.11
Mar_Feb   -3.0
Mar_Apr    3.5

from this file, I am trying to create a dictionary from the input text file. 1) The keys are two values which is split with "_" from 1st column string of input file. 2) Moreover, if the name of column and row are same (such as Jan and Jan), write 0.0 as follows. 3) Lastly, if the keys are not found in the dictionary, write "NA". Output.txt

         Jan     Feb       Mar      Apr
 Jan     0.0     0.11     -1.11     0.2
 Feb     0.11    0.0      -3.0      NA
 Mar    -1.11    -3.0      0.0      3.5    
 Apr     0.2     NA        3.5      0.0

I would really appreciate if someone can help me figure out.
Actually, there are about 100,000,000 rows * 2 columns in real input.txt. The name of Thank you so much in advance.

Upvotes: 0

Views: 156

Answers (6)

Nafiul Islam
Nafiul Islam

Reputation: 82600

What you first need to do is get the data in an understandable format. So, first, you need to create a row. I would get the data like so:

with open('test.txt') as f:
    data = [(l.split()[0].split('_'), l.split()[1]) for l in f]
    # Example:
    # [(['Jan', 'Feb'], '0.11'), (['Jan', 'Mar'], '-1.11'), (['Jan', 'Apr'], '0.2'),                   (['Feb', 'Jan'], '0.11'), (['Feb', 'Mar'], '-3.0'), (['Mar', 'Jan'], '-1.11'), (['Mar', 'Feb'], '-3.0'), (['Mar', 'Apr'], '3.5')]
    headers = set([var[0][0] for var in data] + [var[0][1] for var in data])
    # Example:
    # set(['Jan', 'Apr', 'Mar', 'Feb'])

What you then need to do is create a mapping from your headers to your values, which are stored in data. Ideally, you would need to create a table. Take a look at this answer to help you figure out how to do that (we can't write your code for you).

Secondly, in order to print things out properly, you will need to use the format method. Ideally, it will help you deal with strings, and printing them in a specific fashion.

After that you can simply write like so with open('output.txt', 'w').

Upvotes: 0

John
John

Reputation: 16007

Others might disagree with this but one solution would be simply to read all 100 million lines into a relational database table (appropriately split-ing out what you need, of course) using a module that interfaces with MySQL or SQLite:

Your_Table:
    ID
    Gene_Column
    Gene_Row
    Value

Once they're in there, you can query against the table in something that resembles English:

Get all of the column headings:

select distinct Gene_Column from Your_Table order by Gene_Column asc

Get all of the values for a particular row, and which columns they're in:

select Gene_Column, Value from Your_Table where Gene_Row = "Some_Name"

Get the value for a particular cell:

select Value from Your_Table where Gene_Row = "Some_Name" and Gene_Column = "Another_Name"

That, and you really don't want to shuffle around 100 million records any more than you have to. Reading all of them into memory may be problematic as well. Doing it this way, you can construct your matrix one row at a time, and output the row to your file.

It might not be the fastest, but it will probably be pretty clear and straightforward code.

Upvotes: 1

njzk2
njzk2

Reputation: 39403

Given the size of your input, I would split this in several passes on your file:

  1. First pass to identify the headers (loop on all lines, read the first group, find the headers)
  2. Read the file again to find the values to put in the matrix. There are several options.

    • The simplest (but slow) option is to read the whole file for each line of your matrix, identifying only the lines you need in the file. This way, you only have one line of the matrix in memory at a time
    • From your example file, it seems your input file is properly sorted. If it is not, it can be a good idea to sort it. This way, you know that the line in the input file you are reading is the next cell in your matrix (except of course for the 0 diagonal which you only need to add.)

Upvotes: 0

user632657
user632657

Reputation: 493

1) Determine all of the possible headers for resulting column/row. In your example, that is A-D. How you do this can vary. You can parse the file 2x (not ideal, but it might be necessary), or perhaps you have someplace you can refer to for the distinct columns.

2) Establish headers. In the example above, you would have headers=["A","B","C","D"]. You can build this up during #1 if you have to parse the first column. Use len(indexes) to determine N

3) Parse the data, this time consider both columns. You will get the two keys using .split("_") on the first column, then you get the index for your data by doing simple arithmetic:

x,y = [headers.index(a) for a in row[0].split("_")]
data[x+y*len(headers)] = row[1]

This should be relatively fast, except for the parsing of the file twice. If it can fit into memory, you could load your file into memory and then scan over it twice, or use command line tricks to establish those header entries.

-- I should say that you will need to determine N before you begin storing the actual data. (i.e. data=[0]*N). Also, you'll need to use x+y*len(headers) during the save as well. If you are using numpy, you can use reshape to get an actual row/col layout which will be a little easier to manipulate and print (i.e. data[x,y]=row[1])

If you do a lot of large data manipulation, especially if you might be performing calculations, you really should look into learning numpy (www.numpy.org).

Upvotes: 0

Rudy Bunel
Rudy Bunel

Reputation: 794

If you want a dictionary as a result, something like that:

dico = {}
keyset=set()
with open('input.txt','r') as file:
    line = file.readline()
    keys = line.split('\t')[0]
    value = line.split('\t')[1]
    key1 = keys.split('_')[0]
    keyset.add(key1)
    key2 = keys.split('_')[1]
    keyset.add(key2)

    if key1 not in dico:
        dico[key1] = {}
    dico[key1][key2] = value

for key in keyset:
    dico[key][key] = 0.0
    for secondkey in keyset:
        if secondkey not in dico[key].keys():
            dico[key][secondkey]="NA" 

Upvotes: 0

Naster
Naster

Reputation: 344

matrix = dict()
with open('inpu.txt') as f:
   content = f.read()
   tmps = content.split('\n')
   for tmp in tmps:
     s = tmp.split(' ')
     latter = s[0].split('_')
     try:
       if latter[0] in matrix:
         matrix[latter[0]][latter[1]] = s[1]
       else:
         matrix[latter[0]] = dict()
         matrix[latter[0]][latter[1]] = s[1]
     except:
       pass
print matrix

And now in matrix you have table what u want.

Upvotes: 0

Related Questions