user2951046

Reputation: 21

Python data extraction and search

I have data in a file in this format:

+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13
-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12
+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6

and so on....

where the number to the left of each colon is an 'index' and the number to the right is an integer that describes a certain attribute. For each line, if the number on the right of the colon is the same for the same index on another line, I want to store the total number of +1's and -1's in two separate variables. This is my code so far:

for i in lines:
   for word in i:
        if word.find(':')!=-1:
            att = word.split(':', 1)[-1]
            idx = word.split(':', 1)[0]
            for j in lines:
                clas = j.split(' ', 1)[0]
                if word.find(':')!=-1:
                        if idx ==word.split(':', 1)[0]:
                            if att ==word.split(':', 1)[0]:
                                if clas>0:
                                    ifattandyes = ifattandyes+1
                                else:
                                    ifattandno = ifattandno+1

My problem is that att and idx do not seem to update; I think word.find(':') just finds the first instance of a colon and runs with it. Can anyone help?

EDIT:

The above explanation has been confusing. I'm a bit stubborn about how the count of +1s and -1s is acquired. As each pair on each line is read, I want to search through the data for the number of +1 and -1 lines that the pair appears in and store them in two separate variables. The reason for doing so is to calculate the probability of each pair leading to a +1 or a -1.

Upvotes: 0

Views: 104

Answers (4)

Has QUIT--Anony-Mousse

Reputation: 77454

Your first error is in the second line:

for word in i:

Because i is a string, this iterates over each character of the line, not over each word.

You meant to use:

for word in i.split():
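
To see the difference, here is a small sketch using the start of the first sample line (the variable names `line`, `chars` and `words` are only for illustration):

```python
line = "+1 1:4 2:11 3:3"

# Iterating over a string yields single characters.
chars = [word for word in line]

# Splitting on whitespace yields the label and the index:value pairs.
words = [word for word in line.split()]

print(chars[:4])   # ['+', '1', ' ', '1']
print(words)       # ['+1', '1:4', '2:11', '3:3']
```

With single characters, word.find(':') only ever matches the colon itself, so att and idx never hold the numbers you expect.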

Upvotes: 0

damienfrancois

Reputation: 59090

Here is a suggestion (provided I understand the question correctly):

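#!/usr/bin/env python
from collections import defaultdict

# Per-pair counts of lines labelled +1 and -1.
positives = defaultdict(int)
negatives = defaultdict(int)

with open('data') as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        is_positive = fields[0] == '+1'
        for pair in fields[1:]:
            # Update both dicts so every pair has an entry in each.
            positives[pair] += is_positive
            negatives[pair] += not is_positive

for key in positives:
    print("%s\t+1: %d\t-1: %d" % (key, positives[key], negatives[key]))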

Applied to the following data:

$ cat data
+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13
-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12
+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6

it gives:

$ python t.py 
9:2     +1: 1   -1: 1
9:3     +1: 1   -1: 0
8:2     +1: 2   -1: 0
10:6    +1: 1   -1: 0
6:13    +1: 1   -1: 0
10:13   +1: 1   -1: 0
10:12   +1: 0   -1: 1
2:7     +1: 0   -1: 1
2:6     +1: 1   -1: 0
4:11    +1: 1   -1: 0
4:12    +1: 0   -1: 1
4:2     +1: 1   -1: 0
1:2     +1: 0   -1: 1
1:4     +1: 2   -1: 0
3:3     +1: 2   -1: 0
5:1     +1: 1   -1: 0
3:4     +1: 0   -1: 1
5:3     +1: 1   -1: 1
8:12    +1: 0   -1: 1
7:4     +1: 2   -1: 0
7:3     +1: 0   -1: 1
2:11    +1: 1   -1: 0
6:5     +1: 1   -1: 0
6:4     +1: 0   -1: 1

Upvotes: 3

DSM

Reputation: 353039

I'll make this community wiki because it's too close (in spirit, anyway) to an answer already posted, but it has a few advantages:

from collections import Counter

# Map each class label to a Counter of the index:value pairs seen with it.
counts = {}
with open("datafile.dat") as fp:
    for line in fp:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        sign, keys = parts[0], parts[1:]
        counts.setdefault(sign, Counter()).update(keys)

all_keys = set().union(*counts.values())
for key in sorted(all_keys):
    per_label = ' '.join('{}: {}'.format(c, counts[c].get(key, 0)) for c in counts)
    print('{:8} {}'.format(key, per_label))

which produces

10:12    +1: 0 -1: 1
10:13    +1: 1 -1: 0
10:6     +1: 1 -1: 0
1:2      +1: 0 -1: 1
1:4      +1: 2 -1: 0
[etc.]

Note that nowhere is there any reference to +1 or -1; sign can really be anything.
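
As a rough sketch of that last point, the same counting works with arbitrary labels, and the per-label counts can be turned into the fractions the question's edit asks about. The function names `count_by_label` and `label_fractions` and the labels `spam` and `ham` are made up for this example, and the fractions are plain relative frequencies with no smoothing:

```python
from collections import Counter

def count_by_label(lines):
    """Map each label to a Counter of the index:value pairs seen with it."""
    counts = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        counts.setdefault(parts[0], Counter()).update(parts[1:])
    return counts

def label_fractions(counts, key):
    """Fraction of the occurrences of `key` that fall under each label."""
    total = sum(c[key] for c in counts.values())
    if total == 0:
        return {}  # pair never seen, so no fractions to report
    return {label: c[key] / float(total) for label, c in counts.items()}

# Hypothetical data: the labels are arbitrary strings, not +1 / -1.
sample = [
    "spam 1:4 2:11 3:3",
    "ham 1:2 2:7 3:3",
    "spam 1:4 2:6 3:3",
]
counts = count_by_label(sample)
print(label_fractions(counts, "1:4"))
print(label_fractions(counts, "3:3"))
```

Here 1:4 occurs only on spam lines, while 3:3 is split two to one between spam and ham.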

Upvotes: 0

Geoff Gerrietts

Reputation: 676

I'm not sure if I've got this or not.

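# input_file is assumed to be an open file object (or any iterable of lines).
tot_up = {}
tot_dn = {}
for line in input_file:
    parts = line.split()   # split on whitespace
    if not parts:
        continue           # skip blank lines
    up_or_down = parts[0]
    parts = parts[1:]
    # Pick the dict that matches this line's label.
    if up_or_down == '-1':
        store = tot_dn
    else:
        store = tot_up
    for part in parts:
        store[part] = store.get(part, 0) + 1
print("Total +1s: %d" % sum(tot_up.values()))
print("Total -1s: %d" % sum(tot_dn.values()))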

What this does not do, but could be done easily enough, is strip out the att:val pairs where no match was found.

But I'm not sure I've understood your requirements properly.
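
The stripping step mentioned above could be sketched as follows. This assumes that a "match" means the same att:val pair appears on at least two lines in total, and that a pair occurs at most once per line; the helper name `drop_unmatched` and the sample dicts are made up for illustration:

```python
def drop_unmatched(tot_up, tot_dn):
    """Keep only pairs seen on at least two lines in total (assumed meaning of "match")."""
    matched = set()
    for pair in set(tot_up) | set(tot_dn):
        if tot_up.get(pair, 0) + tot_dn.get(pair, 0) >= 2:
            matched.add(pair)
    up = {p: n for p, n in tot_up.items() if p in matched}
    dn = {p: n for p, n in tot_dn.items() if p in matched}
    return up, dn

# Hypothetical totals of the kind built by the loop above.
tot_up = {"1:4": 2, "9:2": 1, "2:11": 1}
tot_dn = {"9:2": 1, "1:2": 1}
up, dn = drop_unmatched(tot_up, tot_dn)
print(up)
print(dn)
```

Pairs such as 2:11 and 1:2, which occur on a single line only, are dropped from both dicts.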

Upvotes: 1
