alvas
alvas

Reputation: 122250

How to extract rows with non-zeros column values?

Given a tsv file like this:

doc_id/query_id 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
1000001 0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1000002 0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The first row is the header role with the doc_id/query_id as the first column header and 150 integer from [1,150].

The value rows is made up of an ID in the first column and zeroes or ones other columns.

The goal is to extract pairs of the IDs and the names of the columns where it's non-zero, e.g. given the two rows of data above the desired output is:

1000001 4
1000001 9
1000002 7
1000002 8

There are 800,000 rows in the data, so I'll avoid pandas and use sframe, I've tried:

import turicreate as tc
from tqdm import tqdm

df = tc.SFrame('data.tsv')

with open('ground_truth.non-zeros.tsv', 'w') as fout:
    for i in tqdm(range(len(df))):
        for j in range(1,151):
            if df[i][str(j)]:
                print(df[i]['doc_id/query_id', j)

Is there a simpler way to extract the non-zeros values and the row IDs?

Pandas solutions or other dataframe solutions are appreciated too! Please do state the limitations if known and if any =)

Upvotes: 2

Views: 293

Answers (3)

BENY
BENY

Reputation: 323366

Here is one way based on numpy, I think should slightly speed up the whole process

t,v=np.where(df.iloc[:,1:]==1)
list(zip(df['doc_id/query_id'].iloc[t],df.columns[v+1]))
Out[135]: [(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')]

Upvotes: 2

C.Nivs
C.Nivs

Reputation: 13106

A non-pandas answer, you could just iterate over your file, and grab the columns where necessary:

results = []

with open('yourfile.csv') as fh:
    headers = next(fh).split()
    for line in fh:
        _id, *line = line.split()
        non_zero = [{_id: header} for header, val in zip(headers[1:], line) if val!="0"]
        results.extend(non_zero)

# Where you now have the option to throw it into whatever data structure you want
results

[{'1000001': '4'}, {'1000001': '9'}, {'1000002': '7'}, {'1000002': '8'}]

This way you don't load the entire file into memory, you only grab what you need, though you do pay for the list.extend operation

Upvotes: 1

cs95
cs95

Reputation: 403012

Here's a pandaic approach using stack and query:

(df.set_index('doc_id/query_id')
   .stack()
   .to_frame('tmp')
   .query('tmp == 1')
   .index
   .values)

array([(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')],
      dtype=object)

This is an elegance first, performance later approach.


You can also start with numpy, this is for max performance.

arr = np.loadtxt(filename, skiprows=1, usecols=np.r_[1:151], dtype=int)
index = np.loadtxt(filename, skiprows=1, usecols=[0], dtype=int)

r, c = np.where(arr)
np.column_stack([index[r], c+1])

array([[1000001,       4],
       [1000001,       9],
       [1000002,       7],
       [1000002,       8]])

Upvotes: 2

Related Questions