Reputation: 122250
Given a TSV file like this:
doc_id/query_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
1000001 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1000002 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The first row is the header row, with doc_id/query_id as the first column header followed by the 150 integers from [1,150].
Each value row consists of an ID in the first column and zeros or ones in the other columns.
The goal is to extract pairs of IDs and the names of the columns where the value is non-zero, e.g. given the two rows of data above, the desired output is:
1000001 4
1000001 9
1000002 7
1000002 8
There are 800,000 rows in the data, so I want to avoid pandas and use SFrame instead. I've tried:
import turicreate as tc
from tqdm import tqdm

df = tc.SFrame('data.tsv')
with open('ground_truth.non-zeros.tsv', 'w') as fout:
    for i in tqdm(range(len(df))):
        for j in range(1, 151):
            if df[i][str(j)]:
                print(df[i]['doc_id/query_id'], j, sep='\t', file=fout)
Is there a simpler way to extract the non-zero values together with the row IDs?
Pandas solutions or other dataframe solutions are appreciated too! Please do state any known limitations =)
Upvotes: 2
Views: 293
Reputation: 323366
Here is one way based on numpy; it should speed up the whole process noticeably:
t, v = np.where(df.iloc[:, 1:] == 1)
list(zip(df['doc_id/query_id'].iloc[t], df.columns[v + 1]))
Out[135]: [(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')]
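For reference, a minimal self-contained sketch of this approach; the DataFrame here is built inline (truncated to 10 value columns) to stand in for the real file:

```python
import numpy as np
import pandas as pd

# Two sample rows mirroring the question's data, truncated to 10 columns.
df = pd.DataFrame(
    [[1000001, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
     [1000002, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]],
    columns=['doc_id/query_id'] + [str(i) for i in range(1, 11)],
)

# Row/column indices of every 1 in the value columns.
t, v = np.where(df.iloc[:, 1:] == 1)

# Map row indices back to IDs and column indices back to header names
# (v + 1 skips the ID column when indexing df.columns).
pairs = [(int(i), c) for i, c in zip(df['doc_id/query_id'].iloc[t], df.columns[v + 1])]
print(pairs)  # [(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')]
```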
Upvotes: 2
Reputation: 13106
A non-pandas answer: you could just iterate over your file and grab the columns where necessary:
results = []
with open('yourfile.csv') as fh:
    headers = next(fh).split()
    for line in fh:
        _id, *line = line.split()
        non_zero = [{_id: header} for header, val in zip(headers[1:], line) if val != "0"]
        results.extend(non_zero)

# Where you now have the option to throw it into whatever data structure you want
results
[{'1000001': '4'}, {'1000001': '9'}, {'1000002': '7'}, {'1000002': '8'}]
This way you don't load the entire file into memory and only grab what you need, though you do pay for the list.extend operation.
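If the pairs are only ever written back out, a variant of the same loop can stream them straight to the output file and skip the list entirely. A minimal sketch (the filenames are illustrative; the sample input is built inline so the snippet runs on its own):

```python
# Build a tiny two-row sample file mirroring the question's data
# (value columns truncated to 10).
with open('data.tsv', 'w') as f:
    f.write('doc_id/query_id ' + ' '.join(str(i) for i in range(1, 11)) + '\n')
    f.write('1000001 0 0 0 1 0 0 0 0 1 0\n')
    f.write('1000002 0 0 0 0 0 0 1 1 0 0\n')

# Stream (id, column-name) pairs straight to the output file;
# only one input line is held in memory at a time.
with open('data.tsv') as fh, open('non-zeros.tsv', 'w') as fout:
    headers = next(fh).split()
    for line in fh:
        _id, *values = line.split()
        for header, val in zip(headers[1:], values):
            if val != '0':
                fout.write(f'{_id}\t{header}\n')
```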
Upvotes: 1
Reputation: 403012
Here's a pandaic approach using stack
and query
:
(df.set_index('doc_id/query_id')
   .stack()
   .to_frame('tmp')
   .query('tmp == 1')
   .index
   .values)
array([(1000001, '4'), (1000001, '9'), (1000002, '7'), (1000002, '8')],
      dtype=object)
This is an elegance-first, performance-later approach. For maximum performance, you can start with numpy instead:
arr = np.loadtxt(filename, skiprows=1, usecols=np.r_[1:151], dtype=int)
index = np.loadtxt(filename, skiprows=1, usecols=[0], dtype=int)
r, c = np.where(arr)
np.column_stack([index[r], c+1])
array([[1000001, 4],
       [1000001, 9],
       [1000002, 7],
       [1000002, 8]])
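Note that the two loadtxt calls above scan the file twice. Since the IDs are integers as well, a single pass can load everything and the array can then be split, which may matter for an 800,000-row file. A sketch of that variant (the sample file is built inline and the filename is hypothetical):

```python
import numpy as np

# Tiny sample file standing in for the real data (10 value columns).
with open('data.tsv', 'w') as f:
    f.write('doc_id/query_id ' + ' '.join(map(str, range(1, 11))) + '\n')
    f.write('1000001 0 0 0 1 0 0 0 0 1 0\n')
    f.write('1000002 0 0 0 0 0 0 1 1 0 0\n')

# One loadtxt pass reads IDs and values together, avoiding a second scan.
data = np.loadtxt('data.tsv', skiprows=1, dtype=int)
index, arr = data[:, 0], data[:, 1:]

# Same np.where trick as above; c + 1 recovers the 1-based column names.
r, c = np.where(arr)
pairs = np.column_stack([index[r], c + 1])
print(pairs)
```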
Upvotes: 2