Reputation: 3675
I have a file that looks like this:
2 qid:1 1:0.32 2:0.50 3:0.78 4:0.02 10:0.90
5 qid:2 2:0.22 5:0.34 6:0.87 10:0.56 12:0.32 19:0.24 20:0.55
...
The structure I want is as follows:
output = []
rel = 2
qid = 1
features = {}  # the feature list "1:0.32 2:0.50 3:0.78 4:0.02 10:0.90"
output.append([rel, qid, features])
...
How can I write Python code to load this data? Thanks.
Upvotes: 0
Views: 232
Reputation: 7592
It looks like your input file is in svmlight format. If so, scikit-learn includes a parser for it (sklearn.datasets.load_svmlight_file) that might be handy to use. See the source at:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/svmlight_format.py#L32
Upvotes: 0
Reputation: 309
The following should work nicely and leave your data in a handy format:
import numpy as np

# one capture group per field: relevance label, query id, raw feature string
regexp = r"(\d+)\s+qid:(\d+)\s+(.+)"
data = np.fromregex(file_name, regexp,
                    dtype=[('rel', int), ('qid', int), ('features', object)])
From here you can select rel, qid or features by indexing with the field name:
>>> data['rel']
array([2, 5])
>>> data['qid']
array([1, 2])
>>> data['features']
array(['1:0.32 2:0.50 3:0.78 4:0.02 10:0.90',
'2:0.22 5:0.34 6:0.87 10:0.56 12:0.32 19:0.24 20:0.55'], dtype=object)
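Each entry in the features field is still a raw string. A minimal sketch of turning one such string into a dictionary, assuming every token has the form index:value. The name parse_features is a hypothetical helper, not part of NumPy:

```python
def parse_features(feature_str):
    """Convert a string like '1:0.32 2:0.50' into {1: 0.32, 2: 0.5}."""
    features = {}
    for token in feature_str.split():
        # Assumption: each token is exactly "index:value"
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return features


row = "1:0.32 2:0.50 3:0.78 4:0.02 10:0.90"
parsed = parse_features(row)
```

Applied to the array above, something like [parse_features(s) for s in data['features']] would give one dictionary per row.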
Upvotes: 0
Reputation: 818
For reading, use something like this (the data is in the file named by fname):
with open(fname) as f:
    for line in f:
        elements = line.split()
        rel = int(elements[0])
        qid = int(elements[1].split(':')[1])
        featurelist = elements[2:]
        # split each feature at ':' to recover its index and value
        # you get the idea ...
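To fill in the part left as an exercise, here is one possible sketch that builds the [rel, qid, features] list asked for in the question. It assumes every line is well formed ("rel qid:N index:value ...") and the function name load_ranking_data is made up for illustration:

```python
def load_ranking_data(lines):
    """Parse lines like '2 qid:1 1:0.32 2:0.50' into [rel, qid, features] lists."""
    output = []
    for line in lines:
        elements = line.split()
        if not elements:
            continue  # skip blank lines
        rel = int(elements[0])
        qid = int(elements[1].split(":")[1])
        features = {}
        for token in elements[2:]:
            # Assumption: each remaining token is "index:value"
            index, value = token.split(":")
            features[int(index)] = float(value)
        output.append([rel, qid, features])
    return output


sample = [
    "2 qid:1 1:0.32 2:0.50 3:0.78 4:0.02 10:0.90\n",
    "5 qid:2 2:0.22 5:0.34 6:0.87 10:0.56 12:0.32 19:0.24 20:0.55\n",
]
data = load_ranking_data(sample)
```

Because the function only iterates over lines, an open file object can be passed in place of the sample list, for example inside a with open(fname) as f: block.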
Upvotes: 1