learner

Reputation: 905

Mahout : To read a custom input file

I was playing with Mahout and found that FileDataModel accepts data in the format

     userId,itemId,pref (long,long,double)

My data is in the format

     String,long,double

What is the best/easiest way to work with this dataset in Mahout?

Upvotes: 6

Views: 3673

Answers (3)

KlwntSingh

Reputation: 1092

userId and itemId can be strings. This CustomFileDataModel converts each string into an integer ID and keeps the (String, ID) map in memory, so after generating recommendations you can look up the original string from its ID.
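
The in-memory (String, ID) map described above could be sketched roughly as follows. StringIdMap and its methods are hypothetical names chosen for illustration, not Mahout classes, and the sketch assumes all distinct strings fit in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): maps string IDs to long IDs and back. */
final class StringIdMap {
    private final Map<String, Long> toLong = new HashMap<>();
    private final List<String> toKey = new ArrayList<>();

    /** Returns the ID already assigned to key, or assigns the next unused one. */
    long idFor(String key) {
        Long existing = toLong.get(key);
        if (existing != null) {
            return existing;
        }
        long id = toKey.size(); // IDs are 0, 1, 2, ... in first-seen order
        toLong.put(key, id);
        toKey.add(key);
        return id;
    }

    /** Reverse lookup; throws if the ID was never assigned. */
    String keyFor(long id) {
        return toKey.get(Math.toIntExact(id));
    }
}
```

A data model would call idFor while parsing each line, and keyFor when turning recommended IDs back into the original strings.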

Upvotes: 3

Eyal

Reputation: 3513

One way to do this is to extend FileDataModel. You'll need to override the readUserIDFromString(String value) method and use some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.

For example, assuming you have an initialized MemoryIDMigrator, you could do this:

@Override
protected long readUserIDFromString(String stringID) {
    long result = memoryIDMigrator.toLongID(stringID); 
    memoryIDMigrator.storeMapping(result, stringID);
    return result;
}

This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
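
If the reverse mapping isn't needed, the hashing route could look roughly like the sketch below. It assumes (from memory, so worth checking against the Mahout source) that AbstractIDMigrator packs the first 8 bytes of the string's MD5 digest into a long. StringIdHasher is a hypothetical name, not a Mahout class.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of Mahout): derives a long ID from a string. */
final class StringIdHasher {
    /** Packs the first 8 bytes of the MD5 digest of value into a long, big-endian. */
    static long hash(String value) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(value.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                // Mask with 0xFFL so a negative byte does not sign-extend.
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The same string always yields the same ID, so no state has to be kept. The trade-offs are that the original string cannot be recovered from the ID, and two strings could in principle collide.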

Upvotes: 3

Spike Gronim

Reputation: 6182

Assuming that your input fits in memory, loop through it and track the ID for each string in a dictionary. If it does not fit in memory, sort the input by the string field and then group by that field to get the same result.

In Python:

import sys

next_id = 0
str_to_id = {}  # maps each string ID to its assigned integer ID
for line in sys.stdin:
    fields = line.strip().split(',')
    # Assign the next integer ID the first time a string is seen.
    this_id = str_to_id.get(fields[0])
    if this_id is None:
        next_id += 1
        this_id = next_id
        str_to_id[fields[0]] = this_id
    fields[0] = str(this_id)

    print(','.join(fields))
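
For the case where the input does not fit in memory, the sort-then-group idea could be sketched as below, in Java since that is what Mahout itself uses. It assumes the lines were already sorted by their first field (for example by an external sort), so equal strings are adjacent and only the previous key has to be remembered. SortedIdAssigner is a hypothetical name.

```java
import java.io.PrintWriter;
import java.util.Iterator;

/** Hypothetical helper: replaces the first CSV field with a numeric ID, streaming. */
final class SortedIdAssigner {
    /**
     * Assumes sortedLines is sorted by first field, so a change of key
     * means a string that has not been seen before.
     */
    static void assignIds(Iterator<String> sortedLines, PrintWriter out) {
        String previousKey = null;
        long currentId = 0;
        while (sortedLines.hasNext()) {
            String line = sortedLines.next().trim();
            int comma = line.indexOf(',');
            String key = comma < 0 ? line : line.substring(0, comma);
            String rest = comma < 0 ? "" : line.substring(comma);
            if (!key.equals(previousKey)) {
                currentId++; // first line of a new group gets the next ID
                previousKey = key;
            }
            out.print(currentId + rest);
            out.print('\n');
        }
        out.flush();
    }
}
```

Note that this sketch does not record which string got which ID; if you need the reverse mapping later, you would have to write the (string, ID) pairs out as well.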

Upvotes: 1
