Tartaglia

Reputation: 1041

Save dataframe as txt with specific delimiters

I would like to save my dataframe in txt format with specific delimiters (libsvm format), to look like this:

1 qid:0 0:1.465648768921554 1:-0.2257763004865357 2:0.06752820468792384 3:-1.424748186213457 4:-0.5443827245251827
1 qid:0 0:1.465648768921554 1:-0.2257763004865357 2:0.06752820468792384 3:-1.424748186213457 4:-0.5443827245251827
2 qid:0 0:0.7384665799954104 1:0.1713682811899705 2:-0.1156482823882405 3:-0.3011036955892888 4:-1.478521990367427

Notice that the first two columns (the label and the qid) are separated by spaces, and every remaining column is an index:value pair, where the integer before the colon identifies the feature column.
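To make the layout concrete, here is a minimal sketch in plain Python that builds one such line. The function name to_libsvm_line is made up for illustration, and it assumes feature indices start at 0, as in the sample above.

```python
def to_libsvm_line(label, qid, features):
    """Build '<label> qid:<qid> 0:<v0> 1:<v1> ...' with zero-based feature indices."""
    # Each feature becomes an 'index:value' pair; pairs are joined by single spaces
    pairs = " ".join(f"{index}:{value}" for index, value in enumerate(features))
    return f"{label} qid:{qid} {pairs}"


# Hypothetical row: label 1, query id 0, two feature values
line = to_libsvm_line(1, 0, [0.5, -1.25])
print(line)
```

This only illustrates the target format; it does not handle sparse rows or comments.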

This is my current dataset:

import pandas as pd

data = {'label': [2, 3, 2],
        'qid': ['qid:0', 'qid:1', 'qid:0'],
        '0': [0, 0, 0],
        '0': [0.4967, 0.4967, 0.4967],
        '1': [1, 1, 1],
        '1': [0.4967, 0.4967, 0.4967],
        '2': [2, 2, 2],
        '2': [0.4967, 0.4967, 0.4967],
        '3': [3, 3, 3],
        '2': [0.4967, 0.4967, 0.4967],
        '4': [4, 4, 4]}

df = pd.DataFrame(data)

Is there a way to save this as txt to match that format exactly?

For context, my machine learning model was trained on a dataset in this specific txt format, and I need to match it to use it for my own dataset.

Upvotes: 0

Views: 40

Answers (1)

Tartaglia

Reputation: 1041

A similar question has already been answered: scikit-learn has a dedicated function for this, dump_svmlight_file.

For this particular case, you need to pass the qid column as query_id, undo the modifications so that qid holds plain integers (0 rather than 'qid:0'), and remove the additional integer columns:

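import pandas as pd
from sklearn.datasets import dump_svmlight_file

def df_to_libsvm(df: pd.DataFrame):
    # Features: every column except the label and the query id
    x = df.drop(columns=['label', 'qid'])
    y = df['label']
    # qid must already hold plain integers (0, 1, ...), not strings like 'qid:0'
    query_id = df['qid']
    # zero_based=True numbers the feature indices from 0
    dump_svmlight_file(X=x, y=y, f='libsvm.dat', query_id=query_id, zero_based=True)

df_to_libsvm(df)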
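As a rough sketch of the clean-up step, assuming the question's data reduced to one float column per feature: the helper name prepare_for_libsvm and the in-memory buffer (used here instead of 'libsvm.dat' so the result can be inspected) are illustrative choices, not part of the original code.

```python
import io

import pandas as pd
from sklearn.datasets import dump_svmlight_file

# Assumed input: one float column per feature, qid still stored as 'qid:<n>' strings
df = pd.DataFrame({
    'label': [2, 3, 2],
    'qid': ['qid:0', 'qid:1', 'qid:0'],
    '0': [0.4967, 0.4967, 0.4967],
    '1': [0.4967, 0.4967, 0.4967],
    '2': [0.4967, 0.4967, 0.4967],
    '3': [0.4967, 0.4967, 0.4967],
    '4': [0.4967, 0.4967, 0.4967],
})


def prepare_for_libsvm(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: convert 'qid:0' style strings to the integer 0."""
    out = frame.copy()
    out['qid'] = out['qid'].str.split(':').str[1].astype(int)
    return out


clean = prepare_for_libsvm(df)

# dump_svmlight_file accepts a path or a file-like object opened in binary mode
buffer = io.BytesIO()
dump_svmlight_file(
    X=clean.drop(columns=['label', 'qid']),
    y=clean['label'],
    f=buffer,
    query_id=clean['qid'],
    zero_based=True,
)
text = buffer.getvalue().decode('utf-8')
print(text)
```

Once the printed lines look right, replacing the buffer with a file name writes the same content to disk.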

Upvotes: 1
