Reputation: 1041
I would like to save my dataframe in txt format with specific delimiters (libsvm format), to look like this:
1 qid:0 0:1.465648768921554 1:-0.2257763004865357 2:0.06752820468792384 3:-1.424748186213457 4:-0.5443827245251827
1 qid:0 0:1.465648768921554 1:-0.2257763004865357 2:0.06752820468792384 3:-1.424748186213457 4:-0.5443827245251827
2 qid:0 0:0.7384665799954104 1:0.1713682811899705 2:-0.1156482823882405 3:-0.3011036955892888 4:-1.478521990367427
Notice that the first two fields (the label and the qid) are separated by a space, and every remaining field is an index:value pair, where the integer before the colon identifies the column.
This is my current dataset:
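import pandas as pd

# Each feature appears twice: an integer index column, then its value column.
# Note: a dict literal keeps only the last value for a repeated key.
data = {'label': [2, 3, 2],
        'qid': ['qid:0', 'qid:1', 'qid:0'],
        '0': [0, 0, 0],
        '0': [0.4967, 0.4967, 0.4967],
        '1': [1, 1, 1],
        '1': [0.4967, 0.4967, 0.4967],
        '2': [2, 2, 2],
        '2': [0.4967, 0.4967, 0.4967],
        '3': [3, 3, 3],
        '3': [0.4967, 0.4967, 0.4967],
        '4': [4, 4, 4]}
df = pd.DataFrame(data)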
Is there a way to save this as txt to match that format exactly?
For context, my machine learning model was trained on a dataset in this specific txt format, and I need to match that format to use the model on my own dataset.
Upvotes: 0
Views: 40
Reputation: 1041
A similar question was answered here; sklearn has a dedicated function for this: dump_svmlight_file.
For this particular case, you need to pass the qid column as query_id. Before doing that, change the qid values to plain integers (0 instead of 'qid:0') and remove the additional integer index columns, so that only the value columns remain as features:
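import pandas as pd
from sklearn.datasets import dump_svmlight_file

def df_to_libsvm(df: pd.DataFrame):
    # Features are every column except the label and the query id
    x = df.drop(columns=['label', 'qid'])
    y = df['label']
    # Assumes qid already holds plain integers (0, 1, ...), not 'qid:0' strings
    query_id = df['qid']
    # zero_based=True numbers the feature indices from 0, as in the target format
    dump_svmlight_file(X=x, y=y, query_id=query_id, f='libsvm.dat', zero_based=True)

df_to_libsvm(df)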
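If you would rather not depend on sklearn, the format is simple enough to build by hand. Below is a minimal sketch in plain Python; the function name to_libsvm_lines and its arguments are my own invention for illustration, not part of any library. It assumes the qids are already integers and that each row holds only the feature values, in column order. Unlike dump_svmlight_file, which as far as I know skips zero-valued features, this writes every value.

```python
from typing import Iterable, List, Sequence


def to_libsvm_lines(
    labels: Iterable[int],
    qids: Iterable[int],
    rows: Iterable[Sequence[float]],
) -> List[str]:
    """Format rows as '<label> qid:<qid> 0:<v0> 1:<v1> ...' (zero-based indices)."""
    lines = []
    for label, qid, values in zip(labels, qids, rows):
        # enumerate() supplies the column index that goes before each colon
        features = " ".join(f"{i}:{float(v)!r}" for i, v in enumerate(values))
        lines.append(f"{int(label)} qid:{int(qid)} {features}")
    return lines


# Sample values taken from the question: three rows, five features each
lines = to_libsvm_lines(
    labels=[2, 3, 2],
    qids=[0, 1, 0],
    rows=[[0.4967] * 5, [0.4967] * 5, [0.4967] * 5],
)
print("\n".join(lines))
# First line: 2 qid:0 0:0.4967 1:0.4967 2:0.4967 3:0.4967 4:0.4967
```

With a DataFrame you could pass df['label'], the integer qids, and the feature columns converted with .values.tolist(), then write "\n".join(lines) to a .txt file.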
Upvotes: 1