EmJ

Reputation: 4608

How to quickly convert a pandas dataframe to a list of tuples

I have a pandas Series (top_words in the code below) that prints as follows.

thi        0.969378
text       0.969378
is         0.969378
anoth      0.699030
your       0.497120
first      0.497120
book       0.497120
third      0.445149
the        0.445149
for        0.445149
analysi    0.445149

I want to convert it to a list of tuples as follows.

[["thi", 0.969378], ["text", 0.969378], ..., ["analysi", 0.445149]]

My code is as follows.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def tokenize(text):
    # split into word tokens, then reduce each token to its Porter stem
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in word_tokenize(text)]

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
print(top_words)

I tried the following two options.

list(zip(*map(top_words.get, top_words)))

This raised TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.9693779251346359] of <class 'float'>

list(top_words.itertuples(index=True))

This raised AttributeError: 'Series' object has no attribute 'itertuples'.

What is a quick way of doing this in pandas?

I am happy to provide more details if needed.

Upvotes: 1

Views: 87

Answers (1)

jezrael

Reputation: 862781

Zip the index with the values, then map each resulting tuple to a list:

a = list(map(list,zip(top_words.index,top_words)))

Or convert the index to a column, convert to a NumPy array, and then to nested lists:

a = top_words.reset_index().to_numpy().tolist()

print(a)
[['thi', 0.9693780000000001], ['text', 0.9693780000000001], 
 ['is', 0.9693780000000001], ['anoth', 0.69903], 
 ['your', 0.49712], ['first', 0.49712], ['book', 0.49712],
 ['third', 0.44514899999999996], ['the', 0.44514899999999996],
 ['for', 0.44514899999999996], ['analysi', 0.44514899999999996]]
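
The question title asks for tuples while the expected output shows nested lists, so both forms may be useful. The following is a minimal sketch, not part of the original answer, using Series.items(), which yields (index, value) pairs. The small top_words Series here is a hypothetical stand-in built by hand so the example runs without scikit-learn or NLTK.

```python
import pandas as pd

# Hypothetical stand-in for top_words: a Series with words as the index.
top_words = pd.Series(
    [0.969378, 0.699030, 0.497120],
    index=["thi", "anoth", "your"],
)

# List of lists, matching the expected output shown in the question.
as_lists = [[word, score] for word, score in top_words.items()]

# List of tuples, matching the question title.
as_tuples = list(top_words.items())

print(as_lists)
print(as_tuples)
```

Note that itertuples is defined on DataFrame rather than Series, which is why the second attempt in the question raised AttributeError.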

Upvotes: 1
