user11613770

Reputation: 45

Python pandas: unique values across a file read in chunks

Hi, I have a huge TSV file that I need to work with, so I read it in chunks using code like this:

import pandas as pd

MyList = []
Chunksize = 1000000
# Each chunk is a DataFrame of up to Chunksize rows
for chunk in pd.read_csv("wiki_editor_months.201508.tsv", sep="\t", chunksize=Chunksize):
    MyList.append(chunk)

Then I wanted to find the unique values in one of the columns (wiki). The only idea I had was this code:

MyList[0].wiki.unique()

This is problematic because I can only search one chunk at a time (there are 43 of them), and the same value can appear in different chunks. Does anyone have an idea how to use .unique() on the whole chunked file rather than on one chunk at a time?

Upvotes: 0

Views: 132

Answers (1)

Kavin Dsouza

Reputation: 989

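See if this solves your problem. It collects the unique values of each chunk into one set as the file is read, so values repeated across chunks are stored only once.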

import pandas as pd

unique_values = set()
chunk_size = 1000000
for chunk in pd.read_csv("wiki_editor_months.201508.tsv", sep="\t", chunksize=chunk_size):
    # Merge this chunk's unique values into the running set
    unique_values |= set(chunk.wiki.unique())
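If memory is still tight, the same idea can be wrapped in a small helper that parses only the column of interest. This is a minimal sketch, not a tested solution for your exact file: the function name unique_in_chunks is made up for illustration, and passing usecols (to skip the other columns) and calling dropna() (to ignore missing values) are my assumptions about what you want.

```python
import pandas as pd


def unique_in_chunks(path, column, sep="\t", chunksize=1_000_000):
    """Collect the distinct values of one column without loading the whole file."""
    seen = set()
    # usecols limits parsing to the single column we need, keeping each chunk small.
    reader = pd.read_csv(path, sep=sep, usecols=[column], chunksize=chunksize)
    for chunk in reader:
        # dropna() skips missing values (assumption); remove it if NaN should count.
        seen.update(chunk[column].dropna().unique())
    return seen
```

For the file in the question this would be called as unique_in_chunks("wiki_editor_months.201508.tsv", "wiki"). Only the running set and one chunk are held in memory at a time, instead of all 43 chunks.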

Upvotes: 1
