Reputation: 45
Hi, I have a huge TSV file that I need to work with, so I need to read it in chunks. I used code like this:
import pandas as pd

MyList = []
Chunksize = 1000000
# Read the TSV in chunks of one million rows each
for chunk in pd.read_csv("wiki_editor_months.201508.tsv", sep="\t", chunksize=Chunksize):
    MyList.append(chunk)
Then I wanted to find the unique values in one of the columns (wiki). The only idea I had is this code:
MyList[0].wiki.unique()
Using this code is problematic because I can only search one chunk at a time (there are 43 of them), and there are duplicate values across different chunks. Does anyone have an idea how to use .unique() on the whole chunked file instead of one chunk at a time?
Upvotes: 0
Views: 132
Reputation: 989
See if this solves your problem.
import pandas as pd

unique_values = set()
chunk_size = 1000000
for chunk in pd.read_csv("wiki_editor_months.201508.tsv", sep="\t", chunksize=chunk_size):
    # Union this chunk's unique values into the running set;
    # the set automatically removes duplicates across chunks
    unique_values |= set(chunk.wiki.unique())
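If memory is tight, a variation (a sketch on my part, not required for the approach above) is to load only the wiki column with usecols, so each chunk is far smaller:

import pandas as pd

# Read only the "wiki" column (name taken from your question),
# accumulating its distinct values across all chunks
unique_values = set()
for chunk in pd.read_csv(
    "wiki_editor_months.201508.tsv",
    sep="\t",
    usecols=["wiki"],
    chunksize=1000000,
):
    unique_values.update(chunk["wiki"].unique())

print(len(unique_values))  # number of distinct wiki values in the whole file

Either way, after the loop unique_values holds every distinct value from all 43 chunks.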
Upvotes: 1