Reputation: 45
Hi, I have a huge TSV file that I need to work with, so I need to read it in chunks. I used code like this:
import pandas as pd

MyList = []
Chunksize = 1000000
# Read the TSV in chunks of one million rows each
for chunk in pd.read_csv("wiki_editor_months.201508.tsv", sep="\t", chunksize=Chunksize):
    MyList.append(chunk)
Then I wanted to find the unique values in one of the columns (wiki). The only idea I had is this code:
MyList[0].wiki.unique()
Using this code is problematic because I can only search one chunk at a time (there are 43 of them), and there are duplicate values across different chunks. Does anyone have an idea how to use .unique() on the whole chunked file instead of one chunk at a time?
Upvotes: 0
Views: 132
Reputation: 989
See if this solves your problem.
import pandas as pd

unique_values = set()
chunk_size = 1000000
for chunk in pd.read_csv("wiki_editor_months.201508.tsv", sep="\t", chunksize=chunk_size):
    # Union this chunk's unique values into the running set;
    # the set automatically removes duplicates across chunks
    unique_values |= set(chunk.wiki.unique())
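If memory is tight, a variation (a sketch on my part, not required for the approach above) is to load only the wiki column with usecols, so each chunk is far smaller:

import pandas as pd

# Read only the "wiki" column (name taken from your question),
# accumulating its distinct values across all chunks
unique_values = set()
for chunk in pd.read_csv(
    "wiki_editor_months.201508.tsv",
    sep="\t",
    usecols=["wiki"],
    chunksize=1000000,
):
    unique_values.update(chunk["wiki"].unique())

print(len(unique_values))  # number of distinct wiki values in the whole file

Either way, after the loop unique_values holds every distinct value from all 43 chunks.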
Upvotes: 1