Rabab Alkhalifa

Reputation: 95

Efficient and fast merging of multiple .CSV files

I have more than 15M tweets spread across multiple .CSV files, and I need to merge their ID and text columns after dropping duplicates. What is the most efficient way to do this? My current approach takes very long to complete:

    import pandas as pd

    frames = []
    missed = 0  # number of query files that could not be read
    for q in query_list:
        hashtag = q + '.csv'
        try:
            file_data = pd.read_csv(path + hashtag, encoding='utf-8')
            frames.append(file_data)
        except FileNotFoundError:
            missed += 1
            continue

    df = pd.concat(frames)
    df = df[['id', 'text']]
    df = df.drop_duplicates()
    df.to_csv('row_tweets.csv', index=False)

Upvotes: 1

Views: 56

Answers (1)

Sam

Reputation: 1415

If you want unique (id, text) pairs, I'd just do it in pure Python, using a set for easy de-duplication and the csv module's readers and writers:

import csv

id_text_pairs = set()  # set of (id, text) tuples; a set cannot hold duplicates
missed = 0
for q in query_list:
    hashtag = q + '.csv'
    try:
        with open(path + hashtag, 'r', newline='', encoding='utf-8') as infile:
            reader = csv.DictReader(infile)
            for row in reader:
                id_text_pairs.add((row['id'], row['text']))  # this won't add duplicates
    except FileNotFoundError:
        missed += 1
        continue

with open('row_tweets.csv', 'w', newline='', encoding='utf-8') as outfile:
    col_names = ['id', 'text']
    writer = csv.DictWriter(outfile, fieldnames=col_names)
    writer.writeheader()  # first line is the 'id,text' header
    for tweet_id, text in id_text_pairs:  # 'tweet_id' avoids shadowing the builtin id()
        writer.writerow({'id': tweet_id, 'text': text})  # write each id,text pair

That should do it, and I believe it will be more efficient at de-duplicating than one huge drop_duplicates() call on the concatenated DataFrame at the end. Note that if your texts contain commas, you might want to output in tab-delimited format using the DictWriter argument delimiter='\t', or use the quotechar and quoting arguments; see the csv module documentation for details.
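For example, here is a minimal sketch of both options (the 'row_tweets.tsv' and 'row_tweets_quoted.csv' filenames and the sample row are illustrative, not from your data):

import csv

# Option 1: tab-delimited output; commas inside the tweet text need no quoting.
with open('row_tweets.tsv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['id', 'text'], delimiter='\t')
    writer.writeheader()
    writer.writerow({'id': '1', 'text': 'hello, world'})  # writes: 1<TAB>hello, world

# Option 2: keep the comma delimiter but quote every field explicitly.
with open('row_tweets_quoted.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['id', 'text'],
                            quotechar='"', quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerow({'id': '1', 'text': 'hello, world'})  # writes: "1","hello, world"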

Upvotes: 3
