Reputation: 95
I have more than 15M tweets and I need to merger the ID and Text after dropping duplicates. I need most efficient way to do this as it is taking very long to complete?
frames = []
missed = 0
for q in query_list:
hashtag = q + '.csv'
try:
file_data = pd.read_csv(path + hashtag ,encoding='utf-8')
frames.append(file_data)
except:
missed+= 1
continue
df = pd.concat(frames)
df = df[['id','text']]
df = df.drop_duplicates()
df.to_csv('row_tweets.csv',index=False)
Upvotes: 1
Views: 56
Reputation: 1415
If you want unique pairs of (id, text)
, I'd just do it in pure python using set
for easy de-duplication, and csv readers/writers:
import csv
id_text_pairs = set() # set of (id, text) pairs
missed = 0
for q in query_list:
hashtag = q + '.csv'
try:
with open(path + hashtag, 'r') as infile:
reader = csv.DictReader(infile)
for row in reader:
id_text_pairs.add( (row['id'], row['text']) ) # this won't add duplicates
except:
missed += 1
continue
with open('row_tweets.csv', 'w') as outfile:
col_names = ['id', 'text']
writer = csv.DictWriter(outfile, fieldnames=col_names)
writer.writeheader() # First line is the 'id,text' header
for id, text in id_text_pairs:
writer.writerow({'id': id, 'text': text}) # write each id,text pair
That should do it, and I believe will be more efficient in de-duping than a huge dataframe call at the end. Note that if your text
's contain commas, you might want to output in tab-delimited format using the DictWriter argument delimiter='\t'
, or the quotechar
and quoting
arguments, check out the csv documentation here.
Upvotes: 3