frogger

Reputation: 31

Why is Python Twarc2 freezing on a large file?

I am trying to run Python Twarc hydrate on a very large file of 2,339,076 records, but it keeps freezing. I have tried the script on a smaller data set and it works fine. My question is: does Twarc have a maximum number of rows it can process? If so, what is it? Do I need to separate my data into smaller subsections?

I have tried the terminal command:

twarc2 hydrate 2020-03-22_clean-dataset_csv.csv > hydrated.jsonl

I have tried it on a smaller file and it works fine.

I have tried searching to find whether there is a limit to the number of rows Twarc can process, but I can't find an answer.

Upvotes: 2

Views: 54

Answers (1)

Caridorc

Reputation: 6661

You can use the built-in split command:

split -n l/10 -d 2020-03-22_clean-dataset_csv.csv subset_

This will create 10 files with names like subset_00, subset_01, etc., each containing approximately one-tenth of the original data.
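If split is not available (for example on Windows), the same chunking can be sketched in Python. This is a hedged illustration: the function name split_file, the chunk size, and the subset_ prefix are all made up for this example, not part of twarc.

```python
def split_file(path, lines_per_chunk=250_000, prefix="subset_"):
    """Split a file of tweet IDs into numbered chunk files of at most
    `lines_per_chunk` lines each. Returns the number of chunks written."""
    chunk, count = 0, 0
    out = open(f"{prefix}{chunk:02d}", "w")
    with open(path) as src:
        for line in src:
            if count == lines_per_chunk:
                # Current chunk is full: close it and start the next one.
                out.close()
                chunk += 1
                count = 0
                out = open(f"{prefix}{chunk:02d}", "w")
            out.write(line)
            count += 1
    out.close()
    return chunk + 1
```

You could then call, say, split_file("2020-03-22_clean-dataset_csv.csv") and hydrate each subset_NN file in turn.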

You can then run Twarc hydrate on each subset separately, like this:

twarc2 hydrate subset_00 > hydrated_00.jsonl

And then you can read the different .jsonl files one by one, or look for some way to merge them. (Warning: untested, as I cannot install twarc2.)
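Since .jsonl is just one JSON record per line, merging the per-subset outputs is plain file concatenation. A small sketch, assuming the hydrated files follow the hydrated_NN.jsonl naming from above (the function name merge_jsonl and the file names are illustrative):

```python
import glob

def merge_jsonl(pattern="hydrated_*.jsonl", out_path="hydrated.jsonl"):
    """Concatenate all .jsonl files matching `pattern` into `out_path`,
    in sorted filename order. Returns the total number of lines written."""
    n = 0
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as src:
                for line in src:
                    out.write(line)
                    n += 1
    return n
```

Sorting the matched names keeps the merged file in the same order as the numbered subsets.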

Upvotes: 0
