Reputation: 31
I am trying to run Python Twarc hydrate on a very large file of 2,339,076 records, but it keeps freezing. I have tried the script on a smaller data set and it works fine. My question is: does Twarc have a maximum number of rows it can process? If so, what is it? Do I need to separate my data into smaller subsections?
I have tried the terminal command:
twarc2 hydrate 2020-03-22_clean-dataset_csv.csv > hydrated.jsonl
I have tried it on a smaller file and it works fine.
I have searched for whether there is a limit to the number of rows Twarc can process, but I can't find an answer.
Upvotes: 2
Views: 54
Reputation: 6661
You can use the built-in split command to break the file into smaller pieces:
split -n l/10 -d 2020-03-22_clean-dataset_csv.csv subset_
This will create 10 files with names like subset_00, subset_01, etc., each containing approximately one-tenth of the original data.
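As a quick sanity check that split loses no rows, here is a small self-contained run of the same command (ids.csv below is a generated stand-in for the real 2,339,076-row file):

```shell
# Generate a small stand-in ID file (the real one has 2,339,076 rows).
seq 1 100 > ids.csv

# Split into 10 numerically suffixed pieces without breaking lines,
# exactly as in the command above (GNU split).
split -n l/10 -d ids.csv subset_

# Reassembling the pieces in name order reproduces the original file.
cat subset_* | cmp - ids.csv && echo "no rows lost"
```

Because -n l/10 splits on line boundaries, no tweet ID is ever cut in half across two subset files.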
You can then run Twarc hydrate on each subset separately, like this:
twarc2 hydrate subset_00 > hydrated_00.jsonl
And then you can read the resulting .jsonl files one by one, or merge them into a single file. (Warning: untested, as I cannot install twarc2.)
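Since JSONL is just one JSON object per line, merging the per-subset outputs is a plain concatenation. A minimal sketch (the hydrated_* names follow the example above; the two printf lines create small stand-in files so the command can be demonstrated end to end):

```shell
# Stand-in files; in practice these are the outputs of the
# twarc2 hydrate runs above.
printf '{"id": "1"}\n' > hydrated_00.jsonl
printf '{"id": "2"}\n' > hydrated_01.jsonl

# Concatenate all per-subset results into one file. The output name
# deliberately does not match the hydrated_*.jsonl glob, so cat
# never reads its own output.
cat hydrated_*.jsonl > all_hydrated.jsonl

wc -l all_hydrated.jsonl  # should report one line per hydrated record
```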
Upvotes: 0