Fitz

Common Crawl: PySpark, unable to use it

As part of an internship, I have to download Hadoop and Spark and test them on some Common Crawl data. I tried to follow the steps on this page: https://github.com/commoncrawl/cc-pyspark#get-sample-data (I installed Spark 3.0.0 on my computer), but when I try it on my machine (I use Ubuntu) I get a lot of errors and it doesn't seem to work.

In particular, when I run the program "server_count.py", I get many lines saying something like: Failed to open /home/root/CommonCrawl/... and then the program suddenly stops with: MapOutputTrackerMasterEndpoint stopped. Do you have any idea how to fix this? (It's the first time I've used these tools.) Sorry for my English, and thank you in advance for your response.
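For context, a "Failed to open ..." message from cc-pyspark usually means the WARC files listed in the job's input file were never downloaded, or the listed paths don't match the local filesystem. Below is a minimal sketch (not part of the original question) that checks each path in the listing file; the file name input/test_warc.txt is the one used in the cc-pyspark README and is an assumption here, so adjust it if your listing file differs.

```python
# Minimal sketch: verify that the WARC paths listed in cc-pyspark's
# input file exist locally. Assumes the listing file name
# "input/test_warc.txt" from the cc-pyspark README.
import os

with open("input/test_warc.txt") as listing:
    for line in listing:
        path = line.strip()
        if not path:
            continue
        # cc-pyspark listing files may prefix local paths with "file:";
        # strip it before checking the filesystem.
        local_path = path[len("file:"):] if path.startswith("file:") else path
        status = "OK" if os.path.isfile(local_path) else "MISSING"
        print(f"{status}  {local_path}")
```

If paths turn up MISSING, the repository's get-data.sh script (mentioned in the README linked above) downloads the sample data that the listing files reference.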


Answers (0)
