Reputation: 1029
I have files spread across different folders and need to merge them using PySpark. The merge itself can be done with the code below, but it needs to read the files present in each folder:
sc.textFile(<path>).coalesce(1).saveAsTextFile(<path>)
Example:
/user/home/m_f012345/part0000, part0001, part0002
/user/home/m_f00120/part0000, part0001, part0002
/user/home/m_f123120/part0000, part0001, part0002
After merging, each folder should contain a single file:
/user/home/m_f012345/part0000
/user/home/m_f00120/part0000
/user/home/m_f123120/part0000
Note: I might have more than 50 folders, and the folder names do not follow any pattern; they are random.
Upvotes: 0
Views: 410
Reputation: 1029
The scenario above is possible with the code below.
from pyspark import SparkContext, SparkConf
import os
import shutil

conf = SparkConf().setAppName("FileSystem").setMaster("local")
sc = SparkContext(conf=conf)

path = "/user/home/"
dummy = path + "test"  # prefix for the temporary Spark output folders

# collect the full path of every folder under path
dirs = [os.path.join(path, name) for name in os.listdir(path)]

count = 0
for d in dirs:
    # read all part files in the folder and write them out as one partition
    sc.textFile(d).coalesce(1).saveAsTextFile(dummy + str(count))
    # move the single merged part file back into the original folder
    # (note: the original part files in d are left in place; delete them
    # first if the folder should end up with only the merged file)
    shutil.move(dummy + str(count) + "/part-00000", d)
    # clean up the temporary Spark output folder
    shutil.rmtree(dummy + str(count))
    count += 1
Upvotes: 1