LUZO

Reputation: 1029

How to merge files contained in different folders using PySpark

I have files spread across different folders and need to merge the files within each folder using PySpark. The merge itself can be done with the code below, but it has to be applied to the files in every folder:

sc.textFile(<path>).coalesce(1).saveAsTextFile(<path>)

Example:

/user/home/m_f012345/part0000, part0001, part0002
/user/home/m_f00120/part0000, part0001, part0002
/user/home/m_f123120/part0000, part0001, part0002

After merging the files in each folder:

/user/home/m_f012345/part0000
/user/home/m_f00120/part0000
/user/home/m_f123120/part0000

Note: there may be more than 50 folders, and the folder names follow no fixed pattern; they are effectively random.

Upvotes: 0

Views: 410

Answers (1)

LUZO

Reputation: 1029

The scenario above can be handled with the code below.

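from pyspark import SparkConf, SparkContext
import os
import shutil

conf = SparkConf().setAppName("FileSystem").setMaster("local")
sc = SparkContext(conf=conf)

path = "/user/home/"
dummy = path + "test"  # prefix for the temporary output directories

# Collect every sub-folder of path.
# os.listdir only sees the local filesystem, not HDFS.
folders = [os.path.join(path, name) for name in os.listdir(path)
           if os.path.isdir(os.path.join(path, name))]

for count, folder in enumerate(folders):
    tmp = dummy + str(count)
    # Merge all files of the folder into a single partition
    sc.textFile(folder).coalesce(1).saveAsTextFile(tmp)
    # Drop the original part files so only the merged file remains
    shutil.rmtree(folder)
    os.makedirs(folder)
    shutil.move(os.path.join(tmp, "part-00000"), folder)
    shutil.rmtree(tmp)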
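If the folders are on the local filesystem and small enough that Spark is not strictly needed, the same per-folder merge can be sketched with the standard library alone. This is only an illustration under those assumptions: `merge_folder` and `merge_all` are hypothetical helper names, files are concatenated in name order, and each file is assumed to end with a newline.

```python
import os
import shutil
import tempfile


def merge_folder(folder, merged_name="part-00000"):
    """Replace all regular files in `folder` with one concatenated file."""
    names = sorted(
        n for n in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, n))
    )
    # Write the merged data outside the folder first, so the originals
    # are still intact if something fails while concatenating.
    fd, tmp_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as out:
        for n in names:
            with open(os.path.join(folder, n), "rb") as src:
                shutil.copyfileobj(src, out)
    # Only now remove the originals and put the merged file in place.
    for n in names:
        os.remove(os.path.join(folder, n))
    target = os.path.join(folder, merged_name)
    shutil.move(tmp_path, target)
    return target


def merge_all(root):
    """Merge the files of every immediate sub-folder of `root`."""
    merged = []
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            merged.append(merge_folder(folder))
    return merged
```

Calling `merge_all("/user/home/")` would then leave one `part-00000` file in each sub-folder, regardless of how the folders are named.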

Upvotes: 1
