Reputation: 310
I'm trying to read CSV files from a directory with a particular pattern.
I want to match all files whose names contain the string "logs_455DD_33".
It should match anything like:
machine_logs_455DD_33.csv
logs_455DD_33_2018.csv
machine_logs_455DD_33_2018.csv
I've tried the following glob pattern, but it doesn't match files with the above format:
file = "hdfs://data/logs/{*}logs_455DD_33{*}.csv"
df = spark.read.csv(file)
Upvotes: 1
Views: 6849
Reputation: 21
I had to do something similar in my PySpark program, where I needed to pick a file in HDFS by cycle_date, and I did it like this:
df=spark.read.parquet(pathtoFile + "*" + cycle_date + "*")
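The same wildcard approach should work for the question's CSV files; a minimal sketch, assuming the directory from the question (hdfs://data/logs/):
# Hadoop-style globbing: a bare * matches any sequence of characters,
# so this picks up every CSV whose name contains "logs_455DD_33"
file = "hdfs://data/logs/*logs_455DD_33*.csv"
df = spark.read.csv(file)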
Upvotes: 1
Reputation: 1030
You could use a subprocess to list the files in HDFS and grep the ones matching your pattern:
import subprocess
# Define path and pattern to match
dir_in = "data/logs"
your_pattern = "logs_455DD_33"
# List the directory on HDFS, keep the path column, and filter on the pattern
args = "hdfs dfs -ls " + dir_in + " | awk '{print $8}' | grep " + your_pattern
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
# Get the output, decode it from bytes, and split into one path per line
s_output, s_err = proc.communicate()
l_file = [f for f in s_output.decode("utf-8").split("\n") if f]
# Read the matching files (note: each iteration rebinds df to the current file)
for file in l_file:
    df = spark.read.csv(file)
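Note that the loop above leaves df bound to the last file only. If the goal is a single DataFrame covering all matches, spark.read.csv also accepts a list of paths, so the loop can be replaced with one call:
# Read every matched file into a single DataFrame
df = spark.read.csv(l_file)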
Upvotes: 0