Balajee Venkatesh

Reputation: 1099

Listing files by particular expression from GCS in Java

Has anyone implemented this before? It's the equivalent of ls -ltr *xyz* in Unix, and I would like to do the same in my Cloud Dataflow code. Any leads would be appreciated.

Thank you.

Upvotes: 1

Views: 5476

Answers (3)

Tuxdude

Reputation: 49473

It is possible to do this filtering on the client side. Here is an example using the google-cloud Java client library to access the Google Cloud Storage APIs.

The example below lists all files in the root directory of the bucket whose names match the given regular expression pattern.

I've used regular expressions instead of the glob pattern that shell commands like ls support since regular expressions are more flexible.

I would recommend you go through the java library documentation for google-cloud.

Example

    import com.google.api.gax.paging.Page;
    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.Storage.BlobListOption;
    import com.google.cloud.storage.StorageOptions;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;
    
    /**
     * An example which lists the files in the specified GCS bucket matching the
     * specified regular expression pattern.
     *
     * <p>Run it as PROGRAM_NAME <BUCKET_NAME> <REGEX_MATCH_PATTERN>
     */
    public class ListBlobsSample {
      public static void main(String[] args) throws IOException {
        // Instantiates a Storage client
        Storage storage = StorageOptions.getDefaultInstance().getService();
    
        // The name of the GCS bucket
        String bucketName = args[0];
        // The regular expression for matching blobs in the GCS bucket.
        // Example: '.*abc.*'
        String matchExpr = args[1];
    
        List<String> results = listBlobs(storage, bucketName, Pattern.compile(matchExpr));
        System.out.println("Results: " + results.size() + " items.");
        for (String result : results) {
          System.out.println("Blob: " + result);
        }
      }
    
      // Lists all blobs in the bucket matching the expression.
      // Specify a regex here. Example: '.*abc.*'
      private static List<String> listBlobs(Storage storage, String bucketName, Pattern matchPattern)
          throws IOException {
        List<String> results = new ArrayList<>();
    
        // Only list blobs in the current directory
        // (otherwise you also get results from the sub-directories).
        BlobListOption listOptions = BlobListOption.currentDirectory();
        Page<Blob> blobs = storage.list(bucketName, listOptions);
        for (Blob blob : blobs.iterateAll()) {
          if (!blob.isDirectory() && matchPattern.matcher(blob.getName()).matches()) {
            results.add(blob.getName());
          }
        }
        return results;
      }
    }
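
If you would rather keep the shell-style glob from the question (*xyz*) than write a regex by hand, one option is to translate the glob into a Pattern and pass that to listBlobs above. This is only a minimal sketch: globToPattern is a hypothetical helper (not part of the google-cloud library), and it assumes you only need '*' and '?', with every other character taken literally.

```java
import java.util.regex.Pattern;

public class GlobToRegex {
  // Hypothetical helper: translates a simple glob into a regex Pattern.
  // Assumption: only '*' (any run of characters) and '?' (any single
  // character) are special; everything else is matched literally.
  public static Pattern globToPattern(String glob) {
    StringBuilder regex = new StringBuilder();
    StringBuilder literal = new StringBuilder();
    for (char c : glob.toCharArray()) {
      if (c == '*' || c == '?') {
        // Flush the pending literal text, quoted so that characters
        // such as '.' are not interpreted as regex syntax.
        if (literal.length() > 0) {
          regex.append(Pattern.quote(literal.toString()));
          literal.setLength(0);
        }
        regex.append(c == '*' ? ".*" : ".");
      } else {
        literal.append(c);
      }
    }
    if (literal.length() > 0) {
      regex.append(Pattern.quote(literal.toString()));
    }
    return Pattern.compile(regex.toString(), Pattern.DOTALL);
  }

  public static void main(String[] args) {
    Pattern p = globToPattern("*xyz*");
    System.out.println(p.matcher("abcxyz.txt").matches()); // true
    System.out.println(p.matcher("abc.txt").matches());    // false
  }
}
```

Unlike a shell glob, the '*' here also matches '/', so it is best combined with the current-directory listing option used in the example above.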

Using just prefix matching

If you only need to match a prefix of the object names, the Objects: list API supports that directly.

You need to specify the prefix query parameter in the request when doing GET https://www.googleapis.com/storage/v1/b/bucket/o. This is also supported by the Java client library (you specify the prefix when building the BlobListOption you pass to storage.list()).

prefix (string): Filter results to objects whose names begin with this prefix.

gsutil

gsutil supports such wildcard queries, but it does the filtering solely on the client side (and in some cases it issues multiple requests to do so).

Upvotes: 2

Fza

Reputation: 1003

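The following may not be exactly what you need for your use case, but it helps if you want to narrow down the results by a certain prefix first and then apply a regex to the remaining names to get your final matches.

    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Bucket;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    Storage storage = StorageOptions.getDefaultInstance().getService();
    Bucket bucket = storage.get(bucketName);
    // Narrow the listing on the server side to names starting with the prefix.
    Storage.BlobListOption blobListOption = Storage.BlobListOption.prefix(prefixPattern);
    for (Blob blob : bucket.list(blobListOption).iterateAll()) {
        System.out.println(blob);
    }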

Upvotes: 0

Mike Schwartz

Reputation: 12145

GCS supports prefix queries, so you can efficiently list xyz*. To list *xyz*, however, you would have to list the entire bucket and filter on the client.
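
To make that distinction concrete, here is a rough sketch of how you might work out which part of a glob can be pushed to the server as a prefix. literalPrefix is a hypothetical helper, and it assumes '*' and '?' are the only wildcards. Whatever it returns could be handed to Storage.BlobListOption.prefix(...), with the rest of the pattern still checked on the client; an empty result means there is nothing to narrow by and the whole bucket has to be listed.

```java
public class GlobPrefix {
  // Hypothetical helper: returns the part of the glob before the first
  // wildcard. Assumption: only '*' and '?' are treated as wildcards.
  public static String literalPrefix(String glob) {
    int i = 0;
    while (i < glob.length() && glob.charAt(i) != '*' && glob.charAt(i) != '?') {
      i++;
    }
    return glob.substring(0, i);
  }

  public static void main(String[] args) {
    // "xyz*" has a usable prefix, so the listing can be narrowed server side.
    System.out.println("[" + literalPrefix("xyz*") + "]");  // [xyz]
    // "*xyz*" starts with a wildcard, so no prefix is available.
    System.out.println("[" + literalPrefix("*xyz*") + "]"); // []
  }
}
```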

Upvotes: 0
