Sibgha

Reputation: 489

Delete duplicate JSON files based on one of the attributes

I have two directories on my Linux system, /dir and /dir2.

Both contain more than 4000 JSON files. The content of every file looks like this:

{
   "someattribute":"someValue",
   "url":[
      "https://www.someUrl.com/xyz"
   ],
   "someattribute":"someValue"
}

Note that url is an array, but it always contains one element (the url).

The url makes each file unique. If a file with the same url exists in both /dir and /dir2, then it's a duplicate and needs to be deleted.

I want to automate this operation, preferably using a shell command. Any suggestions on how I should go about it?
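
For reference, since url is always a single-element array, the value that decides uniqueness can be pulled out with jq like this (file.json is just a placeholder name):

jq -r '.url[0]' file.json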

Upvotes: 3

Views: 1310

Answers (4)

oguz ismail

Reputation: 50795

Use jq to get a list of duplicates:

jq -nrj '[
  foreach inputs.url as [$url] ({};
    .[$url] += 1;
    if .[$url] > 1 then input_filename
    else empty end
  )
] | join("\u0000")' /{dir1,dir2}/*.json

And to remove them, pipe the above command's output to xargs:

xargs -0 rm --
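
Combined into a single pipeline (a sketch; the question's directories are /dir and /dir2, so adjust the brace expansion to match, and swap rm -- for echo first if you want to preview the list):

jq -nrj '[
  foreach inputs.url as [$url] ({};
    .[$url] += 1;
    if .[$url] > 1 then input_filename
    else empty end
  )
] | join("\u0000")' /{dir1,dir2}/*.json | xargs -0 rm --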

Upvotes: 5

Dudi Boy

Reputation: 4900

Here is a quick and simple awk script that does all the work from the base directory.

The awk script, named script1.awk:

/https/{
    if ($1 in urlArr) {
        cmd = "rm \"" FILENAME "\"";
        print cmd;
        # system(cmd);
    } else {
        urlArr[$1] = FILENAME;
    }
}

Initially run the script with the following command:

awk -f script1.awk dir{1,}/*.json

When you are ready to remove the duplicate JSON files, uncomment the line containing system(cmd) and run it again.
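
To get a feel for how many duplicates the dry run finds before enabling deletion, you can simply count the printed rm commands (assuming the same dir{1,} layout as above):

awk -f script1.awk dir{1,}/*.json | wc -l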

Here are some explanations:

  1. The awk command runs the script script1.awk on all JSON files in the subdirectories dir and dir1.

  2. The script scans each file; on every line containing https, the URL text ends up in field $1.

    If the value in $1 already exists in the associative array urlArr, it prints/removes the file.

    Otherwise, it adds the current file to the associative array urlArr.

Hope you like this simple solution.

Upvotes: 0

Samuel Philipp

Reputation: 11050

You can use the following Java approach to achieve this:

Set<String> urls = new HashSet<>();
try (Stream<Path> paths = Files.list(Paths.get("/path/to/your/folder"))) {
    paths
            .map(path -> new FileInfo(path, extractUrl(path)))
            .filter(info -> info.getUrl() != null)
            .filter(info -> !urls.add(info.getUrl()))
            .forEach(info -> {
                try {
                    Files.delete(info.getPath());
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
} catch (IOException e) {
    e.printStackTrace();
}

This uses the following FileInfo class:

public class FileInfo {
    private Path path;
    private String url;
    // constructor and getter
}

First it reads all the files in the given directory and extracts the URL from each one. Duplicates are filtered out with the help of the HashSet, and at the end every file containing a duplicate URL is deleted.

There are multiple options to extract the url from each file:

Quick and dirty using a regex:

private String extractUrl(Path path) {
    try {
        String content = String.join("\n", Files.readAllLines(path));
        Pattern pattern = Pattern.compile("\"url\".+\\s+\"(?<url>[^\\s\"]+)\"");
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            return matcher.group("url");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

A better solution would be to use a JSON parser library like Jackson:

private String extractUrl(Path path) {
    try (BufferedReader reader = Files.newBufferedReader(path)) {
        ObjectMapper mapper = new ObjectMapper();
        MyObject object = mapper.readValue(reader, MyObject.class);
        return object.getUrls().stream().findFirst().orElse(null);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

This uses an Object representation of the file content:

public class MyObject {
    @JsonProperty("url")
    private List<String> urls;
    // getter and setter
}

But in the end, the most performant solution would probably be a shell script.

Upvotes: 1

Shawn

Reputation: 52549

Here's a quick and dirty bash script that uses jq to extract the URL from the JSON files, and awk to detect and delete duplicates:

#!/bin/bash

rm -f urls-dir1.txt urls-dir2.txt

# Record "filename<TAB>url" for every JSON file in each directory
for file in dir1/*.json; do
    printf "%s\t%s\n" "$file" "$(jq -r '.url[0]' "$file")" >> urls-dir1.txt
done
for file in dir2/*.json; do
    printf "%s\t%s\n" "$file" "$(jq -r '.url[0]' "$file")" >> urls-dir2.txt
done

# Delete every file in dir2 whose URL already appears in dir1
awk -F $'\t' 'FNR == NR  { urls[$2] = 1; next }
              $2 in urls { system("rm -f \"" $1 "\"") }' urls-dir1.txt urls-dir2.txt

rm -f urls-dir1.txt urls-dir2.txt

It assumes that the duplicates to be deleted are the ones in dir2, and that the files in dir1 should be left untouched.
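
If you want to review the candidates before deleting anything, a print-only variant of the awk step (run in place of the deleting one, while the urls-dir*.txt files still exist) just lists the paths:

awk -F $'\t' 'FNR == NR  { urls[$2] = 1; next }
              $2 in urls { print $1 }' urls-dir1.txt urls-dir2.txt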

Upvotes: 1
