Reputation: 489
I have two directories on my Linux system, /dir and /dir2.
Both contain more than 4000 JSON files. The JSON content of every file looks like this:
{
  "someattribute": "someValue",
  "url": [
    "https://www.someUrl.com/xyz"
  ],
  "someOtherAttribute": "someValue"
}
Note that url is an array, but it always contains one element (the url).
The url makes the file unique. If there is a file with the same url in /dir and /dir2, then it is a duplicate and it needs to be deleted.
I want to automate this operation, preferably using a shell command. Any suggestions on how I should go about it?
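For reference, the single URL inside one of these files can be read with jq (a sketch; the file name is hypothetical and jq must be installed):
jq -r '.url[0]' /dir/example.json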
Upvotes: 3
Views: 1310
Reputation: 50795
Use jq to get a list of duplicates:
jq -nrj '[
  foreach inputs.url as [$url] ({};
    .[$url] += 1;
    if .[$url] > 1 then input_filename
    else empty end
  )
] | join("\u0000")' /{dir1,dir2}/*.json
And to remove them, pipe the above command's output to xargs:
xargs -0 rm --
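Put together, that is a single pipeline (a sketch reusing the exact filter above; note that xargs -0 is a GNU/BSD extension, and you can swap rm -- for echo to preview what would be deleted):
jq -nrj '[
  foreach inputs.url as [$url] ({};
    .[$url] += 1;
    if .[$url] > 1 then input_filename
    else empty end
  )
] | join("\u0000")' /{dir1,dir2}/*.json | xargs -0 rm --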
Upvotes: 5
Reputation: 4900
Here is a quick and simple awk script that does all the work from the base dir.
The awk script, named script1.awk:
/https/ {
    if ($1 in urlArr) {
        cmd = "rm " FILENAME;
        print cmd;
        # system(cmd);
    } else {
        urlArr[$1] = FILENAME;
    }
}
Initially run the script with the following command:
awk -f script1.awk dir{1,}/*.json
When ready to remove the duplicate JSON files, just uncomment the 5th line (the line containing system(cmd)) and run it again.
Here are some explanations:
The awk command runs the script script1.awk on all JSON files in the subdirectories dir and dir1.
The script traverses each file and, on every line containing https, extracts the URL text into field $1 (see the quick check below).
If $1 already exists in the associative array urlArr, the file is printed/removed.
Otherwise, the current file is added to the associative array urlArr.
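To see exactly what ends up in $1 for each file, a quick check like this helps (a sketch; it only prints and deletes nothing):
awk '/https/ { print FILENAME, $1 }' dir{1,}/*.json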
Hope you like this simple solution.
Upvotes: 0
Reputation: 11050
You can use the following Java approach to achieve this:
Set<String> urls = new HashSet<>();
try (Stream<Path> paths = Files.list(Paths.get("/path/to/your/folder"))) {
    paths
        .map(path -> new FileInfo(path, extractUrl(path)))
        .filter(info -> info.getUrl() != null)
        .filter(info -> !urls.add(info.getUrl()))
        .forEach(info -> {
            try {
                Files.delete(info.getPath());
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
} catch (IOException e) {
    e.printStackTrace();
}
This uses the following FileInfo class:
public class FileInfo {
    private Path path;
    private String url;
    // constructor and getter
}
First of all, it reads all files in the given directory and extracts the URL from each. Duplicates are filtered out with the help of the HashSet: urls.add(...) returns false when the URL has already been seen, so only files with already-seen URLs pass the last filter. At the end, all files containing duplicate URLs are deleted.
There are multiple options to extract the url from each file.
Quick and dirty, using a regex:
private String extractUrl(Path path) {
    try {
        String content = String.join("\n", Files.readAllLines(path));
        Pattern pattern = Pattern.compile("\"url\".+\\s+\"(?<url>[^\\s\"]+)\"");
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            return matcher.group("url");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
A better solution would be to use a JSON parser library like Jackson:
private String extractUrl(Path path) {
    try (BufferedReader reader = Files.newBufferedReader(path)) {
        ObjectMapper mapper = new ObjectMapper();
        MyObject object = mapper.readValue(reader, MyObject.class);
        return object.getUrls().stream().findFirst().orElse(null);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
This uses an Object representation of the file content:
public class MyObject {
    @JsonProperty("url")
    private List<String> urls;
    // getter and setter
}
But in the end, the most performant solution would probably be a shell script.
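For comparison, a minimal shell sketch of that idea (assuming jq is installed; it only lists the URLs that occur in more than one file and deletes nothing):
jq -r '.url[0]' /dir/*.json /dir2/*.json | sort | uniq -d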
Upvotes: 1
Reputation: 52549
Here's a quick and dirty bash script that uses jq to extract the URL from the JSON files, and awk to detect and delete duplicates:
#!/bin/bash
rm -f urls-dir1.txt urls-dir2.txt
for file in dir1/*.json; do
    printf "%s\t%s\n" "$file" "$(jq -r '.url[0]' "$file")" >> urls-dir1.txt
done
for file in dir2/*.json; do
    printf "%s\t%s\n" "$file" "$(jq -r '.url[0]' "$file")" >> urls-dir2.txt
done
awk -F $'\t' 'FNR == NR { urls[$2] = 1; next }
              $2 in urls { system("rm -f \"" $1 "\"") }' urls-dir1.txt urls-dir2.txt
rm -f urls-dir1.txt urls-dir2.txt
It assumes that the duplicates to be deleted are the files in dir2, and that the files in dir1 should be left untouched.
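If you would rather check what will be removed before deleting anything, the awk step can be swapped for a print-only variant (a sketch; run it in place of the deleting awk command, before the temporary files are cleaned up):
awk -F $'\t' 'FNR == NR { urls[$2] = 1; next }
              $2 in urls { print "rm -f \"" $1 "\"" }' urls-dir1.txt urls-dir2.txt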
Upvotes: 1