Reputation: 1827
I have a text file that contains URLs and emails. I need to extract all of them from the file. Each URL and email can occur more than once, but the result shouldn't contain duplicates. I can extract all URLs using the following code:
Files.lines(filePath)
    .map(urlPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();
I can extract all emails using the following code:
Files.lines(filePath)
    .map(emailPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();
Can I extract all URLs and emails while reading the stream returned by Files.lines(filePath) only once?
Something like splitting the stream of lines into a stream of URLs and a stream of emails.
Upvotes: 13
Views: 1365
Reputation: 298539
You can perform the matching within a Collector:
Map<String, Set<String>> map = Files.lines(filePath)
    .collect(HashMap::new,
        (hm, line) -> {
            Matcher m = emailPattern.matcher(line);
            if(m.matches())
                hm.computeIfAbsent("mail", x -> new HashSet<>()).add(line);
            else if(m.usePattern(urlPattern).matches())
                hm.computeIfAbsent("url", x -> new HashSet<>()).add(line);
        },
        (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
            (s1, s2) -> { s1.addAll(s2); return s1; }))
    );
Set<String> mail = map.get("mail"), url = map.get("url");
Note that this can easily be adapted to find multiple matches within a line:
Map<String, Set<String>> map = Files.lines(filePath)
    .collect(HashMap::new,
        (hm, line) -> {
            Matcher m = emailPattern.matcher(line);
            while(m.find())
                hm.computeIfAbsent("mail", x -> new HashSet<>()).add(m.group());
            m.usePattern(urlPattern).reset();
            while(m.find())
                hm.computeIfAbsent("url", x -> new HashSet<>()).add(m.group());
        },
        (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
            (s1, s2) -> { s1.addAll(s2); return s1; }))
    );
Upvotes: 4
Reputation: 100309
You can use the partitioningBy collector, though it's still not a very elegant solution.
Map<Boolean, List<String>> map = Files.lines(filePath)
.filter(str -> urlPattern.matcher(str).matches() ||
emailPattern.matcher(str).matches())
.distinct()
.collect(Collectors.partitioningBy(str -> urlPattern.matcher(str).matches()));
List<String> urls = map.get(true);
List<String> emails = map.get(false);
If you don't want to apply the regexp twice, you can do it using an intermediate pair object (for example, SimpleEntry):
public static String classify(String str) {
return urlPattern.matcher(str).matches() ? "url" :
emailPattern.matcher(str).matches() ? "email" : null;
}
Map<String, Set<String>> map = Files.lines(filePath)
.map(str -> new AbstractMap.SimpleEntry<>(classify(str), str))
.filter(e -> e.getKey() != null)
.collect(Collectors.groupingBy(e -> e.getKey(),
Collectors.mapping(e -> e.getValue(), Collectors.toSet())));
Using my free StreamEx library, the last step would be shorter:
Map<String, Set<String>> map = StreamEx.of(Files.lines(filePath))
.mapToEntry(str -> classify(str), Function.identity())
.nonNullKeys()
.grouping(Collectors.toSet());
Upvotes: 10
Reputation: 20658
The overall question should be: Why would you want to stream only once?
Extracting the URLs and extracting the emails are different operations and thus should be handled in their own streaming operations. Even if the underlying stream source contains hundreds of thousands of records, the time for iteration is negligible compared to the mapping and filtering operations.
The only thing you should consider as a possible performance issue is the IO operation. The cleanest solution therefore is to read the file only once and then stream on a resulting collection twice:
List<String> allLines = Files.readAllLines(filePath);
allLines.stream() ... // here do the URLs
allLines.stream() ... // here do the emails
Of course this requires some memory.
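A minimal sketch of this two-pass approach, with the elided parts filled in. The patterns here are simplified placeholders (not robust URL/email regexes), and the in-memory list stands in for Files.readAllLines(filePath):

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TwoPass {
    // hypothetical, deliberately simple patterns for illustration
    static final Pattern urlPattern = Pattern.compile("https?://\\S+");
    static final Pattern emailPattern = Pattern.compile("\\S+@\\S+\\.\\S+");

    // stream the already-read lines once per pattern; the Set removes duplicates
    static Set<String> extract(List<String> allLines, Pattern p) {
        return allLines.stream()
                .filter(s -> p.matcher(s).matches())
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // stands in for Files.readAllLines(filePath): the file is read only once
        List<String> allLines = List.of(
                "http://example.com", "bob@example.com", "http://example.com");
        Set<String> urls = extract(allLines, urlPattern);
        Set<String> emails = extract(allLines, emailPattern);
        System.out.println(urls);
        System.out.println(emails);
    }
}
```

The IO happens exactly once (in readAllLines); the two stream passes then run over the in-memory list.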
Upvotes: 0
Reputation: 3557
Since you can't reuse a Stream, the only option would be to "do it manually" I think.
Files.lines(filePath).forEach(s -> /* match and sort into two sets */ );
If there's another solution for this though I'd be happy to learn about it!
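For illustration, here is one way the "match and sort" part could be filled in. The patterns are simplified placeholders, and the in-memory stream stands in for Files.lines(filePath):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class ManualSplit {
    // hypothetical, deliberately simple patterns for illustration
    static final Pattern urlPattern = Pattern.compile("https?://\\S+");
    static final Pattern emailPattern = Pattern.compile("\\S+@\\S+\\.\\S+");

    // "match and sort into two sets", spelled out
    static void classify(String s, Set<String> urls, Set<String> emails) {
        if (urlPattern.matcher(s).matches())
            urls.add(s);
        else if (emailPattern.matcher(s).matches())
            emails.add(s);
    }

    public static void main(String[] args) {
        Set<String> urls = new HashSet<>();
        Set<String> emails = new HashSet<>();
        // stands in for Files.lines(filePath)
        Stream.of("http://example.com", "bob@example.com", "http://example.com")
              .forEach(s -> classify(s, urls, emails));
        System.out.println(urls);   // the Set removes duplicates
        System.out.println(emails);
    }
}
```

Note that mutating external collections from forEach is only safe for a sequential stream; for a parallel stream, a collect-based solution like the accepted answer's is preferable.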
Upvotes: 1