york.beta

Reputation: 1827

Split java.util.stream.Stream

I have a text file that contains URLs and emails. I need to extract all of them from the file. Each URL and email can be found more than once, but the result shouldn't contain duplicates. I can extract all URLs using the following code:

Files.lines(filePath)
    .map(urlPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

I can extract all emails using the following code:

Files.lines(filePath)
    .map(emailPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

Can I extract all URLs and emails by reading the stream returned by Files.lines(filePath) only once? Something like splitting the stream of lines into a stream of URLs and a stream of emails.

Upvotes: 13

Views: 1365

Answers (4)

Holger

Reputation: 298539

You can perform the matching within a Collector:

Map<String, Set<String>> map = Files.lines(filePath)
    .collect(HashMap::new,
        (hm, line) -> {
            // try the email pattern first, then reuse the same Matcher for the URL pattern
            Matcher m = emailPattern.matcher(line);
            if (m.matches())
                hm.computeIfAbsent("mail", x -> new HashSet<>()).add(line);
            else if (m.usePattern(urlPattern).matches())
                hm.computeIfAbsent("url", x -> new HashSet<>()).add(line);
        },
        // combiner merges the per-thread maps when the stream runs in parallel
        (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
                                     (s1, s2) -> { s1.addAll(s2); return s1; }))
    );
Set<String> mail = map.get("mail"), url = map.get("url");

Note that this can easily be adapted to find multiple matches within a line:

Map<String, Set<String>> map = Files.lines(filePath)
    .collect(HashMap::new,
        (hm, line) -> {
            // collect every email occurrence within the line
            Matcher m = emailPattern.matcher(line);
            while (m.find())
                hm.computeIfAbsent("mail", x -> new HashSet<>()).add(m.group());
            // switch the same Matcher to the URL pattern and rescan the line
            m.usePattern(urlPattern).reset();
            while (m.find())
                hm.computeIfAbsent("url", x -> new HashSet<>()).add(m.group());
        },
        (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
                                     (s1, s2) -> { s1.addAll(s2); return s1; }))
    );

Upvotes: 4

Tagir Valeev

Reputation: 100309

You can use the partitioningBy collector, though it's still not a very elegant solution.

Map<Boolean, List<String>> map = Files.lines(filePath)
        .filter(str -> urlPattern.matcher(str).matches() ||
                       emailPattern.matcher(str).matches())
        .distinct()
        .collect(Collectors.partitioningBy(str -> urlPattern.matcher(str).matches()));
List<String> urls = map.get(true);
List<String> emails = map.get(false);

If you don't want to apply the regexp twice, you can do it using an intermediate pair object (for example, SimpleEntry):

public static String classify(String str) {
    return urlPattern.matcher(str).matches() ? "url" : 
        emailPattern.matcher(str).matches() ? "email" : null;
}

Map<String, Set<String>> map = Files.lines(filePath)
        .map(str -> new AbstractMap.SimpleEntry<>(classify(str), str))
        .filter(e -> e.getKey() != null)
        .collect(Collectors.groupingBy(e -> e.getKey(),
            Collectors.mapping(e -> e.getValue(), Collectors.toSet())));

Using my free StreamEx library, the last step would be shorter:

Map<String, Set<String>> map = StreamEx.of(Files.lines(filePath))
        .mapToEntry(str -> classify(str), Function.identity())
        .nonNullKeys()
        .grouping(Collectors.toSet());
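
In either case the resulting map is keyed by the strings returned from classify, so the two sets can be read back afterwards; a small usage sketch:

Set<String> urls = map.getOrDefault("url", Collections.emptySet());
Set<String> emails = map.getOrDefault("email", Collections.emptySet());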

Upvotes: 10

Seelenvirtuose

Reputation: 20658

The overall question should be: Why would you want to stream only once?

Extracting the URLs and extracting the emails are different operations and thus should be handled in their own streaming operations. Even if the underlying stream source contains hundreds of thousands of records, the time for iteration is negligible compared to the mapping and filtering operations.

The only thing you should consider as a possible performance issue is the I/O. The cleanest solution therefore is to read the file only once and then stream over the resulting collection twice:

List<String> allLines = Files.readAllLines(filePath);
allLines.stream() ... // here do the URLs
allLines.stream() ... // here do the emails

Of course this requires some memory.
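
A minimal sketch of that two-pass approach, assuming the urlPattern and emailPattern from the question and collecting into sets to drop duplicates:

List<String> allLines = Files.readAllLines(filePath);

Set<String> urls = allLines.stream()
        .map(urlPattern::matcher)
        .filter(Matcher::find)
        .map(Matcher::group)
        .collect(Collectors.toSet());

Set<String> emails = allLines.stream()
        .map(emailPattern::matcher)
        .filter(Matcher::find)
        .map(Matcher::group)
        .collect(Collectors.toSet());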

Upvotes: 0

mhlz

Reputation: 3557

Since you can't reuse a Stream, the only option would be to "do it manually" I think.

Files.lines(filePath).forEach(s -> /* match and sort into two lists */ );
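
For example, a rough sketch of that manual approach, again assuming the question's urlPattern and emailPattern and using sets so duplicates are dropped:

Set<String> urls = new HashSet<>();
Set<String> emails = new HashSet<>();
Files.lines(filePath).forEach(s -> {
    // check each line against both patterns and keep the first match of each
    Matcher url = urlPattern.matcher(s);
    if (url.find())
        urls.add(url.group());
    Matcher email = emailPattern.matcher(s);
    if (email.find())
        emails.add(email.group());
});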

If there's another solution for this though I'd be happy to learn about it!

Upvotes: 1
