Thomas

Reputation: 320

Make column of all matching substrings from a list that are found within a Polars string column

How do I return a column containing every term (or substring) from a list that is found within a string column? I suspect there's a way to do it with pl.any_horizontal(), as suggested in these comments, but I can't quite piece it together. My current approach falls back to map_elements:

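import re

import polars as pl

terms = ['a', 'This', 'e']

(pl.DataFrame({'col': 'This is a sentence'})
   .with_columns(matched_terms = pl.col('col').map_elements(
       # collect every match of any term, then deduplicate
       lambda x: list(set(re.findall('|'.join(terms), x))),
       return_dtype = pl.List(pl.String)))
)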

The matched_terms column should contain ['a', 'This', 'e'] (in any order).

EDIT: The winning solution here, .str.extract_all('|'.join(terms)).list.unique(), differs from the winning solution of this closely related question, pl.col('col').str.split(' ').list.set_intersection(terms). The .set_intersection() approach only compares whole list elements (full words), so it does not pick up terms that occur as substrings inside a word.

Upvotes: 0

Views: 709

Answers (1)

Thomas

Reputation: 320

I've included the related term-matching columns for comparison, but the each_term column, pl.col('a').str.extract_all('|'.join(terms)), was the best solution for me.

import polars as pl

pl.Config.set_fmt_table_cell_list_len(4)  # print up to 4 list items per cell

terms = ['A', 'u', 'bug', 'g']

(pl.DataFrame({'a': 'A bug in a rug.'})
 .select(# three ways to check whether any term occurs
         has_term = pl.col('a').str.contains_any(terms),
         has_term2 = pl.col('a').str.contains('|'.join(terms)),
         has_term3 = pl.any_horizontal(pl.col("a").str.contains(t) for t in terms),
         # every substring that matches any term
         each_term = pl.col('a').str.extract_all('|'.join(terms)),
         # only terms that equal a whole space-separated word
         whole_terms = pl.col('a').str.split(' ').list.set_intersection(terms),
         n_matched_terms = pl.col('a').str.count_matches('|'.join(terms)),
        )
)

shape: (1, 6)
┌──────────┬───────────┬───────────┬────────────────────────┬──────────────┬─────────────────┐
│ has_term ┆ has_term2 ┆ has_term3 ┆ each_term              ┆ whole_terms  ┆ n_matched_terms │
│ ---      ┆ ---       ┆ ---       ┆ ---                    ┆ ---          ┆ ---             │
│ bool     ┆ bool      ┆ bool      ┆ list[str]              ┆ list[str]    ┆ u32             │
╞══════════╪═══════════╪═══════════╪════════════════════════╪══════════════╪═════════════════╡
│ true     ┆ true      ┆ true      ┆ ["A", "bug", "u", "g"] ┆ ["A", "bug"] ┆ 4               │
└──────────┴───────────┴───────────┴────────────────────────┴──────────────┴─────────────────┘
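One caveat with '|'.join(terms): the joined string is a regex, so a term containing a metacharacter such as "." is not matched literally, and a short term listed before a longer one that starts with it (say "bu" before "bug") wins the alternation. The sketch below shows this with Python's re module only. build_pattern is a hypothetical helper, not part of Polars, and I'm assuming the resulting pattern behaves the same way in str.extract_all, which uses a different regex engine, so check it on your own data.

```python
import re


def build_pattern(terms):
    """Hypothetical helper: join literal terms into one regex alternation."""
    # Longest first, so a term is not shadowed by one of its own prefixes.
    ordered = sorted(dict.fromkeys(terms), key=len, reverse=True)
    # re.escape makes metacharacters such as "." match literally.
    return "|".join(re.escape(t) for t in ordered)


text = "A bug in a rug."
terms = ["A", "bu", "bug", "rug."]

# Plain join: "bu" is tried before "bug", and "." matches any character.
naive = re.findall("|".join(terms), text)
print(naive)  # ['A', 'bu', 'rug.']

# Escaped, longest-first pattern.
pattern = build_pattern(terms)
matches = re.findall(pattern, text)
print(matches)  # ['A', 'bug', 'rug.']
```

If the terms are plain words with no shared prefixes, the simple join is fine and this extra step is unnecessary.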

Upvotes: 0
