MehmedB

Reputation: 1137

How to apply regex to a column and save regex match groups into multiple columns?

I am using OpenRefine to clean my dataset. I am trying to apply a regex to a column, and that regex returns multiple matching groups. I want to save each group into its own new column. I can apply the regex with Edit column > Add column based on column. After selecting Python / Jython as the language, I enter the expression shown below:

import re 
regex = r"custom_regex"
value = re.findall(regex, value)
# Check whether anything matched the regex and, if so, return the first match
if len(value) > 0:
    # To get a single group instead, return value[0][0], value[0][1], value[0][2], etc.
    return value[0]
else:
    # No match: return a message instead of an empty list
    return "No Match"

But with this method, I can create only one column at a time. Is there a way to create as many columns as there are regex matching groups?

Upvotes: 0

Views: 567

Answers (1)

Ettore Rizza

Reputation: 2830

You cannot directly create more than one new column at a time with OpenRefine. However, you can simplify your script by using GREL instead of Python:

if(value.find(/YOUR REGEX/).length() > 0, value.find(/YOUR REGEX/).join("|"), "No match")

The .find() method in GREL (OpenRefine version >= 3) is pretty similar to re.findall() in Python.

Store the result in a new column, then use Edit column > Split into several columns with a pipe (|) as the separator to produce as many new columns as you have groups.


The Jython equivalent is probably something like this:

value = "1995 is a year"

Code

import re 
regex = r"(\d+).+?(year)"
match = re.findall(regex, value)
if match:
    return "|".join(match[0])
else:
    return "No Match" 

Result

1995|year
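
If you want to test the logic outside OpenRefine first, here is a minimal sketch in plain Python. Note that it swaps re.findall() for re.search(): with a single group or no group at all, re.findall() returns plain strings rather than tuples, and "|".join() would then put a pipe between every character. The helper names join_groups and split_cell are hypothetical, and split_cell only imitates what the split-into-columns step does with the pipe separator.

```python
import re

def join_groups(value, regex, sep="|", no_match="No Match"):
    """Join the capture groups of the first match of regex in value with sep."""
    match = re.search(regex, value)
    if match is None:
        return no_match
    groups = match.groups()
    if not groups:
        # The pattern has no capture groups: fall back to the whole match.
        return match.group(0)
    # An optional group that did not take part in the match is None; use "" instead.
    return sep.join(g if g is not None else "" for g in groups)

def split_cell(cell, sep="|"):
    """Imitate the split step: one list item per future column."""
    return cell.split(sep)

joined = join_groups("1995 is a year", r"(\d+).+?(year)")
print(joined)                # 1995|year
print(split_cell(joined))    # ['1995', 'year']
print(join_groups("no digits here", r"(\d+).+?(year)"))  # No Match
```

Inside an OpenRefine Jython expression you could presumably define join_groups the same way and finish with return join_groups(value, r"YOUR REGEX"). Keep in mind that the separator must not occur in the captured text, otherwise the split will produce extra columns.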

Upvotes: 1
