Reputation: 43
I have a large corpus of text data that I'm pre-processing with OpenRefine for document classification with MALLET.
Some of the cells are long (>150,000 characters) and I'm trying to split them into <1,000 word/token segments.
I'm able to split long cells into 6,000 character chunks using the "Split multi-valued cells" by field length, which roughly translates to 1,000 word/token chunks, but it splits words across rows, so I'm losing some of my data.
Is there a function I could use to split long cells by the first whitespace (" ") after every 6,000th character, or even better, split every 1,000 words?
Upvotes: 4
Views: 435
Reputation: 2830
The simplest way is probably to split your text on spaces, insert a very rare character (or group of characters) after each group of 1,000 elements, rejoin the text, and then use "Split multi-valued cells" with that rare separator.
You can do that in GREL, but it is much clearer if you choose "Python/Jython" as the script language.
So: Edit cells -> Transform -> Python/Jython:
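my_list = value.split(' ')
n = 1000
i = n
while i < len(my_list):
    # Insert the separator after each group of n words
    my_list.insert(i, '|||')
    # Skip the next n words plus the separator just inserted
    i += n + 1
return " ".join(my_list)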
(For an explanation of this script, see here)
Here is a more compact version:
text = value.split(' ')
n = 1000
return "|||".join([' '.join(text[i:i+n]) for i in range(0,len(text),n)])
You can then split using ||| as separator.
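To see what the compact version produces, here is a minimal sketch that wraps the same logic in a function and runs it on a short string with n = 3 instead of 1000. The function name chunk_words and the sample text are made up for illustration; they are not part of OpenRefine.

```python
def chunk_words(text, n=1000, sep="|||"):
    # Split on single spaces, as in the scripts above
    words = text.split(' ')
    # Rejoin each group of n words, then join the groups with the separator
    groups = [' '.join(words[i:i + n]) for i in range(0, len(words), n)]
    return sep.join(groups)

# Small demonstration with groups of 3 words
sample = "one two three four five six seven"
marked = chunk_words(sample, n=3)
print(marked)  # one two three|||four five six|||seven
```

Inside the OpenRefine expression box, the demonstration lines should be replaced by a final `return chunk_words(value)`. Note that split(' ') counts the empty strings between consecutive spaces as elements; value.split() with no argument splits on any run of whitespace instead.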
If you prefer to split by characters instead of words, it looks like you can do that in two lines with textwrap, which breaks at whitespace so that each chunk is at most 6,000 characters long:
import textwrap
return "|||".join(textwrap.wrap(value, 6000))
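The question also asks for a split at the first whitespace after every 6,000th character. A minimal sketch of that variant with a regular expression is below; the function name split_after_width is hypothetical, and the demonstration uses a width of 10 so the result is easy to check. Chunks can be slightly longer than the width, because each one runs to the end of the word in progress.

```python
import re

def split_after_width(text, width=6000, sep="|||"):
    # .{width} consumes exactly `width` characters (re.S lets "." match newlines),
    # \S* runs on to the end of the word in progress,
    # \s+ is the whitespace run that is replaced by the separator.
    pattern = r'(.{%d}\S*)\s+' % width
    return re.sub(pattern, lambda m: m.group(1) + sep, text, flags=re.S)

# Small demonstration with a width of 10 characters
sample = "alpha beta gamma delta epsilon"
marked = split_after_width(sample, width=10)
print(marked)  # alpha beta|||gamma delta|||epsilon
```

In the OpenRefine expression box, the demonstration lines would become `return split_after_width(value)`, followed by the same "Split multi-valued cells" step with ||| as separator.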
Upvotes: 1
Reputation: 173
Here is my simple solution:
Go to Edit cells -> Transform and enter
value.replace(/((\s+\S+?){999})\s+/,"$1@@@")
This replaces every 1,000th run of whitespace with @@@ (consecutive whitespace characters count as one run and are replaced together when they fall on the split boundary). You can choose any token you like, as long as it doesn't appear in the original text.
Then go to Edit cells -> Split multi-valued cells and split using the token @@@ as separator.
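To check how the pattern counts words, here is a rough Python sketch of the same regular expression, using Python's re module as a stand-in for the regex engine behind GREL (an assumption; the two engines agree on this pattern's syntax, but this is not OpenRefine code). The function name mark_every_nth_word is hypothetical, and the demonstration uses n = 3 rather than 1000.

```python
import re

def mark_every_nth_word(text, n=1000, token="@@@"):
    # (\s+\S+?){n-1} matches n-1 further words after the first word of a chunk;
    # the trailing \s+ is the whitespace run replaced by the token.
    pattern = r'((\s+\S+?){%d})\s+' % (n - 1)
    return re.sub(pattern, lambda m: m.group(1) + token, text)

# Double spaces and newlines each count as a single whitespace run
sample = "one  two three four\nfive six seven"
marked = mark_every_nth_word(sample, n=3)
print(marked.split("@@@"))
```

Whitespace inside a chunk is kept as it was; only the run at each boundary is swapped for the token.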
Upvotes: 2