Reputation: 938
Say you have the following paragraph in a Google Doc and you want to pull out the part of the URL that relates to a car.
Some paragraph with some data in it has a url http://example.com/ford/some/other/data.html. There is also another link: http://example.com/ford/latest.html.
What I am looking for is a way to pull "ford" out of this paragraph so I can use it. For the sake of simplicity, assume I know the paragraph number; I will just call it "1" below.
I have tried:
function getData() {
  var paragraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
  var element = paragraphs[1];
  var re = element.findText('http://example.com/([a-z])+/');
  var data = re.getElement().asText().getText();
  Logger.log(data);
}
The problem is that data contains the entire paragraph text.
Also, is there a way to capture and use the data from a capturing group, i.e. the content in the ()?
Reputation: 10345
As a supplement to Tanaike's answer, this one shows what can be done if you have to use the findText() method (for example, to change element attributes or highlight matched ranges at the same time).
The problem is "data" now is the entire paragraph
Well, this follows directly from the calls that were made:
getElement() returns the Element itself.
asText() on the Element returns a Text instance.
getText() on the Text returns, to quote the docs, "the contents of the element as text string".
is there a way to capture and use the data
With findText() it doesn't seem to be possible, as per the docs at the time of writing. To quote them for posterity:
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.
What to do?
When a match is found, findText() returns a RangeElement instance, which has two methods of interest: getStartOffset() and getEndOffsetInclusive(). The return values of these methods are character indices into the element's text content. Thus, the matched substring can be extracted via the substring() method (or via slice()).
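The offset arithmetic can be tried in plain JavaScript without a document. This is only a sketch: the sample text and the variable names are made up, and the two offsets are computed by hand here, whereas in Apps Script they would come from getStartOffset() and getEndOffsetInclusive().

```javascript
// Hypothetical paragraph text and matched range, for illustration only.
const text = "See http://example.com/ford/latest.html for details.";
const matched = "http://example.com/ford/";

// Stand-ins for RangeElement.getStartOffset() / getEndOffsetInclusive().
const startOffset = text.indexOf(matched);
const endOffsetInclusive = startOffset + matched.length - 1;

// substring() and slice() both exclude their end index,
// so the inclusive end offset needs + 1.
const viaSubstring = text.substring(startOffset, endOffsetInclusive + 1);
const viaSlice = text.slice(startOffset, endOffsetInclusive + 1);

// Forgetting the + 1 silently drops the last matched character.
const offByOne = text.substring(startOffset, endOffsetInclusive);

console.log(viaSubstring);
console.log(offByOne);
```

The only point of the sketch is the + 1: the end offset reported by the range is inclusive, while both string methods treat their second argument as exclusive.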
You can use the from parameter of the findText() method recursively to iterate over the match results and obtain all matching ranges.
/**
 * @summary pattern wrapper
 * @param {string} linkPattern
 * @param {RegExp} [infoPattern]
 */
const matchText = (linkPattern, infoPattern) =>
  /**
   * @summary finds links in text elements
   * @param {GoogleAppsScript.Document.Paragraph} elem
   * @param {string} [text]
   * @param {GoogleAppsScript.Document.RangeElement} [from]
   * @param {string[][]} [matches]
   * @returns {string[][]}
   */
  (elem, text = elem.getText(), from, matches = []) => {
    const match = from ?
      elem.findText(linkPattern, from) :
      elem.findText(linkPattern);
    if (match) {
      const rangeStart = match.getStartOffset();
      const rangeEnd = match.getEndOffsetInclusive();
      // the end offset is inclusive, hence the + 1
      const link = text.substring(rangeStart, rangeEnd + 1);
      // match() returns null when infoPattern does not match the link
      const found = link.match(infoPattern);
      if (found) {
        matches.push(found.slice(1)); // keep the capture groups only
      }
      return matchText(linkPattern, infoPattern)(elem, text, match, matches);
    }
    return matches;
  };
Driver script for testing:
function findText() {
  const doc = getTestDoc(); // gets doc somehow, not provided here
  const body = doc.getBody();
  const par = body.appendParagraph("Some paragraph with some data in it https://example.com/ford/some/other/data.html.\nThere is another link also here https://example.com/ford/latest.html.");
  // backslashes are doubled so that they survive string parsing
  const pattern = 'https?://(?:www\\.)?example\\.com/\\w+';
  const targetPattern = /\/(\w+)$/;
  const results = matchText(pattern, targetPattern)(par);
  Logger.log(results); //[[ford], [ford]]
}
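For experimenting outside Apps Script, the same offset-based search can be sketched as a loop over a stand-in object. Everything named here is an assumption for illustration: fakeParagraph is a hypothetical mock that mimics only getText() and findText(), it uses JavaScript's RegExp instead of the engine Docs uses, and collectGroups is simply an iterative variant of the recursive approach.

```javascript
// Hypothetical stand-in for a Docs Paragraph (NOT the real API object).
const fakeParagraph = (text) => ({
  getText: () => text,
  findText: (pattern, from) => {
    const re = new RegExp(pattern, "g");
    // Resume right after the previous match when one is passed in.
    re.lastIndex = from ? from.getEndOffsetInclusive() + 1 : 0;
    const m = re.exec(text);
    if (!m) return null;
    const start = m.index;
    const end = m.index + m[0].length - 1;
    return {
      getStartOffset: () => start,
      getEndOffsetInclusive: () => end,
    };
  },
});

// Iterative variant: collect the capture groups of infoPattern per match.
function collectGroups(elem, linkPattern, infoPattern) {
  const text = elem.getText();
  const results = [];
  let match = elem.findText(linkPattern);
  while (match) {
    const link = text.substring(
      match.getStartOffset(),
      match.getEndOffsetInclusive() + 1
    );
    const found = link.match(infoPattern);
    if (found) results.push(found.slice(1));
    match = elem.findText(linkPattern, match);
  }
  return results;
}

const par = fakeParagraph(
  "Some data https://example.com/ford/some/other/data.html. " +
  "Another link https://example.com/ford/latest.html."
);
const groups = collectGroups(
  par,
  "https?://(?:www\\.)?example\\.com/\\w+",
  /\/(\w+)$/
);
console.log(JSON.stringify(groups));
```

A loop avoids deep recursion on paragraphs with many links; with a real Paragraph, collectGroups should work unchanged as long as findText() accepts the previous RangeElement as its second argument.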
Notes
To pass special character classes (\w, \s, etc) to the expression string, one has to escape the backslash (e.g. \\w will be parsed correctly).
The function returns string[][] so that all capturing groups () can be extracted.
References
getElement() spec
asText() spec
getText() spec
findText() spec
getStartOffset() spec
getEndOffsetInclusive() spec
substring() docs on MDN
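The note on escaping backslashes can be checked in plain JavaScript. This is a small sketch with made-up variable names; it only shows what the string literal turns into before any regex engine sees it.

```javascript
// In a normal string literal, "\." and "\w" lose their backslash.
const unescaped = 'example\.com/\w+';   // the string is: example.com/w+
const escaped = 'example\\.com/\\w+';   // the string is: example\.com/\w+

// String.raw keeps backslashes as typed, which avoids the doubling.
const raw = String.raw`example\.com/\w+`;

console.log(unescaped);
console.log(escaped);
console.log(raw === escaped);

// Compiled as patterns, only the escaped form matches a word after the slash.
const sample = "example.com/ford";
const matchesEscaped = new RegExp(escaped).test(sample);
const matchesUnescaped = new RegExp(unescaped).test(sample);
```

Here the unescaped string ends in a literal w+, so it only matches when the path starts with the letter "w", which is why the doubled backslashes matter for string patterns.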
Reputation: 201388
I believe your goal is as follows: you want to retrieve ford from values like http://example.com/ford/latest.html and http://example.com/ford/some/other/data.html using Google Apps Script.
For this, how about this modification?
In your script, when element.findText('http://example.com/([a-z])+/') returns a value, re.getElement().asText().getText() is the text of the whole paragraph. This means that the text matching the pattern passed to element.findText() is contained in element. Using this, how about retrieving values like ford from re.getElement().asText().getText()?
From:
var data = re.getElement().asText().getText();
Logger.log(data);
To:
if (re) {
  var data = [...re.getElement().asText().getText().matchAll(/http:\/\/example\.com\/([\w\S]+?)\//g)];
  console.log(data.map(([,e]) => e));
} else {
  throw "Not match."
}
Note: when no text matches the pattern, re is null. Please be careful about this.
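The matchAll() step can also be tried on its own in plain JavaScript. The following is a minimal sketch under some assumptions: paragraphText is a hard-coded copy of the sample paragraph (in Apps Script it would come from the paragraph's getText()), and the character class [^\/]+ is a simplification chosen here rather than the pattern used in the modification.

```javascript
// Sample paragraph text from the question, hard-coded for illustration.
const paragraphText =
  "Some paragraph with some data in it has a url " +
  "http://example.com/ford/some/other/data.html. " +
  "There is also another link: http://example.com/ford/latest.html.";

// One capture group: the first path segment after the host.
const linkPattern = /http:\/\/example\.com\/([^\/]+)\//g;

// matchAll() yields one array per match: [fullMatch, group1, ...].
const makes = [...paragraphText.matchAll(linkPattern)].map(([, group]) => group);

// Both links point at the same make, so de-duplicate if only the value matters.
const uniqueMakes = [...new Set(makes)];

console.log(makes);
console.log(uniqueMakes);
```

Note that matchAll() requires the g flag on the pattern; without it, a TypeError is thrown.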