jordan2175

Reputation: 938

Get path element from URL using findText()

Say you have the following paragraph in a Google Doc and you want to pull out the path element of the URL that relates to a car.

Some paragraph with some data in it has a url http://example.com/ford/some/other/data.html. There is also another link: http://example.com/ford/latest.html.

What I am looking for is a way to pull "ford" out of this paragraph so I can use it. For the sake of simplicity, assume I know the paragraph number; I will just call it "1" below.

I have tried:

function getData() {
  var paragraphs = DocumentApp.getActiveDocument().getBody().getParagraphs();
  var element = paragraphs[1];
  var re = element.findText('http://example.com/([a-z])+/');
  var data = re.getElement().asText().getText();
  Logger.log(data);
}

The problem is that data contains the entire paragraph text.

Also, is there a way to capture and use the data from a capturing group, i.e. the content in the ()?

Upvotes: 1

Views: 149

Answers (2)

0Valt

Reputation: 10345

As a supplement to Tanaike's answer, this answer shows what can be done if you have to use the findText() method (for example, to change element attributes or highlight matched ranges at the same time).


The problem is "data" now is the entire paragraph

This follows directly from the chain of calls used:

  1. The result of getElement() is the Element itself.
  2. The result of asText() on that Element is a Text instance.
  3. The result of getText() on that Text is, to quote the docs:

the contents of the element as text string


is there a way to capture and use the data

With findText() alone this does not seem to be possible according to the docs at the time of writing. To quote them for posterity:

A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.


What to do?

When a match is found, findText() returns a RangeElement instance which has two methods of interest: getStartOffset() and getEndOffsetInclusive(). The return values of these methods point to character indices of the element's text content. Thus, the matched substring can be extracted via a substring() method (or via slice()).
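
As a minimal sketch of that offset arithmetic in plain JavaScript (so it can be run outside Apps Script): fakeRange below is a hypothetical stand-in for a RangeElement, and its offsets are computed with indexOf() rather than findText(), purely for illustration.

```javascript
// Paragraph text as it would come back from elem.getText().
const text = "Some paragraph with a url http://example.com/ford/latest.html.";

// Hypothetical stand-in for the RangeElement returned by findText();
// the offsets are derived with indexOf() only to keep the sketch runnable.
const needle = "http://example.com/ford/";
const start = text.indexOf(needle);
const fakeRange = {
  getStartOffset: () => start,
  getEndOffsetInclusive: () => start + needle.length - 1,
};

// The end offset is inclusive, so add 1 for substring() (or slice()).
const matched = text.substring(
  fakeRange.getStartOffset(),
  fakeRange.getEndOffsetInclusive() + 1
);

console.log(matched); // http://example.com/ford/
```

Forgetting the + 1 would drop the last character of the match, since substring() treats its end index as exclusive.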

You can use the from parameter of the findText() method recursively to iterate over the matches and collect all matching ranges.

/**
 * @summary pattern wrapper
 * @param {string} linkPattern
 * @param {RegExp} [infoPattern]
 */
const matchText = (linkPattern, infoPattern) => 

  /**
   * @summary finds links in text elements
   * @param {GoogleAppsScript.Document.Paragraph} elem
   * @param {string} [text]
   * @param {GoogleAppsScript.Document.RangeElement} [from]
   * @param {string[][]} [matches]
   * @returns {string[][]}
   */ 
  (elem, text = elem.getText(), from, matches = []) => {

    const match = from ? 
      elem.findText(linkPattern, from) : 
      elem.findText(linkPattern);

    if(match) {
       const rangeStart = match.getStartOffset();
       const rangeEnd = match.getEndOffsetInclusive();

       const link = text.substring( rangeStart, rangeEnd + 1 );
       // match() returns null when infoPattern does not match, so fall back to []
       const [ full, ...groups ] = link.match( infoPattern ) || [];

       matches.push(groups);

       return matchText(linkPattern, infoPattern)(elem, text, match, matches);
    }

    return matches;
  }

Driver script for testing:

function findText() {
  const doc = getTestDoc(); //gets doc somehow, not provided here

  const body = doc.getBody();

  const par = body.appendParagraph("Some paragraph with some data in it https://example.com/ford/some/other/data.html.\nThere is another link also here https://example.com/ford/latest.html.");

  // backslashes are doubled because the pattern is passed as a string
  const pattern = 'https?://(?:www\\.)?example\\.com/\\w+';
  const targetPattern = /\/(\w+)$/;

  const results = matchText(pattern,targetPattern)(par);

  Logger.log(results); //[[ford], [ford]]
}
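
The second step, applying the info pattern to each matched link, can also be tried on its own in plain JavaScript. This is only a rough sketch: sampleLinks is hard-coded data standing in for the substrings that findText() located, and extractGroups is a hypothetical helper name, not part of any API.

```javascript
// Sample substrings standing in for what findText() would have matched.
const sampleLinks = ["https://example.com/ford", "http://www.example.com/audi"];

// Same second-stage pattern as targetPattern in the driver script.
const lastSegmentPattern = /\/(\w+)$/;

// Hypothetical helper: returns only the capture groups,
// or an empty array when the pattern does not match.
const extractGroups = (link, pattern) => {
  const [, ...groups] = link.match(pattern) || [];
  return groups;
};

const extracted = sampleLinks.map((link) => extractGroups(link, lastSegmentPattern));

console.log(extracted); // [["ford"], ["audi"]]
```

Returning an empty array for a non-matching link keeps the result shape consistent (always one inner array per link), at the cost of hiding which links failed to match.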

Notes

  1. Curious observation: because the pattern is passed to findText() as a string, the backslash of tokens (\w, \s, etc.) has to be escaped in the string (e.g. \\w will be parsed correctly).
  2. The solution above returns a string[][] so that all capturing groups () of every match are available.
  3. The example code above is designed for the V8 runtime.
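
The first note comes from how JavaScript string literals work rather than from anything specific to findText(): in an ordinary string, an unrecognized escape such as \w or \. simply loses its backslash before any pattern engine sees it. A small sketch in plain JavaScript; the variable names are illustrative, and RegExp is used here only as a stand-in to make the difference visible (findText() uses its own engine).

```javascript
// In an ordinary string literal, "\w" is just "w": the backslash is dropped.
const singleEscaped = "example\.com/\w+";
// Doubling the backslash keeps it in the resulting string.
const doubleEscaped = "example\\.com/\\w+";

console.log(singleEscaped); // example.com/w+
console.log(doubleEscaped); // example\.com/\w+

// Building a RegExp from each string shows the practical difference.
console.log(new RegExp(singleEscaped).test("example.com/ford")); // false
console.log(new RegExp(doubleEscaped).test("example.com/ford")); // true
```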

References

  1. getElement() spec
  2. asText() spec
  3. getText() spec
  4. findText() spec
  5. getStartOffset() spec
  6. getEndOffsetInclusive() spec
  7. substring() docs on MDN

Upvotes: 1

Tanaike

Reputation: 201388

I believe your goal is as follows.

  • You want to retrieve the value ford from values like http://example.com/ford/latest.html and http://example.com/ford/some/other/data.html using Google Apps Script.
  • Those values are in a Google Document.

For this, how about this modification?

Modification points:

In your script, when element.findText('http://example.com/([a-z])+/') finds a match, re.getElement().asText().getText() returns the whole text of the paragraph. This means the text that matched the pattern is contained in that paragraph text. So how about extracting values like ford from re.getElement().asText().getText() with a regular expression?

Modified script:

From:
var data = re.getElement().asText().getText();
Logger.log(data);
To:
if (re) {
  var data = [...re.getElement().asText().getText().matchAll(/http:\/\/example\.com\/([\w\S]+?)\//g)];
  console.log(data.map(([,e]) => e));
} else {
  throw new Error("No match.");
}
  • When the paragraph has no text that matches your regex, re is null, so check for that before using it.
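
The matchAll() part can be checked outside Apps Script as well. A small sketch in plain JavaScript, where paragraphText is a hard-coded stand-in for the value of re.getElement().asText().getText() and the other names are illustrative:

```javascript
// Stand-in for the paragraph text returned by re.getElement().asText().getText().
const paragraphText =
  "Some paragraph with some data in it has a url http://example.com/ford/some/other/data.html. " +
  "There is also another link: http://example.com/ford/latest.html.";

// matchAll() requires the g flag; the group captures the first path segment.
const segmentPattern = /http:\/\/example\.com\/([\w\S]+?)\//g;

// Each match is [fullMatch, group1]; keep only the captured group.
const segments = [...paragraphText.matchAll(segmentPattern)].map(([, segment]) => segment);

console.log(segments); // ["ford", "ford"]
```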

Note:

  • Please enable the V8 runtime when using this script.

Upvotes: 3
