exrolex

Reputation: 13

How do I get the occurrence of a sentence with Google Ngram Viewer and Python?

Short background: I am trying to enhance Peter Norvig's spelling corrector in Python. For this I need the occurrence (frequency) of a short sentence (up to 3-4 words). The Google Ngram Viewer would help me a lot, but I don't know how to get the value through an API or some other way.

Pseudocode:

# A sentence that makes no sense, although every word is spelled correctly.
>> occurrence("were are you")
0.0000000978

# A sentence that makes sense.
>> occurrence("where are you")
0.000148

# My method should then return the sentence with the highest value. (But that's not the problem.)

Sorry for my English :-D Thank you!

Upvotes: 1

Views: 579

Answers (2)

Martin Trenkmann

Reputation: 548

As of 2023 there is NGRAMS, a search engine and REST API for the Google Books Ngram Dataset v3. You can accomplish your task with a single request:

curl -G https://api.ngrams.dev/eng/search \
--data-urlencode query='"were are you" / "where are you"' \
-d flags=cs

The query "were are you" / "where are you" asks to look up both were are you and where are you in the English corpus. flags=cs triggers a case-sensitive search. By default NGRAMS is case-insensitive.

Output

{
  "queryTokens": [
    { "text": "\"were are you\"", "type": "TERM_GROUP" },
    { "text": "/", "type": "SLASH" },
    { "text": "\"where are you\"", "type": "TERM_GROUP" }
  ],
  "ngrams": [
    {
      "id": "b27a70f8e4b27baddd382ab96a986569",
      "absTotalMatchCount": 1015443,
      "relTotalMatchCount": 5.083529835293575e-7,
      "tokens": [
        { "text": "where", "type": "TERM" },
        { "text": "are", "type": "TERM" },
        { "text": "you", "type": "TERM" }
      ]
    },
    {
      "id": "36a5fb73ab627ac0a0c1b5559627cb39",
      "absTotalMatchCount": 655,
      "relTotalMatchCount": 3.2790733129454743e-10,
      "tokens": [
        { "text": "were", "type": "TERM" },
        { "text": "are", "type": "TERM" },
        { "text": "you", "type": "TERM" }
      ]
    }
  ]
}

The relTotalMatchCount property is the frequency you are looking for.
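Since the question asks for Python, here is a minimal sketch of the same request using only the standard library. It assumes the endpoint, the `flags=cs` parameter, and the response shape are exactly as in the curl example and output above; the helper names (`build_url`, `parse_frequencies`, `fetch_frequencies`, `most_frequent`) are made up for illustration and are not part of NGRAMS.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
NGRAMS_SEARCH_URL = "https://api.ngrams.dev/eng/search"


def build_url(phrases, case_sensitive=True):
    """Build the search URL for several quoted phrases separated by ' / '."""
    params = {"query": " / ".join(f'"{p}"' for p in phrases)}
    if case_sensitive:
        params["flags"] = "cs"
    # quote_via=quote encodes spaces as %20 rather than '+'.
    query_string = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return f"{NGRAMS_SEARCH_URL}?{query_string}"


def parse_frequencies(payload):
    """Map each returned n-gram (tokens joined by spaces) to relTotalMatchCount."""
    return {
        " ".join(token["text"] for token in ngram["tokens"]): ngram["relTotalMatchCount"]
        for ngram in payload.get("ngrams", [])
    }


def fetch_frequencies(phrases, case_sensitive=True):
    """Run the query against the live service (performs a network call)."""
    with urllib.request.urlopen(build_url(phrases, case_sensitive), timeout=10) as resp:
        return parse_frequencies(json.load(resp))


def most_frequent(frequencies):
    """Return the phrase with the highest relative frequency, or None if empty."""
    return max(frequencies, key=frequencies.get) if frequencies else None
```

With this sketch, `most_frequent(fetch_frequencies(["were are you", "where are you"]))` would pick the candidate with the larger `relTotalMatchCount`. Phrases that the service does not return are simply absent from the dictionary, so a caller may want to treat them as frequency 0.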

Upvotes: 1

dimid

Reputation: 7631

The Google Ngram Viewer actually has an undocumented JSON API.

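import requests

term = "where are you"
# Let requests URL-encode the parameters (the phrase contains spaces).
params = {
    "content": term,
    "year_start": 1800,
    "year_end": 2000,
    "corpus": 26,
    "smoothing": 3,
}
resp = requests.get("https://books.google.com/ngrams/json", params=params)
resp.raise_for_status()  # fail loudly instead of leaving results undefined
results = resp.json()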

results[0]['timeseries'] holds the frequencies you need, one value per year in the requested range:

[2.854326695000964e-07,
 3.4926038665616944e-07,
 3.3916604043800663e-07,
 ...]

Source: https://jameshfisher.com/2018/11/25/google-ngram-api/
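The question asks for a single number per phrase, while `timeseries` is a list of yearly values. One possible way to bridge that gap is to average the series, sketched below. The choice of the mean, the optional `last_n_years` window, and the treatment of an empty result list as "not found" are all assumptions made for illustration; the function names are hypothetical and not part of the undocumented API.

```python
from statistics import mean


def occurrence(results, last_n_years=None):
    """Collapse the yearly frequencies of the first result into one number.

    `results` is the parsed JSON list from the request above.
    """
    if not results:
        # Assumption: an empty list means the phrase was not found.
        return 0.0
    series = results[0]["timeseries"]
    if last_n_years:
        # Optionally look only at the most recent years of the range.
        series = series[-last_n_years:]
    return mean(series) if series else 0.0


def best_candidate(results_by_phrase):
    """Return the phrase whose averaged frequency is highest."""
    return max(results_by_phrase, key=lambda p: occurrence(results_by_phrase[p]))
```

Given a dictionary that maps each candidate phrase to its parsed JSON response, `best_candidate` returns the phrase the spelling corrector would prefer. Averaging over a recent window instead of the full 1800-2000 range may suit modern text better, but that is a design choice to evaluate rather than a given.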

Upvotes: 3
