Reputation: 13
Short background: I am trying to enhance Peter Norvig's spelling corrector in Python. For that I need the occurrence frequency of a short phrase (up to 3-4 words). The Google Ngram Viewer would help me a lot, but I don't know how to get the value through an API or something similar.
pseudocode:
# Sentence without meaning but word for word correct.
>> occurrence("were are you")
0.0000000978
# Sentence that makes sense
>> occurrence("where are you")
0.000148
# My method should then return the sentence with the highest value. (But that's not the problem.)
Sorry for my English :-D Thank you!
Upvotes: 1
Views: 579
Reputation: 548
As of 2023 there is NGRAMS, a search engine and REST API for the Google Books Ngram Dataset v3. You can accomplish your task with a single request:
curl -G https://api.ngrams.dev/eng/search \
--data-urlencode query='"were are you" / "where are you"' \
-d flags=cs
The query "were are you" / "where are you" asks to look up both were are you and where are you in the English corpus. flags=cs triggers a case-sensitive search. By default NGRAMS is case-insensitive.
Output
{
  "queryTokens": [
    { "text": "\"were are you\"", "type": "TERM_GROUP" },
    { "text": "/", "type": "SLASH" },
    { "text": "\"where are you\"", "type": "TERM_GROUP" }
  ],
  "ngrams": [
    {
      "id": "b27a70f8e4b27baddd382ab96a986569",
      "absTotalMatchCount": 1015443,
      "relTotalMatchCount": 5.083529835293575e-7,
      "tokens": [
        { "text": "where", "type": "TERM" },
        { "text": "are", "type": "TERM" },
        { "text": "you", "type": "TERM" }
      ]
    },
    {
      "id": "36a5fb73ab627ac0a0c1b5559627cb39",
      "absTotalMatchCount": 655,
      "relTotalMatchCount": 3.2790733129454743e-10,
      "tokens": [
        { "text": "were", "type": "TERM" },
        { "text": "are", "type": "TERM" },
        { "text": "you", "type": "TERM" }
      ]
    }
  ]
}
The relTotalMatchCount property is the frequency you are looking for.
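For use from Python, the same request could be sketched roughly as follows. This is only a minimal sketch: it assumes the response always has the shape shown in the output above, and the helper names (fetch_ngrams, relative_frequencies, most_frequent) are made up for illustration, not part of NGRAMS. Phrases that the API does not return are treated as having frequency 0.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.ngrams.dev/eng/search"  # endpoint from the curl example


def fetch_ngrams(phrases, case_sensitive=True):
    """Query NGRAMS for several phrases at once and return the decoded JSON."""
    # Quote each phrase and join with " / ", mirroring the curl query.
    query = " / ".join(f'"{p}"' for p in phrases)
    params = {"query": query}
    if case_sensitive:
        params["flags"] = "cs"
    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def relative_frequencies(response):
    """Map each returned n-gram (tokens joined by spaces) to relTotalMatchCount."""
    # Assumption: every entry has "tokens" and "relTotalMatchCount" as in the sample.
    return {
        " ".join(tok["text"] for tok in ng["tokens"]): ng["relTotalMatchCount"]
        for ng in response.get("ngrams", [])
    }


def most_frequent(phrases, response):
    """Return the candidate with the highest frequency; unseen phrases count as 0."""
    freqs = relative_frequencies(response)
    return max(phrases, key=lambda p: freqs.get(p, 0.0))
```

Usage would be along the lines of most_frequent(candidates, fetch_ngrams(candidates)). With a case-insensitive search the returned tokens may differ in case from your candidates, so the dictionary lookup is only reliable when flags=cs is set.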
Upvotes: 1
Reputation: 7631
The Google Ngram Viewer actually has an undocumented JSON API.
import requests

term = "where are you"
# Let requests build and URL-encode the query string (the phrase contains spaces).
params = {"content": term, "year_start": 1800, "year_end": 2000, "corpus": 26, "smoothing": 3}
resp = requests.get("https://books.google.com/ngrams/json", params=params)
if resp.ok:
    results = resp.json()
results[0]['timeseries'] has the frequencies you need:
[2.854326695000964e-07,
3.4926038665616944e-07,
3.3916604043800663e-07,
...]
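Since the timeseries holds one value per year, you still need to reduce it to a single score per phrase before comparing candidates. One possible way, sketched here, is to average the yearly values. The helper names are hypothetical, and treating an empty result list as frequency 0 is an assumption about how this undocumented endpoint reports unknown phrases.

```python
def mean_frequency(results):
    """Average the yearly frequencies of the first result; 0.0 if nothing matched."""
    # Assumption: `results` is the decoded JSON list, possibly empty.
    if not results:
        return 0.0
    series = results[0]["timeseries"]
    return sum(series) / len(series) if series else 0.0


def best_candidate(results_by_phrase):
    """Pick the phrase whose averaged frequency is highest.

    `results_by_phrase` maps each candidate phrase to its decoded JSON response.
    """
    return max(results_by_phrase, key=lambda p: mean_frequency(results_by_phrase[p]))
```

Averaging over 1800-2000 weights old and recent usage equally; for spelling correction you may prefer a narrower, more recent year range in the request.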
Source: https://jameshfisher.com/2018/11/25/google-ngram-api/
Upvotes: 3