Reputation: 113
We are using Google's Custom Search JSON API for higher-education research, where we parse a large number of URLs to find information on various organizations' responses to COVID-19. We use the API to find the top search results. However, we have found that the results are inconsistent when we use different search parameters within the API query. The inconsistencies are an issue because we are trying to hone our query to a certain error rate (error rate being how many URLs provide effective research information). We are looking for someone to help explain how Google's API works, because the documentation is extremely minimal. An example of our base query: 'https://www.googleapis.com/customsearch/v1?key=KEY&cx=SEARCHENGINE&q="School Name" intext:(term1 | term2 | term3) -inurl:(unwanted1 | unwanted2 | unwanted3) inurl:(wanted1 | wanted2 | wanted3)&start=1'
Here, "School Name" is the name of a higher-ed institution. Term1, term2, etc. are specific terms that we want to find in the body text of the search results. The intext operator helps to avoid matches on invisible text in certain documents. For example, insidehighered.com includes many higher-ed institutions in invisible text even when the actual article is not applicable. Unwanted1, etc. are words or phrases that we don't want included in the URL. For example, we want to avoid PDF documents, so one could be ".pdf". Wanted1, etc. are words that we do want in the URL, like "news". We use "|" to signify "or", which lets us use one query for multiple types of searches and so helps to minimize the cost of our API usage.
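For reference, here is a minimal sketch of how a query with this shape could be assembled so that the quotes, spaces, "|" and parentheses are percent-encoded rather than pasted into the URL by hand. The endpoint and the key, cx, q, and start parameters come from the base query above; the helper names or_group and build_query_url are hypothetical, and the term lists are the same placeholders used in the example. This only illustrates the construction and is not a claim about what causes the inconsistencies.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://www.googleapis.com/customsearch/v1"


def or_group(operator, terms):
    """Return e.g. 'intext:(a | b | c)' for an operator and a list of terms."""
    return f"{operator}:({' | '.join(terms)})"


def build_query_url(key, cx, school, terms, unwanted, wanted, start=1):
    """Assemble the full request URL (hypothetical helper, not part of the API)."""
    q = " ".join([
        f'"{school}"',                      # exact-phrase match on the school name
        or_group("intext", terms),          # terms wanted in the body text
        "-" + or_group("inurl", unwanted),  # words to exclude from the URL
        or_group("inurl", wanted),          # words wanted in the URL
    ])
    params = {"key": key, "cx": cx, "q": q, "start": start}
    # urlencode percent-encodes quotes, pipes, colons, and parentheses
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url(
    "KEY", "SEARCHENGINE", "School Name",
    ["term1", "term2", "term3"],
    ["unwanted1", "unwanted2", "unwanted3"],
    ["wanted1", "wanted2", "wanted3"],
)
# Decoding the URL again recovers the q string from the base query
decoded_q = parse_qs(urlparse(url).query)["q"][0]
print(url)
print(decoded_q)
```

The same params dictionary could instead be passed to requests.get(BASE_URL, params=params), which performs the encoding itself.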
So far, we've found the following issues/inconsistencies:
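The code we use to pull the items off of each page is:
    import requests

    # `query` is the full request URL; `num` is how many results to keep
    response = requests.get(query)
    content = response.json()
    hrefs = []
    try:
        # Collect the lowercased link of each of the first `num` results
        for i in content['items'][0:num]:
            hrefs.append(i['link'].lower())
    except Exception as e:
        # e.g. KeyError when the response contains no 'items'
        print(str(e))
        hrefs.append('a')  # placeholder entry for a failed page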
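A slightly more defensive variant of the extraction step might look like the sketch below. It rests on two assumptions that should be checked against real responses: that a page with no results simply lacks the 'items' key, and that a failed request returns a JSON body with an 'error' object containing a 'message'. The function name extract_links is hypothetical. Unlike the code above, it returns an empty list rather than the 'a' placeholder, so an empty page and a failed request can be told apart.

```python
def extract_links(content, num):
    """Return up to `num` lowercased result links from a parsed JSON response."""
    if "error" in content:
        # Assumed error shape: {"error": {"message": "..."}}
        message = content["error"].get("message", "unknown API error")
        raise RuntimeError(message)
    # Assumption: a response with no results has no 'items' key at all
    items = content.get("items", [])
    return [item["link"].lower() for item in items[:num]]


sample = {
    "items": [
        {"link": "https://Example.edu/News/Update"},
        {"link": "https://example.org/covid"},
    ]
}
print(extract_links(sample, 1))
print(extract_links({}, 10))
```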
Thank you!
Upvotes: 7
Views: 2032
Reputation: 193
You are very unlikely to find an answer to this question that satisfies the criteria you're looking for, I'm afraid.
Google, in order to protect its trade secrets (among other things), is extremely secretive about its internals with regards to its search engine algorithms. What we do know, from official sources, is the following:
The first of these points is probably the most important here. Searching through the API likely does not grant your search any sort of special treatment, which is unusual for API behavior but sort of expected for Google. Google will happily bend its own rules for the sake of user experience, and I strongly suspect that your searches are falling victim to this sort of behavior on their end. Additionally, given the circumstances, it's likely that they've hard-coded special handling for COVID-19 searches directly into the engine's behavior, which might be further complicating things.
I wish I had better news for you, but you're probably just going to have to work with whatever weird and inconsistent results the search engine spits back out at you. The results will almost certainly not be reproducible by others, and because of the fifth point listed above they may not even be reproducible by yourself later on. I'm sorry.
Upvotes: 4