nxl10

Reputation: 113

Inconsistent search results using Google's Custom Search JSON API

We are using Google's Custom Search JSON API for higher-education research: we parse a large number of URLs to find information on various organizations' responses to COVID-19, and we use the API to retrieve the top search results. However, we get inconsistent results when we vary the search operators within the API query. The inconsistencies are a problem because we are trying to tune our query to a certain error rate (error rate being how many of the returned URLs provide effective research information). We are looking for someone to help explain how Google's API works, because the documentation is extremely minimal. An example of our base query:

'https://www.googleapis.com/customsearch/v1?key=KEY&cx=SEARCHENGINE&q="School Name" intext:(term1 | term2 | term3) -inurl:(unwanted1 | unwanted2 | unwanted3) inurl:(wanted1 | wanted2 | wanted3)&start=1'

Here, "School Name" is the name of a higher-ed institution. Term1, term2, etc. are specific terms that we want to find in the body text of the search results. The intext operator helps to avoid matches on invisible text in certain documents. For example, insidehighered.com includes many higher-ed institutions in invisible text even when the actual article is not applicable. Unwanted1, etc. are words or phrases that we don't want to appear in the URL. For example, we want to avoid PDF documents, so one could be ".pdf". Wanted1, etc. are words that we do want in the URL, like "news". We use "|" to signify "or", which lets us cover multiple types of searches with one query and so helps minimize the cost of our API usage.
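A minimal sketch of how the base query above could be assembled in Python, so that the quotes, pipes, and parentheses are percent-encoded consistently on every request. The helper names (`or_group`, `build_query_url`) and their arguments are hypothetical and not part of our original code; only the endpoint and the `key`, `cx`, `q`, and `start` parameters come from the query shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/customsearch/v1"


def or_group(operator, terms):
    """Return e.g. 'inurl:(a | b | c)' for the given operator and terms."""
    return f"{operator}:({' | '.join(terms)})"


def build_query_url(key, cx, school, body_terms, wanted, unwanted, start=1):
    """Build the request URL in the same operator order as the base query."""
    q = " ".join([
        f'"{school}"',                    # exact-phrase school name
        or_group("intext", body_terms),   # terms wanted in the body text
        or_group("-inurl", unwanted),     # terms excluded from the URL
        or_group("inurl", wanted),        # terms wanted in the URL
    ])
    params = {"key": key, "cx": cx, "q": q, "start": start}
    # urlencode percent-encodes quotes, pipes, colons, and parentheses.
    return BASE_URL + "?" + urlencode(params)


url = build_query_url(
    "KEY", "SEARCHENGINE", "School Name",
    body_terms=["term1", "term2"],
    wanted=["news"],
    unwanted=[".pdf"],
    start=11,
)
print(url)
```

Building the string in one place fixes the operator order, which matters given that reordering the operators changes the results.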

So far, we've found the following issues/inconsistencies:

  1. "-" and "NOT" to negate terms return different results.
  2. The order of parameters matters. For example, "inurl:(some wanted search terms) -inurl:(some unwanted search terms)" returns different results than "-inurl:(some unwanted search terms) inurl:(some wanted search terms)"
  3. Nesting of terms is also inconsistent. For example, "inurl:( (wanted terms | wanted terms) NOT(unwanted terms | unwanted terms))" returns different results than "inurl:(wanted term | wanted term| NOT unwanted term NOT unwanted term)"
  4. Furthermore, the API occasionally returns different results when the exact same query is run twice. It seems that the query returns 10 results but sporadically mixes in 1 or 2 at the end from either the next page or somewhere else. For example, this query: "https://www.googleapis.com/customsearch/v1?key=KEY&cx=SEARCHENGINE&q="Miami University-Hamilton" intext:(reduce tuition | freeze tuition | decrease tuition | lower tuition) inurl:(news | announcement | article | story) -inurl:(registrar | admissions | tuition-and-fees | tuition-schedule | schedule | state | office | employment-opportunities | about-us | about | linkedin | events | .uk | irs | .gov | information-technology | wikipedia | wiki | employee-handbook | student-handbook | shop | annual | youtube | pinterest | store | openings | indeed | amazon | contact | job-board | jobboard | policies | frequently-asked-questions | faq | forms | hours | academic-calendar | calendar | directory | glassdoor | facebook | encyclopedia)&start=31" (and then start=41 for the next page) returns "http://www.harbison.one/archive/z_1985_national_cc_directory.pdf" as both the last item on the 4th page and the 1st item on the 5th page. When we run our GET request, it sometimes returns a different result for the last item on the 4th page, but then returns that same duplicate URL on both pages.
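The cross-page duplicate described in item 4 could be detected and dropped on the client side. The following is only a sketch: `merge_pages` is a hypothetical helper, and it assumes each page has already been reduced to a list of link strings. It does not explain why the API repeats a result; it only keeps the repeat out of the final list.

```python
def merge_pages(pages):
    """Merge per-page link lists, keeping the first occurrence of each link.

    Returns (unique_links, repeats), where repeats is a list of
    (page_number, link) pairs for links already seen on an earlier page
    or earlier on the same page.
    """
    seen = set()
    unique, repeats = [], []
    for page_number, links in enumerate(pages, start=1):
        for link in links:
            if link in seen:
                repeats.append((page_number, link))
            else:
                seen.add(link)
                unique.append(link)
    return unique, repeats


# Two made-up pages where the last link of page 1 reappears on page 2.
pages = [
    ["http://example.edu/news/a", "http://example.edu/news/b"],
    ["http://example.edu/news/b", "http://example.edu/news/c"],
]
unique, repeats = merge_pages(pages)
print(unique)
print(repeats)
```

Logging `repeats` alongside the query makes it possible to see how often the overlap occurs across runs.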

Our code being used to pull the items off of each page is:

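import requests

# `query` is the full request URL (as in the examples above);
# `num` is the maximum number of links to keep from each page.
response = requests.get(query)
content = response.json()
hrefs = []

try:
    # Collect the lowercased link of each result on this page.
    for item in content['items'][:num]:
        hrefs.append(item['link'].lower())
except Exception as e:
    # Reached, for example, when the response has no 'items' key.
    print(str(e))
    hrefs.append('a')  # placeholder marking a failed page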
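One possible variation on the extraction step, sketched under assumptions: it separates "the request failed" from "the page had no results" instead of appending a placeholder. `extract_links` is a hypothetical helper. It assumes that a failed request returns JSON with a top-level "error" object containing a "message" field, and that a page with no results simply has no "items" key; both should be checked against real responses before relying on them.

```python
def extract_links(content, num=10):
    """Return up to `num` lowercased result links from one decoded response."""
    if "error" in content:
        # Assumption: failures carry an "error" object with a "message" field.
        message = content["error"].get("message", "unknown error")
        raise RuntimeError(f"API error: {message}")
    # Assumption: a page with no results has no "items" key at all.
    items = content.get("items", [])
    return [item["link"].lower() for item in items[:num]]


# Made-up response bodies for illustration.
ok_page = {"items": [{"link": "HTTP://Example.edu/News"}, {"link": "http://b.example"}]}
print(extract_links(ok_page, num=1))
print(extract_links({}))
```

Returning an empty list for an empty page keeps the placeholder value 'a' out of the URL list, so later counting and de-duplication only ever see real links.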

Thank you!

Upvotes: 7

Views: 2032

Answers (1)

katzenklavier

Reputation: 193

You are very unlikely to find an answer to this question that satisfies the criteria you're looking for, I'm afraid.

Google, in order to protect its trade secrets (among other things), is extremely secretive about the internals of its search engine algorithms. What we do know, from official sources, is the following:

  • Google makes extensive use of natural language processing (NLP) and will go to great lengths to try to tease out the intent of your query, even if that means ignoring what you actually searched for;
  • It likes pages that contain the keywords you're searching for, but it also has sophisticated protections against "keyword stuffing," where someone shoves a ton of possible search terms into their page to try to drive extra traffic to themselves;
  • It maintains an internal list of pages it trusts, and if those pages are linking to content (or if those pages are linking to pages which are linking to content, or so on), it ranks that content higher;
  • It scores pages based on a set of usability criteria and dislikes slow pages and pages that aren't optimized for different devices;
  • And, finally, it uses your location and past search history to determine what kind of results to serve you.

The first of these points is probably the most important here. Using the API to search is likely not granting your search any sort of special treatment, which is unusual for API behavior but sort of expected for Google. Google will happily bend its own rules for the sake of user experience, and I strongly suspect that your searches are falling victim to this sort of behavior on their end. Additionally, given the circumstances, it's likely that they've hard-coded special handling for COVID-19 searches directly into the engine's behavior, which might be further complicating things.

I wish I had better news for you, but you're probably just going to have to make whatever weird and inconsistent things the search engine spits back out at you work. The results will almost certainly not be reproducible, and because of the fifth point listed above they may not even be reproducible by yourself later on. I'm sorry.

Upvotes: 4
