jennifer ruurs

Reputation: 324

Elasticsearch keyword search finds nothing

I loop over a folder of PDF files and extract the text of each one. The extracted text (a string) and the file name are stored in a JSON document named "e1", which I then insert into Elasticsearch. The index number is incremented on every iteration of the loop.

I want to get a list of JSON objects based on a keyword search, so that I can see in which of the inserted documents (the "e1" objects) the keyword is present. Instead I get the error "DSL class science does not exist in query", even though the word science appears many, many times in the PDFs!

import PyPDF2

def read_pdf(pdf_file):
    string_file=""
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        string_file+=page_content
    return string_file

import glob
pdf_list=glob.glob('/home/Jen/Mongo/PDF/*.pdf')

from elasticsearch import Elasticsearch
es=Elasticsearch([{'host':'localhost','port':9200}])



count=0
for i in pdf_list:
    count +=1
    print(count)

    stringi = i.replace('/home/Jen/Mongo/PDF/','')
    text=(read_pdf(i))
    lowercase_name=stringi.lower()
    text=text.lower()
    e1={
    "filename":stringi,
    "text":text}
    res = es.index(index=count,doc_type='PDF',id=1,body=e1)

z=input("keyword")# I insert science here
z=z.lower()

from elasticsearch_dsl import Search

s = Search().using(es).query(z)
print(s)

Update: This code does not print anything:

import PyPDF2

def read_pdf(pdf_file):
    string_file=""
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        string_file+=page_content
    return string_file

import glob
pdf_list=glob.glob('/home/Jen/Mongo/PDF/*.pdf')

from elasticsearch import Elasticsearch
es=Elasticsearch([{'host':'localhost','port':9200}])



count=0
for i in pdf_list:
    count +=1
    print(count)

    stringi = i.replace('/home/Jen/Mongo/PDF/','')
    text=(read_pdf(i))
    lowercase_name=stringi.lower()
    text=text.lower()
    e1={
    "filename":stringi,
    "text":text}
    res = es.index(index="my_name",doc_type='PDF',id=count, body=e1)

print("Test")
from elasticsearch_dsl import Search    

s = Search(using=es, index="my_name").query("match", title="science")

response = s.execute()

for hit in response:
    print(response.hits)

Upvotes: 0

Views: 1273

Answers (1)

dejanmarich

Reputation: 1285

With this line of code:

res = es.index(index=count,doc_type='PDF',id=1,body=e1)

you are creating indices 1, 2, ..., N (because count runs from 1 to N), each of type PDF, and every document in each index has _id=1 — so each index ends up holding a single document.

Check the documentation

It should be something like:

res = es.index(index="my_name",doc_type='PDF',id=count, body=e1)

If you do the first part of the data processing correctly, you should have all documents in the my_name index, and each document will have its own _id (from 1 to N).
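As a side note on that first part: the per-file document can be built with os.path.basename instead of hard-coding the directory prefix in str.replace. A minimal sketch — the helper name build_doc is mine, not from the question:

```python
import os

def build_doc(pdf_path, text):
    # Hypothetical helper: take the bare file name portably,
    # then lowercase both fields, as the original loop does.
    return {
        "filename": os.path.basename(pdf_path).lower(),
        "text": text.lower(),
    }

doc = build_doc('/home/Jen/Mongo/PDF/Science.pdf', 'SCIENCE is everywhere')
# doc == {"filename": "science.pdf", "text": "science is everywhere"}
```

Each doc can then be passed as body=doc to es.index(index="my_name", doc_type='PDF', id=count, body=doc), as above.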

Just run GET _cat/indices?v in Kibana and compare what you get with your solution and with these changes.

For the second part of the question, you can search all documents in my_index for "science" with:

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "science"
    }
  }
}

UPDATED Or, equivalently, with a bool query:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "my_field": "science"
        }
      }
    }
  }
}
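Both request bodies above are plain JSON, so from Python they can be built as dicts and passed to the low-level client via es.search(index=..., body=...). A small sketch — the helper names are mine:

```python
def match_query(field, value):
    # Same match query as the first GET example above.
    return {"query": {"match": {field: value}}}

def bool_must_match(field, value):
    # The same clause wrapped in bool/must, as in the second example.
    return {"query": {"bool": {"must": {"match": {field: value}}}}}

# e.g. es.search(index="my_index", body=match_query("my_field", "science"))
print(match_query("my_field", "science"))
```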

UPDATE 2 (Python)

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()

# the field queried ("title" here) must exist in your mapping;
# the documents indexed in the question have "filename" and "text"
s = Search(using=client, index="my_index").query("match", title="science")

response = s.execute()

for hit in response:
    print(hit)  # or specific fields: print(hit.title, hit.meta.id)

Upvotes: 2
