Reputation: 324
I loop over a folder of PDF files. In the for loop I extract the text of each PDF. The text (a string) and the file name are stored together in a dict named "e1", which I then insert into the Elasticsearch database. The index number is incremented on every pass through the loop.
I want to get a list of JSON objects back from a keyword search, so that I can see in which of the inserted objects (the "e1" documents) the keyword is present.
I now get the error "DSL class `science` does not exist in query." even though the word "science" appears many, many times in the PDFs!
import PyPDF2

def read_pdf(pdf_file):
    string_file = ""
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        string_file += page_content
    return string_file

import glob
pdf_list = glob.glob('/home/Jen/Mongo/PDF/*.pdf')

from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

count = 0
for i in pdf_list:
    count += 1
    print(count)
    stringi = i.replace('/home/Jen/Mongo/PDF/', '')
    text = read_pdf(i)
    lowercase_name = stringi.lower()
    text = text.lower()
    e1 = {
        "filename": stringi,
        "text": text}
    res = es.index(index=count, doc_type='PDF', id=1, body=e1)

z = input("keyword")  # I insert science here
z = z.lower()

from elasticsearch_dsl import Search
s = Search().using(es).query(z)
print(s)
Update: This code does not print anything:
import PyPDF2

def read_pdf(pdf_file):
    string_file = ""
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        string_file += page_content
    return string_file

import glob
pdf_list = glob.glob('/home/Jen/Mongo/PDF/*.pdf')

from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

count = 0
for i in pdf_list:
    count += 1
    print(count)
    stringi = i.replace('/home/Jen/Mongo/PDF/', '')
    text = read_pdf(i)
    lowercase_name = stringi.lower()
    text = text.lower()
    e1 = {
        "filename": stringi,
        "text": text}
    res = es.index(index="my_name", doc_type='PDF', id=count, body=e1)

print("Test")

from elasticsearch_dsl import Search
s = Search(using=es, index="my_name").query("match", title="science")
response = s.execute()
for hit in response:
    print(response.hits)
Upvotes: 0
Views: 1273
Reputation: 1285
With this line of code:

res = es.index(index=count, doc_type='PDF', id=1, body=e1)

you are creating indices 1, 2, ..., N (because count runs from 1 to N), each of type PDF, and every document in every index has _id=1.
Check the documentation
It should be something like:

res = es.index(index="my_name", doc_type='PDF', id=count, body=e1)

If you did the first part of the data processing correctly, you will then have all documents in the my_name index, and each document will have its own _id (from 1 to N).
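That corrected loop can be sketched as follows (a minimal illustration; the `index_docs` helper and its signature are my own naming, not part of the original post):

```python
def index_docs(es, docs, index_name="my_name"):
    """Index every document into one index, giving each its own _id.

    `docs` is a list of dicts like {"filename": ..., "text": ...};
    ids run from 1 to N, one per document, as described above.
    """
    for doc_id, doc in enumerate(docs, start=1):
        es.index(index=index_name, doc_type='PDF', id=doc_id, body=doc)
    return len(docs)
```

Passing the client in as a parameter keeps the helper easy to test with a stub, but it is the same es.index call as above, with the index name fixed and the id varying.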
Just run GET _cat/indices?v in Kibana and compare what you get with your solution and with these changes.
For the second part of the question, you can search for "science" (across all documents) in my_index with:
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "science"
    }
  }
}
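The same match query can also be sent from Python with the low-level client the question already uses. A sketch (the index and field names here are placeholders; with the indexing code above they would be "my_name" and "text"):

```python
def build_match_query(field, keyword):
    # Same JSON body as the GET my_index/_search example above
    return {"query": {"match": {field: keyword}}}

# Usage, assuming Elasticsearch is reachable on localhost:9200:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
#   res = es.search(index="my_name", body=build_match_query("text", "science"))
#   for hit in res["hits"]["hits"]:
#       print(hit["_id"], hit["_source"]["filename"])
```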
UPDATE: or, equivalently:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "my_field": "science"
        }
      }
    }
  }
}
UPDATE 2 (Python)
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()
# query a field that was actually indexed: the documents above have
# "filename" and "text", so matching on "title" would find nothing
s = Search(using=client, index="my_index").query("match", text="science")
response = s.execute()
for hit in response:
    print(hit)  # or print(hit.filename, hit.meta.id, ...)
Upvotes: 2