inverted_index

Reputation: 2427

Query on xml file with special case

I have two large files gathered from Stack Overflow, posts.xml and questions.txt, with the following structure:

posts.xml:

<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="322" ViewCount="21888" Body="..."/>
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="140" ViewCount="10912" Body="..." />
  ...
</posts>

A post can be a question or an answer (the file contains both).

questions.txt:

Id,CreationDate,CreationDatesk,Score
123,2008-08-01 16:08:52,20080801,48
126,2008-08-01 16:10:30,20080801,33
...

I want to go through posts.xml only once and index the selected rows (those whose Id appears in questions.txt) with Lucene. Since the XML file is very large (about 50 GB), querying and indexing time matters to me.

Now the question is: how can I find all the rows in posts.xml whose Id also appears in questions.txt?

This is my approach so far:

SAXParserDemo.java:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SAXParserDemo {
    public static void main(String[] args){

        try {
            File inputFile = new File("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Posts.xml");
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            // Handler is the DefaultHandler subclass defined in Handler.java
            Handler userhandler = new Handler();
            saxParser.parse(inputFile, userhandler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Handler.java:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {

    public void getQuestionIds() {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Q.txt"))) {
            String qId;
            while ((qId = br.readLine()) != null) {
                qId = qId.split(",")[0];  // first column is the question id
                findAndIndexOnPost(qId);  // find this id in posts.xml, then index it
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void findAndIndexOnPost(String qID) {

    }

    @Override
    public void startElement(String uri,
                             String localName, String qName, Attributes attributes)
            throws SAXException {
        if (qName.equalsIgnoreCase("row")) {
            System.out.println(attributes.getValue("Id"));
            switch (attributes.getValue("PostTypeId")) {
                case "1":
                    String id = attributes.getValue("Id");
                    break;
                case "2":
                    break;
                default:
                    break;
            }

        }
    }
}

UPDATE:

I need to keep a pointer into the XML file across iterations, but I don't know how to do this with SAX.

Upvotes: 0

Views: 49

Answers (1)

Boris Schegolev

Reputation: 3701

What you have to do is:

  • Read the TXT file (a simple stream will probably do).
  • Add all Id values to a List<Integer> questionIds, one by one. You will have to parse them manually (with a regex or String.indexOf()).
  • In your Handler implementation, simply check whether questionIds.contains(givenId).
  • Send the received object (from the XML) to Elasticsearch with a simple REST request (POST/PUT).

Ta-da! Your data is now indexed with Lucene.
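The first three steps could be sketched roughly as below. This is only an illustration under some assumptions: the names QuestionFilter, readQuestionIds, RowFilter and filter are made up for the example, the sample data in main is invented, and a HashSet is swapped in for the List so that contains() stays fast with many ids. The indexing request itself is left as a placeholder that just collects the matching ids.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original question or answer.
class QuestionFilter {

    /** Reads the first CSV column of every line into a set of ids. */
    static Set<Integer> readQuestionIds(Reader source) throws IOException {
        Set<Integer> ids = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                int comma = line.indexOf(',');
                String first = (comma < 0 ? line : line.substring(0, comma)).trim();
                try {
                    ids.add(Integer.parseInt(first));
                } catch (NumberFormatException e) {
                    // header line ("Id,...") or an empty line: skip it
                }
            }
        }
        return ids;
    }

    /** SAX handler that keeps only <row> elements whose Id is in the set. */
    static class RowFilter extends DefaultHandler {
        private final Set<Integer> questionIds;
        final List<Integer> matched = new ArrayList<>();

        RowFilter(Set<Integer> questionIds) {
            this.questionIds = questionIds;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (!"row".equals(qName)) {
                return;
            }
            String id = attrs.getValue("Id");
            if (id == null) {
                return;
            }
            try {
                int parsed = Integer.parseInt(id);
                if (questionIds.contains(parsed)) {
                    matched.add(parsed);  // placeholder: the indexing call would go here
                }
            } catch (NumberFormatException e) {
                // non-numeric Id: ignore this row
            }
        }
    }

    /** Streams the XML once and returns the ids of the rows that matched. */
    static List<Integer> filter(InputStream xml, Set<Integer> ids) throws Exception {
        RowFilter handler = new RowFilter(ids);
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return handler.matched;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample data; real code would open questions.txt and posts.xml instead.
        Set<Integer> ids = readQuestionIds(new StringReader("Id,CreationDate,CreationDatesk,Score\n4,x,x,1\n"));
        String xml = "<posts><row Id=\"4\" PostTypeId=\"1\"/><row Id=\"6\" PostTypeId=\"1\"/></posts>";
        System.out.println(filter(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), ids));
    }
}
```

The point of the design is that posts.xml is streamed exactly once, and each row costs a single set lookup instead of a new scan per question id.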

Also, change the way you pass data to the SAX parser. Instead of giving it a File, create an InputStream implementation that you can pass to saxParser.parse(inputStream, userhandler). For how to get the position in a stream, see: "Given a Java InputStream, how can I determine the current offset in the stream?".
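One possible shape for such a stream is a thin wrapper that counts the bytes passing through it. The class name PositionTrackingInputStream and its getPosition() method are made up for this sketch. Keep in mind that parsers typically read ahead in blocks, so the count tells you how far the parser has consumed the input, not necessarily the exact offset of the element currently being handled.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: counts how many bytes have been read through it.
class PositionTrackingInputStream extends FilterInputStream {
    private long position;

    PositionTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            position++;  // one byte consumed
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            position += n;  // n bytes consumed; -1 means end of stream
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        position += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false;  // mark/reset would make the counter inaccurate
    }

    long getPosition() {
        return position;
    }

    public static void main(String[] args) throws IOException {
        PositionTrackingInputStream in =
                new PositionTrackingInputStream(new ByteArrayInputStream(new byte[16]));
        in.read(new byte[5]);
        System.out.println(in.getPosition());
    }
}
```

To use it with SAX, wrap a FileInputStream for the XML file in this class, pass the wrapper to saxParser.parse(inputStream, userhandler), and give the handler a reference to the wrapper so it can call getPosition() when it needs the offset.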

Upvotes: 1
