AKIWEB

Reputation: 19622

Extracting Metadata using Apache Tika and Storing into HashMap

I am trying to extract metadata using Apache Tika and then put it into a HashMap, but my code gets only the key, not the value for that key. For example, it stores title (as a key) but not its value, and likewise it stores keywords (as a key) but not its value.
If I print what md contains, it shows this:

Description= title=Wireless Technology & Innovation | Mobile Technology Content-Encoding=UTF-8 Content-Type=text/html; charset=utf-8 Keywords= google-site-verification=AzhlXdqBSdUCRPJRY1evCtp2Ko5r9kxB_f81WffACUc 

    private Map<String, String> metaData;

    try {
        Metadata md = new Metadata();
        htmlStream = new ByteArrayInputStream(htmlContent.getBytes());
        String parsedText = tika.parseToString(htmlStream, md);
        // very unlikely to happen
        if (text == null) {
            text = parsedText.trim();
        }
        processMetaData(md);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(htmlStream);
    }

    private void processMetaData(Metadata md) {
        if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
            setMetaData(new HashMap<String, String>());
        }
        for (String name : md.names()) {
            // I guess the line below is not working: it stores only the key, not the value of that key.
            getMetaData().put(name.toLowerCase(), md.get(name));
        }
    }

    public Map<String, String> getMetaData() {
        return metaData;
    }

    public void setMetaData(Map<String, String> metaData) {
        this.metaData = metaData;
    }

Any help will be appreciated.

Upvotes: 1

Views: 2540

Answers (1)

Gagravarr

Reputation: 48346

First up, Tika allows multiple values for a given key. You're better off thinking of it as a Map<String,List<String>> rather than a simple Map<String,String>.

I'd suggest you look at the Tika Metadata JavaDocs. You'll either want to check the isMultiValued(String key) method for each key, or just call getValues(String key) every time.
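As a rough sketch of the "just call getValues every time" approach, something like the following could collect every value for every key. The class name MetadataCollector and the collect helper are made up for illustration. The lookup is passed in as a function so the snippet compiles without Tika on the classpath; with a real Metadata object md, I'd expect the call to be collect(md.names(), md::getValues).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Tika.
class MetadataCollector {

    // Collects all values for each name into a Map<String, List<String>>.
    // Assumed Tika usage: collect(md.names(), md::getValues)
    static Map<String, List<String>> collect(String[] names,
                                             Function<String, String[]> valuesOf) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : names) {
            // Lower-case the key, as the question's code does.
            String key = name.toLowerCase(Locale.ROOT);
            // Names that differ only in case end up in the same list.
            result.computeIfAbsent(key, k -> new ArrayList<>())
                  .addAll(Arrays.asList(valuesOf.apply(name)));
        }
        return result;
    }
}
```

Keeping a list per key means nothing is dropped when a key carries more than one value, and an empty string stored as a value still shows up as an entry in the list.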

To get the first value for a given key, metadata.get(String key) is the right way to go. I'm not sure why it isn't working for you.

You probably want to play with the Tika App jar, as that's the best way to debug things, e.g.:

    java -jar tika-app-1.0-SNAPSHOT.jar --metadata problem.file

That'll let you easily see the metadata your file really contains. Once you know that, you can track down where in your code you're doing something wrong.
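If you'd rather inspect things from inside your own code, a small dump helper along these lines might help. It quotes each value so that empty strings are visible. That could matter here, because in the output you posted Description and Keywords have nothing after the equals sign. MetadataDump and dump are hypothetical names, and the lookup is again passed in as a function; with Tika I'd assume the call is dump(md.names(), md::getValues).

```java
import java.util.function.Function;

// Hypothetical debugging helper, not part of Tika.
class MetadataDump {

    // Builds one line per key: the name, the number of values, then each value in quotes.
    // Assumed Tika usage: System.out.print(dump(md.names(), md::getValues));
    static String dump(String[] names, Function<String, String[]> valuesOf) {
        StringBuilder sb = new StringBuilder();
        for (String name : names) {
            String[] values = valuesOf.apply(name);
            sb.append(name).append(" (").append(values.length).append("):");
            for (String value : values) {
                // Quoting makes an empty value show up as "" instead of nothing.
                sb.append(" \"").append(value).append('"');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

If a key prints with an empty pair of quotes, the key is present but its value is an empty string, which would look like "only the key is stored" once it lands in the HashMap.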

Upvotes: 1
