smatthewenglish

Reputation: 2889

dynamically populate hashmap with human language dictionary for text analysis

I'm writing a program that takes as input a text in a human language and determines which language it's written in.

My idea is to store the dictionaries in hashmaps, with each word as a key and a boolean as its value.

If the input document contains that word, I will flip the bool to true.

Right now I'm trying to think of a good way to read in these dictionaries and put them into the hashmaps. The way I'm doing it now is very naive and looks clunky. Is there a better way to populate these hashmaps?

Moreover, these dictionaries are huge, so maybe populating them all in succession like this isn't the best approach.

I was thinking it might be better to consider one dictionary at a time: compute a score for how many words of the input text registered with that dictionary, save it, and then process the next dictionary. That would save on RAM, wouldn't it? Is that a good solution?
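What I have in mind is roughly the following (just a sketch; the method name and the whitespace tokenisation are made up):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.*;

// Sketch: score the input against one dictionary at a time, so that only
// one word set is held in memory at any moment.
static Map<String, Integer> scoreAgainstEach(List<File> dictionaryFiles, String inputText) throws IOException
{
    Map<String, Integer> scores = new HashMap<String, Integer>();
    String[] tokens = inputText.toLowerCase().split("\\s+"); // naive tokenisation
    for (File dictFile : dictionaryFiles)
    {
        Set<String> words = new HashSet<String>(Files.readAllLines(dictFile.toPath(), StandardCharsets.UTF_8));
        int score = 0;
        for (String token : tokens)
        {
            if (words.contains(token)) score++;
        }
        scores.put(dictFile.getName(), score);
        // 'words' becomes eligible for garbage collection before the next dictionary loads
    }
    return scores;
}

At the end, the language whose dictionary produced the highest score would be the guess, and at no point is more than one dictionary resident in memory.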

The code so far looks like this:

static HashMap<String, Boolean>  de_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean>  fr_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean>  ru_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> eng_map = new HashMap<String, Boolean>();

public static void main(String[] args) throws IOException
{
    ArrayList<File> sub_dirs = new ArrayList<File>();

    final String filePath = "/home/matthias/Desktop/language_detective/word_lists_2";

    // recursively collect the word-list files (helper defined elsewhere)
    listf( filePath, sub_dirs );

    for (File dir : sub_dirs)
    {
        // lower-cased once here, so it needn't be lower-cased again below
        String word_holding_directory_path = dir.toString().toLowerCase();

        BufferedReader br = new BufferedReader(new FileReader( dir ));
        String line = null;
        while ((line = br.readLine()) != null)
        {
            // route each word into the map for its language, based on the directory path
            if (word_holding_directory_path.contains("/de/"))
            {
                de_map.put(line, false);
            }
            if (word_holding_directory_path.contains("/ru/"))
            {
                ru_map.put(line, false);
            }
            if (word_holding_directory_path.contains("/fr/"))
            {
                fr_map.put(line, false);
            }
            if (word_holding_directory_path.contains("/eng/"))
            {
                eng_map.put(line, false);
            }
        }
        br.close();
    }
}

So I'm looking for advice on how I might populate them one at a time, an opinion on whether that's a good methodology, or suggestions about better methodologies for achieving this aim.
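For comparison, would something along these lines be a cleaner shape for the population step? (Just a sketch, assuming word_lists_2 contains one sub-directory per language, e.g. de, fr, ru, eng.)

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Sketch: one map keyed by language code, instead of four separately named maps.
static Map<String, Set<String>> loadDictionaries(Path root) throws IOException
{
    Map<String, Set<String>> dictionaries = new HashMap<String, Set<String>>();
    try (DirectoryStream<Path> langDirs = Files.newDirectoryStream(root))
    {
        for (Path langDir : langDirs)
        {
            if (!Files.isDirectory(langDir)) continue;
            String lang = langDir.getFileName().toString().toLowerCase(); // "de", "fr", ...
            Set<String> words = new HashSet<String>();
            try (DirectoryStream<Path> wordFiles = Files.newDirectoryStream(langDir))
            {
                for (Path wordFile : wordFiles)
                {
                    words.addAll(Files.readAllLines(wordFile, StandardCharsets.UTF_8)); // one word per line
                }
            }
            dictionaries.put(lang, words);
        }
    }
    return dictionaries;
}

That way adding a new language means adding a directory, not another field and another if-block.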

The full program can be found here on my GitHub page.


Upvotes: 0

Views: 306

Answers (1)

Nikita Astrakhantsev

Reputation: 4749

The task of language identification is well researched and there are a lot of good libraries. For Java, try TIKA, Language Detection Library for Java (they report "99% over precision for 53 languages"), TextCat, or LingPipe. I'd suggest starting with the first; it seems to have the most detailed tutorial.
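With Tika, for example, the whole task reduces to a couple of lines (a sketch against the Tika 1.x LanguageIdentifier API; see the tutorial for the exact setup):

import org.apache.tika.language.LanguageIdentifier;

public class TikaExample
{
    public static void main(String[] args)
    {
        // Tika compares character n-gram statistics of the input
        // against its bundled per-language profiles.
        LanguageIdentifier identifier = new LanguageIdentifier("Das ist ein kurzer deutscher Satz.");
        System.out.println(identifier.getLanguage());         // prints an ISO 639 code, e.g. "de"
        System.out.println(identifier.isReasonablyCertain()); // rough confidence flag
    }
}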

If your task is too specific for existing libraries (although I doubt this is the case), refer to this survey paper and adapt the closest techniques.

If you do want to reinvent the wheel, e.g. for self-learning purposes, note that language identification can be treated as a special case of text classification; this basic tutorial for text classification is a good starting point.
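To give a flavour of that route: the standard features for this task are character n-grams rather than whole-word dictionaries, since they need no tokenisation and work for unseen words. A toy sketch of trigram counting and overlap scoring (purely illustrative, not taken from any of the libraries above):

import java.util.HashMap;
import java.util.Map;

public class TrigramSketch
{
    // Count character trigrams in a string.
    static Map<String, Integer> trigrams(String text)
    {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++)
        {
            String gram = s.substring(i, i + 3);
            Integer old = counts.get(gram);
            counts.put(gram, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // A crude score: how many of the document's trigrams appear in the
    // language profile. A real classifier would weight by frequency.
    static int overlap(Map<String, Integer> doc, Map<String, Integer> profile)
    {
        int shared = 0;
        for (String gram : doc.keySet())
        {
            if (profile.containsKey(gram)) shared++;
        }
        return shared;
    }
}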

Upvotes: 1
