Reputation: 2889
I'm writing a software project that takes a text in a human language as input and determines what language it's written in.
My idea is to store dictionaries in hashmaps, with the word as the key and a boolean as the value.
If the document contains that word, I will flip the boolean to true.
Right now I'm trying to think of a good way to read in these dictionaries and put them into the hashmaps. The way I'm doing it now is very naive and looks clunky. Is there a better way to populate these hashmaps?
Moreover, these dictionaries are huge, so maybe populating them all in succession like this isn't the best approach.
I was thinking it might be better to consider one dictionary at a time: create a score for how many words of the input text registered with that dictionary, save the score, and then process the next dictionary. That would save on RAM, wouldn't it? Is that a good solution?
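In sketch form, what I have in mind is something like this (dictionaryFiles and inputTokens are placeholders for things I'd still have to build):

HashMap<String, Integer> scores = new HashMap<String, Integer>();
for (File dictionary : dictionaryFiles) // one word list per language
{
    // load just this one dictionary into memory
    HashSet<String> words = new HashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader(dictionary));
    String line;
    while ((line = br.readLine()) != null)
    {
        words.add(line.trim().toLowerCase());
    }
    br.close();

    // score the input text against this dictionary
    int score = 0;
    for (String token : inputTokens)
    {
        if (words.contains(token))
        {
            score++;
        }
    }
    scores.put(dictionary.getName(), score);
    // the word set goes out of scope here, so only one dictionary is in RAM at a time
}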
The code so far looks like this:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

static HashMap<String, Boolean> de_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> fr_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> ru_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> eng_map = new HashMap<String, Boolean>();

public static void main(String[] args) throws IOException
{
    ArrayList<File> sub_dirs = new ArrayList<File>();
    final String filePath = "/home/matthias/Desktop/language_detective/word_lists_2";
    listf(filePath, sub_dirs); // helper from the full programme; collects the word-list files

    for (File dir : sub_dirs)
    {
        // the path is lowercased once here, so the checks below don't repeat it
        String word_holding_directory_path = dir.toString().toLowerCase();
        BufferedReader br = new BufferedReader(new FileReader(dir));
        String line = null;
        while ((line = br.readLine()) != null)
        {
            if (word_holding_directory_path.contains("/de/"))
            {
                de_map.put(line, false);
            }
            else if (word_holding_directory_path.contains("/ru/"))
            {
                ru_map.put(line, false);
            }
            else if (word_holding_directory_path.contains("/fr/"))
            {
                fr_map.put(line, false);
            }
            else if (word_holding_directory_path.contains("/eng/"))
            {
                eng_map.put(line, false);
            }
        }
        br.close();
    }
}
So I'm looking for advice on how I might populate them one at a time, an opinion on whether that's a good methodology, and suggestions for possibly better methodologies for achieving this aim.
The full programme can be found here on my GitHub page.
Upvotes: 0
Views: 306
Reputation: 4749
The task of language identification is well researched, and there are a lot of good libraries. For Java, try TIKA, the Language Detection Library for Java (they report "99% over precision for 53 languages"), TextCat, or LingPipe. I'd suggest starting with the first, since it seems to have the most detailed tutorial.
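For illustration, a minimal sketch of what detection looks like with Tika 1.x's LanguageIdentifier (you need the tika-core dependency; the API has moved around in later versions):

import org.apache.tika.language.LanguageIdentifier;

public class Detect
{
    public static void main(String[] args)
    {
        // build a language profile from the input text and report the best match
        LanguageIdentifier identifier =
                new LanguageIdentifier("Das ist ein kurzer deutscher Satz.");
        System.out.println(identifier.getLanguage());          // prints "de"
        System.out.println(identifier.isReasonablyCertain());  // confidence flag
    }
}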
If your task is too specific for existing libraries (although I doubt this is the case), refer to this survey paper and adapt the closest techniques.
If you do want to reinvent the wheel, e.g. for self-learning purposes, note that language identification can be treated as a special case of text classification, and read this basic tutorial on text classification.
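To make that connection concrete, here is a toy sketch of the classification view: treat each language as a class, give each a unigram word model, and pick the class that assigns the input the highest log-probability (the per-language word counts are assumed to be built beforehand; uniform class prior):

import java.util.List;
import java.util.Map;

// wordCounts maps language -> (word -> frequency in that language's corpus)
static String classify(List<String> tokens,
                       Map<String, Map<String, Integer>> wordCounts)
{
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Map<String, Integer>> language : wordCounts.entrySet())
    {
        Map<String, Integer> counts = language.getValue();
        double total = 0;
        for (int c : counts.values())
        {
            total += c;
        }
        double score = 0;
        for (String token : tokens)
        {
            Integer c = counts.get(token);
            // add-one smoothing so unseen words don't send the score to -infinity
            score += Math.log(((c == null ? 0 : c) + 1.0) / (total + counts.size()));
        }
        if (score > bestScore)
        {
            bestScore = score;
            best = language.getKey();
        }
    }
    return best;
}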
Upvotes: 1