Reputation: 10105
Any idea how to detect source code (Java, C#, SQL and so on) in a text file using Java, without looking at the file extension or writing an extraordinarily long, home-made regular expression?
Maybe there are some tools doing this work already?
Upvotes: 6
Views: 1081
Reputation: 340883
Linguist
We use this library at GitHub to detect blob languages, highlight code, ignore binary files, suppress generated files in diffs and generate language breakdown graphs.
Unfortunately it is written in Ruby; maybe JRuby can run it?
Upvotes: 3
Reputation: 1555
No. Without a syntax analyzer (which is pretty much the complex variant of a regexp), there is no way to tell a source code file from a regular text file. The difference between source code and text can be as small as a one-letter typo, if you think about it.
Upvotes: 1
Reputation: 28981
There is an old library, jmimemagic (http://sourceforge.net/projects/jmimemagic/). Try it; I hope it gives satisfactory results.
Upvotes: 1
Reputation: 76817
You should pick a minimal set of keywords and define some logical rules on top of them. With the right rules, the resulting regular expression will not be extraordinarily big. Note that the fewer keywords and rules you have, the higher the probability of a mistake (SourceCode = true for a file that is not source code, or SourceCode = false for a file that is). Conversely, the more keywords and rules you have, the more time it takes to check whether a file is source code.
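As a rough illustration of the keyword-and-rules idea, here is a minimal sketch in Java. Everything in it is an assumption made for the example: the class name `SourceGuesser`, the three patterns per language, and the `MIN_HITS` threshold are hypothetical and would need tuning against real files. The rule is simply "a language wins if at least `MIN_HITS` of its patterns match, and it has more matches than any other language".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class SourceGuesser {
    // Hypothetical rule set: a few small patterns per language instead of one huge regex.
    private static final Map<String, List<Pattern>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("java", Arrays.asList(
            Pattern.compile("\\bimport\\s+java[.\\w]*;"),
            Pattern.compile("\\bpublic\\s+(final\\s+)?class\\s+\\w+"),
            Pattern.compile("\\bSystem\\.out\\.print")));
        RULES.put("csharp", Arrays.asList(
            Pattern.compile("\\busing\\s+System[.\\w]*;"),
            Pattern.compile("\\bnamespace\\s+\\w+"),
            Pattern.compile("\\bConsole\\.Write")));
        RULES.put("sql", Arrays.asList(
            Pattern.compile("(?i)\\bselect\\b[\\s\\S]+?\\bfrom\\b"),
            Pattern.compile("(?i)\\binsert\\s+into\\b"),
            Pattern.compile("(?i)\\bcreate\\s+table\\b")));
    }

    // Assumed threshold: a single hit is too weak, since plain prose can contain "select ... from".
    static final int MIN_HITS = 2;

    /** Returns "java", "csharp", "sql", or "text" when no language reaches the threshold. */
    public static String guess(String text) {
        String best = "text";
        int bestHits = 0;
        for (Map.Entry<String, List<Pattern>> rule : RULES.entrySet()) {
            int hits = 0;
            for (Pattern p : rule.getValue()) {
                if (p.matcher(text).find()) {
                    hits++;
                }
            }
            // Only accept a language that passes the threshold and beats the current best.
            if (hits >= MIN_HITS && hits > bestHits) {
                best = rule.getKey();
                bestHits = hits;
            }
        }
        return best;
    }
}
```

To classify a file, read it into a string first, for example with `new String(Files.readAllBytes(path), StandardCharsets.UTF_8)`, and pass the result to `SourceGuesser.guess`. The trade-off described above shows up directly: lowering `MIN_HITS` to 1 would classify the sentence "Please select one item from the menu." as SQL, while adding more patterns per language reduces such mistakes at the cost of more matching work per file.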
Upvotes: 1