Reputation: 203
I am trying to extract data from a PDF and split it into categories. I can extract the text and assign it to categories based on font size. For example, say there are three categories: country, capital and city. I can put all countries, capitals and cities into their respective categories, but I cannot map which capital belongs to which city and country, or which country belongs to which city and capital. *It is reading data randomly. How can I read data from bottom to top without breaking the sequence, so I can put the first word in the first category, the second into the second, and so on?*
Or does anyone know a more efficient way, so that I can put the text into the respective categories and also map it?
I am using Java, and here is my code:
import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class readPdfText {
    public static void main(String[] args) {
        String src = "pdffile.pdf";
        try {
            PdfReader reader = new PdfReader(src);
            SemTextExtractionStrategy smt = new SemTextExtractionStrategy();
            // Feed every page through the same strategy instance
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                PdfTextExtractor.getTextFromPage(reader, i, smt);
            }
            reader.close();
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}
SemTextExtractionStrategy class:
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class SemTextExtractionStrategy implements TextExtractionStrategy {
    private String text;
    StringBuffer str = new StringBuffer();
    StringBuffer item = new StringBuffer();
    StringBuffer cat = new StringBuffer();
    StringBuffer desc = new StringBuffer();
    float temp = 0;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();
        Vector curBaseline = renderInfo.getBaseline().getStartPoint();
        Vector topRight = renderInfo.getAscentLine().getEndPoint();
        Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1),
                topRight.get(0), topRight.get(1));
        float curFontSize = rect.getHeight();
        compare(text, curFontSize);
    }

    private void add(String text2, float curFontSize) {
        str.append(text2);
        System.out.println("str: " + str);
    }

    public void compare(String text2, float curFontSize) {
        // text2.getFont().getBaseFont().Contains("bold");
        // temp = curFontSize;
        boolean flag = check(text);
        if (temp == curFontSize) {
            // Same size as the previous fragment: keep accumulating
            str.append(text);
            /*
             * if (curFontSize == 11.222168){ item.append(str);
             * System.out.println(item); }else if (curFontSize == 10.420532){
             * desc.append(str); }
             */
            // str.append(text);
        } else {
            // Size changed: classify what was accumulated so far
            if (temp > 9.8 && temp < 10) {
                String country = str.toString();
                System.out.println("country: " + country);
            } else if (temp > 8 && temp < 9) {
                String itemPrice = str.toString();
                System.out.println("itemPrice: " + itemPrice);
            } else if (temp > 7 && temp < 7.2) {
                String capital = str.toString();
                System.out.println("capital: " + capital);
            } else if (temp > 7.2 && temp < 8) {
                String city = str.toString();
                System.out.println("city: " + city);
            } else {
                System.out.println("size: " + temp + " " + "str: " + str);
            }
            temp = curFontSize;
            // System.out.println(temp);
            str.delete(0, str.length());
            str.append(text);
        }
    }

    private boolean check(String text2) {
        return true;
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}
Upvotes: 1
Views: 1229
Reputation: 95928
It is reading data randomly. How can I read data from bottom to top without breaking the sequence, so I can put the first word in the first category, the second into the second, and so on?
No, not randomly but instead in the order of the corresponding drawing operations in the content stream.
Your TextExtractionStrategy implementation SemTextExtractionStrategy simply uses the text in the order in which it is forwarded to it, which is the order in which it is drawn. The order of the drawing operations does not need to be the reading order, though, as each drawing operation may start at a custom position on the page; for example, if multiple fonts are used on one page, the text may be drawn grouped by font.
If you want to analyze the text from such a document, you first have to collect and sort the text fragments you get, and only when all text from the page is parsed, you can start analyzing it.
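The collect-then-sort idea can be sketched without any iText dependency. The following is only a minimal illustration: Fragment and FragmentCollector are hypothetical names, and the coordinates are assumed to be the start point of each fragment's base line as reported to renderText. It rounds the vertical position to group fragments into lines, which is a simplification that only suits unrotated text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one text fragment reported to renderText().
final class Fragment {
    final String text;
    final float x; // horizontal start of the base line
    final float y; // vertical start of the base line (PDF y grows upward)

    Fragment(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class FragmentCollector {
    private final List<Fragment> fragments = new ArrayList<>();

    // Called once per fragment, in content-stream (drawing) order.
    void collect(String text, float x, float y) {
        fragments.add(new Fragment(text, x, y));
    }

    // Called only after the whole page has been parsed.
    List<String> inReadingOrder() {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingInt((Fragment f) -> -Math.round(f.y)) // top of the page first
                .thenComparingDouble(f -> f.x));                // then left to right
        List<String> result = new ArrayList<>();
        for (Fragment f : sorted) {
            result.add(f.text);
        }
        return result;
    }
}
```

In a real strategy, collect would be called from renderText and inReadingOrder from getResultantText or after the page has been processed.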
The LocationTextExtractionStrategy (included in the iText distribution) can be taken as an example of a strategy doing just that. It uses its inner class TextChunk for collecting the fragments, though, and this class does not carry the text ascent information you use in your code.
A SemLocationTextExtractionStrategy, therefore, would have to use an extended TextChunk class to also keep that information (or some information derived from it, e.g. a text category).
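As a rough illustration of such an extended chunk, the sketch below stores a category derived from the measured ascent and then maps sorted chunks to records. CategorizedChunk, Category and RecordBuilder are hypothetical names, not iText classes. The thresholds are taken from the compare method in the question. The sketch also assumes that, in reading order, each country precedes its own capital and cities, which may not hold for every PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical category derived from the ascent measured in the question's code.
enum Category { COUNTRY, CAPITAL, CITY, OTHER }

// Hypothetical extended chunk: text and position plus the derived category.
final class CategorizedChunk {
    final String text;
    final float x;
    final float y;
    final Category category;

    CategorizedChunk(String text, float x, float y, float ascent) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.category = categorize(ascent);
    }

    // Thresholds copied from the compare() method in the question.
    static Category categorize(float ascent) {
        if (ascent > 9.8f && ascent < 10f) return Category.COUNTRY;
        if (ascent > 7f && ascent < 7.2f) return Category.CAPITAL;
        if (ascent > 7.2f && ascent < 8f) return Category.CITY;
        return Category.OTHER;
    }
}

final class RecordBuilder {
    // Assumes the chunks are already in reading order and that every
    // country chunk precedes the capital and city chunks belonging to it.
    static List<String> build(List<CategorizedChunk> sorted) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (CategorizedChunk c : sorted) {
            if (c.category == Category.COUNTRY) {
                // A country starts a new record; flush the previous one.
                if (current != null) records.add(current.toString());
                current = new StringBuilder("country=" + c.text);
            } else if (current != null && c.category != Category.OTHER) {
                current.append(", ")
                       .append(c.category.name().toLowerCase(Locale.ROOT))
                       .append('=')
                       .append(c.text);
            }
        }
        if (current != null) records.add(current.toString());
        return records;
    }
}
```

Storing the category rather than the raw ascent keeps the later mapping step independent of the exact size thresholds.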
Furthermore, the LocationTextExtractionStrategy only sorts top to bottom, left to right. If your PDF has a different design, e.g. if it is multi-columnar, either your sorting has to be adapted or you have to use filters and analyze the page column by column.
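For a two-column page, adapting the sorting could look roughly like the following library-independent sketch. PlacedText and ColumnReader are hypothetical names, and columnBoundary is an assumed x coordinate that separates the columns; it would have to be determined for the actual PDF. With iText itself, the filter-based variant would presumably combine FilteredTextRenderListener with a RegionTextRenderFilter per column instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical positioned text fragment.
final class PlacedText {
    final String text;
    final float x;
    final float y;

    PlacedText(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ColumnReader {
    // Top to bottom (PDF y grows upward), then left to right.
    private static final Comparator<PlacedText> READING_ORDER = Comparator
            .comparingInt((PlacedText t) -> -Math.round(t.y))
            .thenComparingDouble(t -> t.x);

    // columnBoundary is an assumed x coordinate between the two columns.
    static List<String> read(List<PlacedText> all, float columnBoundary) {
        List<PlacedText> left = new ArrayList<>();
        List<PlacedText> right = new ArrayList<>();
        for (PlacedText t : all) {
            if (t.x < columnBoundary) {
                left.add(t);
            } else {
                right.add(t);
            }
        }
        left.sort(READING_ORDER);
        right.sort(READING_ORDER);

        // Read the whole left column first, then the right one.
        List<String> result = new ArrayList<>();
        for (PlacedText t : left) result.add(t.text);
        for (PlacedText t : right) result.add(t.text);
        return result;
    }
}
```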
BTW, your code to determine the font size
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1),
topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
does not return the actual font size but only the ascent above the base line. And even that only for unrotated text; as soon as rotation is part of the game, your code only returns the height of the rectangle enveloping the line from the start of the base line to the end of the ascent line. The length of the line from base line start to ascent line start would at least be independent from rotation.
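The difference between the two measurements can be shown with plain coordinates. AscentMath is a hypothetical helper, not part of iText. In iText 5 the rotation-independent value would presumably be obtained as renderInfo.getAscentLine().getStartPoint().subtract(renderInfo.getBaseline().getStartPoint()).length(), but treat that as an assumption to verify against your iText version.

```java
final class AscentMath {
    // Height of the axis-aligned rectangle spanned by the base line start
    // and the ascent line end, as computed in the question (top minus bottom).
    static double boundingBoxHeight(double baseStartY, double ascentEndY) {
        return ascentEndY - baseStartY;
    }

    // Length of the line from base line start to ascent line start.
    // This does not change when the text is rotated.
    static double ascent(double baseStartX, double baseStartY,
                         double ascentStartX, double ascentStartY) {
        return Math.hypot(ascentStartX - baseStartX, ascentStartY - baseStartY);
    }

    public static void main(String[] args) {
        // Unrotated text, 100 wide, ascent 10:
        // base line starts at (0,0), ascent line runs from (0,10) to (100,10).
        System.out.println(boundingBoxHeight(0, 10)); // rectangle height
        System.out.println(ascent(0, 0, 0, 10));      // ascent

        // Same text rotated by 90 degrees counterclockwise:
        // ascent line now runs from (-10,0) to (-10,100).
        System.out.println(boundingBoxHeight(0, 100)); // rectangle height grows with the text width
        System.out.println(ascent(0, 0, -10, 0));      // ascent is unchanged
    }
}
```

For the unrotated case both methods agree; for the rotated case only the second one still reports the ascent.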
Or does anyone know a more efficient way?
Your task seems to depend very much on the PDF you are trying to extract information from. Without that PDF, therefore, tips for more efficient ways will remain vague.
Upvotes: 1