Reputation: 347
I can currently extract all the data from the PDF in question, including the coordinates of every character (e.g. I know that character 'A' is at position (x,y) on the page).
Each character is stored as an object in a list. However, while removing unnecessary data I am stuck on one portion that I still need to remove but don't know how to.
The PDF I am extracting from is an exam question paper (before you ask, it is for college, so I have been given permission to use the data). Some questions contain images. The images themselves aren't an issue, but the text drawn on top of them (for instance, the labels on the axes of a graph) is extracted as text, and I do not want it. The extracted text currently looks like this:
1
(a) Blah Blah Blah. [1]
(b) Blah Blah Blah.
answer 1 answer 2 answer 3 answer 4 answer 5 [1]
(c) Blah Blah Blah.
282420161284002468 y x Fig. 1.1
Useful Information...
(i) Blah Blah Blah. [1]
Any advice on how to remove the data '282420161284002468 y x Fig. 1.1' from the list would be greatly appreciated.
Upvotes: 0
Views: 595
Reputation: 95898
This is a partial solution: it removes all figure content except the figure title.
In the sample document, the figures (excluding their titles), and only the figures, are marked content with the tag EmbeddedDocument. To remove them from the extracted text, it therefore suffices to ignore all text marked that way.
One can implement that either as a RenderFilter or by customizing the text extraction strategy. The OP's question seems to indicate that they use a custom text extraction strategy anyway, so here is an example of the latter option:
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class TagFilteringExtractionStrategy : LocationTextExtractionStrategy
{
    // These members are not public, so they are read via reflection.
    FieldInfo MarkedContentInfosField = typeof(TextRenderInfo).GetField("markedContentInfos", BindingFlags.NonPublic | BindingFlags.Instance);
    FieldInfo MarkedContentInfoTagField = typeof(MarkedContentInfo).GetField("tag", BindingFlags.NonPublic | BindingFlags.Instance);
    PdfName EMBEDDED_DOCUMENT = new PdfName("EmbeddedDocument");

    public override void RenderText(TextRenderInfo renderInfo)
    {
        IList<MarkedContentInfo> markedContentInfos = (IList<MarkedContentInfo>)MarkedContentInfosField.GetValue(renderInfo);
        if (markedContentInfos != null && markedContentInfos.Count > 0)
        {
            foreach (MarkedContentInfo info in markedContentInfos)
            {
                // Drop any text chunk nested inside an EmbeddedDocument section.
                if (EMBEDDED_DOCUMENT.Equals(MarkedContentInfoTagField.GetValue(info)))
                    return;
            }
        }
        // Everything else is handled by the normal location-based extraction.
        base.RenderText(renderInfo);
    }
}
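For comparison, the RenderFilter route mentioned above might look roughly like the following. This is only a sketch: the names EmbeddedDocumentTextFilter and FilteredExtraction are made up for illustration, and it assumes the same iTextSharp 5.x private field names ("markedContentInfos" and "tag") that the strategy above relies on, which may differ in other versions.

```csharp
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical filter: rejects text chunks inside an EmbeddedDocument marked-content section.
class EmbeddedDocumentTextFilter : RenderFilter
{
    // Assumption: same private field names as used by the extraction strategy above.
    static readonly FieldInfo MarkedContentInfosField = typeof(TextRenderInfo).GetField("markedContentInfos", BindingFlags.NonPublic | BindingFlags.Instance);
    static readonly FieldInfo TagField = typeof(MarkedContentInfo).GetField("tag", BindingFlags.NonPublic | BindingFlags.Instance);
    static readonly PdfName EmbeddedDocument = new PdfName("EmbeddedDocument");

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        var infos = (IList<MarkedContentInfo>)MarkedContentInfosField.GetValue(renderInfo);
        if (infos == null)
            return true;
        foreach (MarkedContentInfo info in infos)
        {
            if (EmbeddedDocument.Equals(TagField.GetValue(info)))
                return false; // text belongs to a figure, so filter it out
        }
        return true;
    }
}

// Hypothetical helper showing how the filter can wrap an unmodified strategy.
static class FilteredExtraction
{
    public static string GetPageText(PdfReader reader, int page)
    {
        ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), new EmbeddedDocumentTextFilter());
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }
}
```

The filter variant keeps the tag check separate from the extraction strategy, so it can be combined with any existing ITextExtractionStrategy without subclassing it.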
Applying the TagFilteringExtractionStrategy to the sample document like this
using (PdfReader reader = new PdfReader(filename))
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        String text = PdfTextExtractor.GetTextFromPage(reader, page, new TagFilteringExtractionStrategy());
        Console.Write("\n=======\nPage {0}\n=======\n{1}\n", page, text);
    }
}
one gets the following output for the example section:
1 (a) Define a vector quantity.
...................................................................................................................................................
............................................................................................................................................. [1]
(b) Circle all the vector quantities in the list below.
acceleration speed time displacement weight [1]
(c) Fig. 1.1 shows graphs of velocity v against time t for two cars A and B travelling along a
straight level road in the same direction.
Fig. 1.1
At time t = 0, both cars are side-by-side.
(i) Describe the motion of car A from t = 0 to t = 10 s.
...........................................................................................................................................
...........................................................................................................................................
..................................................................................................................................... [2]
Thus, the only remaining part of the figure is the title "Fig. 1.1".
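If the remaining title should go as well, one possible follow-up is to post-process the extracted page text and drop lines that consist of nothing but a caption. The sketch below is not part of the tag-based approach: the class FigureCaptionFilter and its regular expression are hypothetical, and it assumes captions always stand alone on a line in the form "Fig. 1.1". Other documents may need a different pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: removes lines that contain only a figure caption.
static class FigureCaptionFilter
{
    // Assumption: a caption is alone on its line and looks like "Fig. 1.1".
    static readonly Regex CaptionLine = new Regex(@"^\s*Fig\.\s*\d+(\.\d+)*\s*$");

    public static string RemoveCaptionLines(string pageText)
    {
        // Keep every line that is not a bare caption; references inside a
        // sentence such as "(c) Fig. 1.1 shows graphs ..." are left alone.
        var kept = pageText.Split('\n').Where(line => !CaptionLine.IsMatch(line));
        return string.Join("\n", kept);
    }
}
```

It would be applied to the string returned by PdfTextExtractor.GetTextFromPage before printing. Because the whole line has to match, in-text references to the figure survive.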
Upvotes: 1