cookies

Reputation: 347

Remove Specific Data from PDF

I can currently extract all the data from the PDF in question, including the relevant data and coordinates of each character (e.g. I know character 'A' has the coordinates (x,y) relative to the PDF).

Each character is stored as an object in a list. However, when removing unnecessary data I am stuck with a portion that I still need to remove but don't quite know how.

For example, the PDF I am currently extracting from is an exam question paper (before you ask, it is for college, so I have been given permission to use the data...). However, certain questions contain images. The images themselves aren't an issue, but the text on top of them (for instance the labels on the axes of a graph) is extracted as text, and I do not want it...

Example data input:

[image]

Once my initial cleanups are run, the output list of data is:

1 (a) Blah Blah Blah. [1] (b) Blah Blah Blah.answer 1 answer 2 answer 3 answer 4 answer 5 [1] (c) Blah Blah Blah.282420161284002468 y x Fig. 1.1 Useful Information... (i) Blah Blah Blah. [1]

(Typed out to be easier to read, that is):

1
(a) Blah Blah Blah. [1]
(b) Blah Blah Blah.
    answer 1 answer 2 answer 3 answer 4 answer 5 [1]
(c) Blah Blah Blah.
    282420161284002468 y x Fig. 1.1
    Useful Information...
(i) Blah Blah Blah. [1]

Any advice on how to remove the data '282420161284002468 y x Fig. 1.1' from the list would be greatly appreciated.

Upvotes: 0

Views: 595

Answers (1)

mkl

Reputation: 95898

This is a partial solution: it removes everything except the figure title.

In the sample document, the figures (excluding the figure titles), and only the figures, are marked content with the tag EmbeddedDocument. To remove them from the extracted text, it therefore suffices to ignore all text marked that way.

One can implement that either as a RenderFilter or by customizing the text extraction strategy. The OP's question seems to indicate that he uses a custom text extraction strategy anyway, so here is an example of the latter option:

using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class TagFilteringExtractionStrategy : LocationTextExtractionStrategy
{
    // Both fields are non-public, so they are read via reflection.
    static readonly FieldInfo MarkedContentInfosField = typeof(TextRenderInfo).GetField("markedContentInfos", BindingFlags.NonPublic | BindingFlags.Instance);
    static readonly FieldInfo MarkedContentInfoTagField = typeof(MarkedContentInfo).GetField("tag", BindingFlags.NonPublic | BindingFlags.Instance);

    // Tag that marks the figure contents in the sample document.
    static readonly PdfName EMBEDDED_DOCUMENT = new PdfName("EmbeddedDocument");

    public override void RenderText(TextRenderInfo renderInfo)
    {
        IList<MarkedContentInfo> markedContentInfos = (IList<MarkedContentInfo>)MarkedContentInfosField.GetValue(renderInfo);

        if (markedContentInfos != null && markedContentInfos.Count > 0)
        {
            foreach (MarkedContentInfo info in markedContentInfos)
            {
                // Drop this text chunk if any enclosing marked content section carries the tag.
                if (EMBEDDED_DOCUMENT.Equals(MarkedContentInfoTagField.GetValue(info)))
                    return;
            }
        }

        // All other text is handled by the regular location-based strategy.
        base.RenderText(renderInfo);
    }
}
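
For comparison, the RenderFilter option mentioned above might look roughly like the following. This is only a sketch, assuming iTextSharp 5.x: the names EmbeddedDocumentFilter, FilteredExtraction and GetPageText are made up for illustration, and the reflection on the non-public fields markedContentInfos and tag is the same trick as in the strategy class, so it may break with other library versions.

```csharp
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical filter: rejects text inside marked content tagged EmbeddedDocument.
class EmbeddedDocumentFilter : RenderFilter
{
    // Non-public fields, read via reflection (assumed to exist as in the strategy class).
    static readonly FieldInfo InfosField = typeof(TextRenderInfo).GetField(
        "markedContentInfos", BindingFlags.NonPublic | BindingFlags.Instance);
    static readonly FieldInfo TagField = typeof(MarkedContentInfo).GetField(
        "tag", BindingFlags.NonPublic | BindingFlags.Instance);

    static readonly PdfName EmbeddedDocument = new PdfName("EmbeddedDocument");

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        var infos = InfosField.GetValue(renderInfo) as IEnumerable<MarkedContentInfo>;
        if (infos == null)
            return true; // no marked content information: keep the text

        foreach (MarkedContentInfo info in infos)
        {
            if (EmbeddedDocument.Equals(TagField.GetValue(info)))
                return false; // text belongs to a figure: drop it
        }
        return true;
    }
}

static class FilteredExtraction
{
    // Hypothetical helper: wraps an unmodified strategy in the filter.
    public static string GetPageText(PdfReader reader, int page)
    {
        ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), new EmbeddedDocumentFilter());
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }
}
```

The point of this variant is that the filtering lives outside the extraction strategy, so the same filter could be put in front of a different ITextExtractionStrategy, including a custom one.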

Applying the TagFilteringExtractionStrategy to the sample document like this

using (PdfReader reader = new PdfReader(filename))
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        String text = PdfTextExtractor.GetTextFromPage(reader, page, new TagFilteringExtractionStrategy());
        Console.Write("\n=======\nPage {0}\n=======\n{1}\n", page, text);
    }
}

one gets for the example section the output

1 (a) Define a vector quantity.
 ...................................................................................................................................................
 ............................................................................................................................................. [1]
 (b) Circle all the vector quantities in the list below.
    acceleration   speed   time   displacement   weight [1]
 (c) Fig. 1.1 shows graphs of velocity v against time t for two cars A and B travelling along a 
straight level road in the same direction.
Fig. 1.1
  At time t = 0, both cars are side-by-side.
  (i) Describe the motion of car A from t = 0 to t = 10 s.
 ...........................................................................................................................................
 ...........................................................................................................................................
 ..................................................................................................................................... [2]

Thus, the only remaining part of the figure is the title "Fig. 1.1".
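
If the remaining title has to go as well, one possible follow-up is to post-process the extracted text. The sketch below is not part of the extraction API; it merely assumes that a caption always sits on a line of its own and looks like "Fig. 1.1". The names CaptionCleanup and RemoveCaptionLines are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CaptionCleanup
{
    // Matches a line that consists only of a caption label such as "Fig. 1.1".
    // Assumption: captions are never followed by other text on the same line.
    static readonly Regex CaptionLine = new Regex(@"^\s*Fig\.\s*\d+(\.\d+)*\s*$");

    // Returns the page text without the caption-only lines.
    public static string RemoveCaptionLines(string pageText)
    {
        var kept = pageText.Split('\n').Where(line => !CaptionLine.IsMatch(line));
        return string.Join("\n", kept);
    }
}
```

Because the pattern is anchored to the whole line, a reference inside a sentence, such as "(c) Fig. 1.1 shows graphs of velocity v against time t ...", is left untouched.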

Upvotes: 1
