Reputation: 175
I am using the method below to extract PDF text line by line. The problem is that it does not read the spaces between words and figures. What could be the solution for this?
I just want to create a list of strings, where each string in the list holds one text line from the PDF exactly as it appears in the PDF, including spaces.
public void ReadTextLineByLine(string filename)
{
    List<string> strlist = new List<string>();
    PdfReader reader = new PdfReader(filename);
    string text = string.Empty;
    for (int page = 1; page <= 1; page++) // currently only the first page
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy()) + " ";
    }
    reader.Close();
    string[] lines = text.Split('\n');
    foreach (string line in lines)
    {
        strlist.Add(line);
    }
    foreach (string st in strlist)
    {
        Response.Write(st + "<br/>");
    }
}
I have also tried this method with SimpleTextExtractionStrategy, but that does not work for me either.
Upvotes: 5
Views: 18050
Reputation: 1166
The solution proposal and the background in @mkl's excellent answer are still valid, but for iText 7 (currently in version 8) the derived LocationTextExtractionStrategy will look different, because the word boundary checks have moved into an internal implementation of the ITextChunkLocation interface, exposed by the TextChunks.
That internal implementation is of course not accessible for derivation, but its logic can be adapted from the source for direct use in a derived LocationTextExtractionStrategy:
protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
{
    var @this = chunk.GetLocation();
    var previous = previousChunk.GetLocation();
    // In case a text chunk is of zero length, this probably means it is a mark character,
    // and we do not actually want to insert a space in that case.
    if (@this.GetStartLocation().Equals(@this.GetEndLocation()) || previous.GetEndLocation().Equals(previous.GetStartLocation()))
    {
        return false;
    }
    float dist = @this.DistanceFromEndOf(previous);
    if (dist < 0)
    {
        dist = previous.DistanceFromEndOf(@this);
        // The chunks intersect; we do not need to add a space in this case.
        if (dist < 0)
        {
            return false;
        }
    }
    return dist > @this.GetCharSpaceWidth() / 2f;
}
The default strategy's algorithm has become notably more advanced in iText 7, but the heuristic described by @mkl is still present and can be tweaked as needed.
In my case specifically, replacing 2f with 4f gave me even better text extraction from one PDF document than copying the text from Adobe Acrobat Reader, which otherwise worked much better than iText 7 with its default strategy.
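Stripped of the iText types, the decision above reduces to a pure function over chunk start/end coordinates and the font's space width. The following is a standalone sketch for reasoning about the divisor before wiring it into a derived LocationTextExtractionStrategy; `ChunkSpan` is an illustrative stand-in (not an iText 7 type), and positions are simplified to scalars along the text line:

```csharp
using System;

// Hypothetical stand-in for the data ITextChunkLocation exposes; not an iText type.
// Positions are simplified to scalar offsets along the line of text.
public readonly struct ChunkSpan
{
    public readonly float Start;          // start offset of the chunk
    public readonly float End;            // end offset of the chunk
    public readonly float CharSpaceWidth; // width of a space in the chunk's font

    public ChunkSpan(float start, float end, float charSpaceWidth)
    {
        Start = start; End = end; CharSpaceWidth = charSpaceWidth;
    }
}

public static class WordBoundary
{
    // Mirrors the override above: zero-length chunks are treated as mark
    // characters, intersecting chunks never get a space, otherwise the gap
    // is compared against CharSpaceWidth / divisor (2f is the library
    // default; 4f is the value used in the answer above).
    public static bool IsChunkAtWordBoundary(ChunkSpan current, ChunkSpan previous, float divisor = 2f)
    {
        if (current.Start == current.End || previous.Start == previous.End)
            return false; // zero-length chunk: probably a mark character

        float dist = current.Start - previous.End;
        if (dist < 0)
        {
            dist = previous.Start - current.End;
            if (dist < 0)
                return false; // chunks intersect: no space needed
        }
        return dist > current.CharSpaceWidth / divisor;
    }
}
```

A gap of 1.5 units after a chunk whose space is 4 units wide is below the default threshold (4 / 2 = 2) but above the tightened one (4 / 4 = 1), which is exactly the kind of densely set gap the 4f tweak recovers.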
Upvotes: 0
Reputation: 1
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder textfinal = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        string page = PdfTextExtractor.GetTextFromPage(reader, i);
        string[] lines = page.Split('\n');
        foreach (string line in lines)
        {
            textfinal.Append(line);
            textfinal.Append(Environment.NewLine);
        }
    }
}
Upvotes: -1
Reputation: 95888
The background on why spaces between words are sometimes not properly recognized by iText(Sharp) or other PDF text extractors has been explained in this answer to "itext java pdf to text creation": these 'spaces' are not necessarily created using a space character but instead using an operation creating a small gap. These operations are also used for other purposes (which do not break words), though, and so a text extractor must use heuristics to decide whether such a gap is a word break or not...
In particular, this implies that you will never get 100% reliable word break detection.
What you can do, though, is improve the heuristics used.
The standard iText and iTextSharp text extraction strategies, for example, assume a word break in a line if
a) there is a space character or
b) there is a gap at least as wide as half a space character.
Item a is a sure hit, but item b may often fail for densely set text. The OP of the question behind the answer referenced above got quite good results using a quarter of the space character's width instead.
You can tweak these criteria by copying and changing the text extraction strategy of your choice.
In the SimpleTextExtractionStrategy you find this criterion embedded in the RenderText method:
if (spacing > renderInfo.GetSingleSpaceWidth() / 2f) {
    AppendTextChunk(' ');
}
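Numerically, the divisor controls how small a gap still counts as a word break. The sketch below demonstrates the effect with made-up spacing values (the chunk gaps and space width are illustrative, not taken from any real PDF):

```csharp
using System;
using System.Text;

public static class SpacingDemo
{
    // A space is inserted when the gap before a chunk exceeds
    // singleSpaceWidth / divisor; 2f is the iText(Sharp) default,
    // 4f is the value that helped with densely set text.
    public static bool NeedsSpace(float spacing, float singleSpaceWidth, float divisor)
        => spacing > singleSpaceWidth / divisor;

    // Rebuilds a line from (gap-before-chunk, text) pairs, purely for illustration.
    public static string Assemble((float Gap, string Text)[] chunks, float spaceWidth, float divisor)
    {
        var sb = new StringBuilder();
        foreach (var (gap, txt) in chunks)
        {
            if (sb.Length > 0 && NeedsSpace(gap, spaceWidth, divisor))
                sb.Append(' ');
            sb.Append(txt);
        }
        return sb.ToString();
    }
}
```

With a space width of 2.5 units, a gap of 0.8 units is below the default threshold (2.5 / 2 = 1.25) but above the quarter-width one (2.5 / 4 = 0.625), so the default divisor glues the words together while the tweaked one separates them.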
In case of the LocationTextExtractionStrategy, this criterion has meanwhile been put into a method of its own:
/**
 * Determines if a space character should be inserted between a previous chunk and the current chunk.
 * This method is exposed as a callback so subclasses can fine tune the algorithm for determining whether a space should be inserted or not.
 * By default, this method will insert a space if there is a gap of more than half the font space character width between the end of the
 * previous chunk and the beginning of the current chunk. It will also indicate that a space is needed if the starting point of the new chunk
 * appears *before* the end of the previous chunk (i.e. overlapping text).
 * @param chunk the new chunk being evaluated
 * @param previousChunk the chunk that appeared immediately before the current chunk
 * @return true if the two chunks represent different words (i.e. should have a space between them). False otherwise.
 */
protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if (dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;
    return false;
}
The intention behind putting this into a method of its own was that adjusting the heuristics should merely require subclassing the strategy and overriding this method. This works fine for the equivalent iText Java class, but during the port to iTextSharp unfortunately no virtual was added to the declaration (as of version 5.4.4). Thus, copying the whole strategy is currently still necessary for iTextSharp.
@Bruno You might want to tell the iText -> iTextSharp porting team about this.
While you can fine-tune text extraction at these code locations, you should be aware that you will not find a 100% criterion here: as explained above, the same gap-creating operations serve purposes other than word breaks.
You can do better than the iText heuristics (and those derived from them using other constants) by taking into account the actual visual free space between all characters, using PDF rendering or font information analysis mechanisms, but a perceptible improvement requires investing considerable time.
Upvotes: 17
Reputation: 89
I have my own implementation, and it works very well.
/// <summary>
/// Reads a PDF file and returns its string content.
/// </summary>
/// <param name="par">ByteArray, MemoryStream or URI</param>
/// <returns>The file content.</returns>
public static string ReadPdfFile(object par)
{
    if (par == null) throw new ArgumentNullException(nameof(par));
    PdfReader pdfReader = null;
    var text = new StringBuilder();
    if (par is MemoryStream)
        pdfReader = new PdfReader((MemoryStream)par);
    else if (par is byte[])
        pdfReader = new PdfReader((byte[])par);
    else if (par is Uri)
        pdfReader = new PdfReader((Uri)par);
    if (pdfReader == null)
        throw new InvalidOperationException("Unable to read the file.");
    for (var page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        var strategy = new SimpleTextExtractionStrategy();
        // GetTextFromPage already returns a .NET string; an extra encoding
        // round-trip is unnecessary and risks corrupting non-ANSI characters.
        text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
    }
    pdfReader.Close();
    return text.ToString();
}
Upvotes: -1