warriormole

Reputation: 159

Why can't the letter 'f' often be copied from text in PDF files?

I am not sure whether this question belongs here, but it seems odd to me that the letter 'f' often gets messed up when copied from PDF text.

I do research as a student, and I read a lot of papers. This happens a lot when I want to copy the name of a paper to rename the pdf file.

For example, I opened the link to a paper in Chrome's built-in PDF viewer plug-in on a MacBook Pro running OS X 10.9. Try copying the title of the paper and pasting it. The 'f' in 'fluids' will be missing.

Upvotes: 11

Views: 9636

Answers (2)

user2846289

Reputation: 2215

I think the reason @warriormole can't copy "fl" is not the use of ligatures itself, but neglect or oversight on the part of the PDF file's creators. That was acceptable 10-15 or more years ago, when everyone was happy as long as the PDF showed the right 'picture' and no one thought about content extraction or about preserving the logical text, rather than just its visual appearance, in the long term. For a file created in 2010, it's a shame.

PDF provides ways to store the Unicode representation of any glyph used, and the file in question can be fixed relatively easily.
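To illustrate the point, here is a toy model in Python of what a text extractor does with and without a glyph-to-Unicode entry for the ligature. It does not read a real PDF: the glyph codes, both tables, and the `extract_text` helper are all made up for illustration, and real extractors may fall back differently when a mapping is missing.

```python
# Toy model only: the glyph codes and both tables are hypothetical,
# not taken from any real PDF or font.

# The word "fluids" drawn with a single "fl" ligature glyph (code 0x0C here).
GLYPHS = [0x0C, 0x75, 0x69, 0x64, 0x73]

# A table that knows the plain letters but has no entry for the ligature glyph.
INCOMPLETE_MAP = {0x75: "u", 0x69: "i", 0x64: "d", 0x73: "s"}

# The same table with the ligature glyph mapped to the two letters it stands for.
COMPLETE_MAP = {**INCOMPLETE_MAP, 0x0C: "fl"}


def extract_text(glyph_codes, to_unicode):
    """Translate glyph codes to text, skipping glyphs that have no mapping."""
    return "".join(to_unicode.get(code, "") for code in glyph_codes)


print(extract_text(GLYPHS, INCOMPLETE_MAP))  # uids
print(extract_text(GLYPHS, COMPLETE_MAP))    # fluids
```

With the incomplete table the "fl" silently disappears, which matches the symptom in the question; adding one entry for the ligature glyph is enough to make the copied text complete.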

Upvotes: 6

Jan Schejbal

Reputation: 4033

Not only the "f" will be missing; the whole "fl" will be.

The reason for this is so-called "ligatures". In order to look nice, some combinations of letters, most notably fi, get combined into a single character. This special character is rarely treated correctly when copy-pasting. You can see this below. If you try to select the ligature, you will notice that it is only one "letter". Note that your computer may still render the two separate letters using the ligature.

The following is a "fi" ligature: fi
The following is two letters: f‌i

Especially visible in a fixed-width font:

The following is a "fi" ligature: fi
The following is two letters:     f‌i
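
If the ligature character does survive the copy, it can be turned back into plain letters afterwards. The following is a minimal sketch using Python's standard `unicodedata` module; the `expand_ligatures` helper name is my own, and it assumes the pasted text contains a Unicode ligature code point such as U+FB01 or U+FB02. It cannot help when the character was dropped entirely during copying.

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    # NFKC normalization decomposes ligature code points into their letters.
    return unicodedata.normalize("NFKC", text)


ligature = "\ufb01"  # the single-character "fi" ligature

# The ligature really is one character, not two.
print(len(ligature), unicodedata.name(ligature))  # 1 LATIN SMALL LIGATURE FI

# After normalization it becomes the two ordinary letters.
print(len(expand_ligatures(ligature)))  # 2
print(expand_ligatures("\ufb02uids"))   # fluids
```

Note that NFKC also rewrites other compatibility characters (for example, superscript digits), so it is best applied to things like file names rather than to text whose exact characters matter.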

Upvotes: 13
