Reputation: 133
I'm trying to write code that parses a large text file. To get that text file, I run the original PDF through pdfminer. This works, but the resulting text contains many random spaces (see below):
SM ITH , JO HN , PHD
1234 S N O RT H AV E
Is there an easy way in Python to remove only certain spaces so that words aren't split apart? For the sample above, I want it to look like:
SMITH, JOHN, PHD
1234 S NORTH AVE
Thanks.
Upvotes: 1
Views: 418
Reputation: 13632
What you are trying to do is impossible, e.g., should "DESK TOP" be "DESK TOP" or "DESKTOP"?
Upvotes: 0
Reputation: 365657
Most likely what you're trying to do is impossible to do perfectly, and very hard to do well enough to satisfy you. I'll explain below.
But there's a good chance you shouldn't be doing it in the first place. pdfminer is highly configurable, and something like just specifying a smaller -M value will give you the text you wanted in the first place. You'll need to do a bit of trial and error, but if this works, it'll be far easier than trying to post-process things after the fact.
If you want to do this, you need to come up with a rule that determines which spaces are "random extra spaces" and which are real spaces before you can code that in Python. And I don't know that there is any such rule.
In your example, you can handle most of them by just turning multiple spaces into single spaces, and single spaces into nothing. It should be obvious how to do that. Even if you can't think of a clever solution, a triple replace works fine:
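import re
# Mark runs of two or more whitespace characters, delete the remaining single ones, then restore the marked runs as single spaces.
s = re.sub(r'\s\s+', r'<space>', s)
s = re.sub(r'\s', r'', s)
s = re.sub(r'<space>', r' ', s)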
However, this rule isn't quite right, because in "JO HN , PHD", the space after the comma isn't a random extra space, but it's not showing up as two or more spaces. And the same for the space in "1234 S". And, most likely, the same thing is true in lots of other cases for your real data.
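To see both the case where this rule works and the case where it breaks, here is a minimal sketch. The helper name collapse_spaces is made up for illustration, it does the same job as the triple replace in a single pass, and the first input is a hypothetical line in which the real gaps happen to come through doubled (not something taken from your data):

```python
import re

def collapse_spaces(s):
    """Runs of 2+ spaces become one space; lone spaces are dropped."""
    # One pass over each run of spaces, deciding by the run's length.
    return re.sub(r' +', lambda m: ' ' if len(m.group()) > 1 else '', s)

# Hypothetical input where real gaps are doubled: the rule works.
print(collapse_spaces("1234  S  N O RT H  AV E"))  # 1234 S NORTH AVE

# Input as shown in the question (single spaces only): real gaps are lost.
print(collapse_spaces("SM ITH , JO HN , PHD"))     # SMITH,JOHN,PHD
```

If your real output never doubles the genuine gaps, this rule has nothing to work with.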
A different somewhat close rule is that you only remove single spaces between letters. Again, if that works, it's easy to code. For example:
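# Lookarounds leave the neighboring characters unconsumed, so adjacent gaps such as "N O RT H" are all matched.
s = re.sub(r'(?<=\w)\s(?=\w)', r'', s)
s = re.sub(r'\s+', r' ', s)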
But now that leaves a space before the comma after SMITH and JOHN.
Maybe you need to put in a little information about English punctuation—strip the spaces around punctuation, then add back in the spaces after a comma or period, around quotes, etc.
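A rough sketch of that idea, under my own assumptions: the function name join_and_fix_punctuation is hypothetical, only commas, semicolons and colons are handled, and periods and quotes are left out because they need more care (decimals, abbreviations):

```python
import re

def join_and_fix_punctuation(s):
    # Drop single spaces between word characters. The lookarounds leave the
    # neighbours unconsumed, so back-to-back gaps are all matched.
    s = re.sub(r'(?<=\w) (?=\w)', '', s)
    # Strip whitespace around , ; : and put exactly one space back after.
    s = re.sub(r'\s*([,;:])\s*', r'\1 ', s)
    return s.strip()

print(join_and_fix_punctuation("SM ITH , JO HN , PHD"))  # SMITH, JOHN, PHD
print(join_and_fix_punctuation("1234 S N O RT H AV E"))  # 1234SNORTHAVE
```

The first line comes out as wanted. The second shows the limit of any purely positional rule: when a real gap looks exactly like a random one, it gets removed too.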
Or… well, nobody but you can know what your data look like and figure it out.
If you can't come up with a good rule, the only option is to build some complicated heuristics around looking up possible words in a dictionary and guessing which one is more likely—which still won't get everything right (e.g., how do you know whether "B OO K M AR K" is "BOOK MARK" or "BOOKMARK"?), but it's the best you could possibly do.
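For what it's worth, a dictionary heuristic could be sketched roughly like this. Everything here is an assumption for illustration: the function name rejoin_fragments, the tiny vocabulary, the cost values, and the choice to prefer the grouping with the fewest known words. A real version would need a proper word list and probably word frequencies.

```python
def rejoin_fragments(line, vocabulary, max_fragments=6):
    """Group consecutive fragments into known words at the lowest total cost."""
    fragments = line.split()
    n = len(fragments)
    # best[i] = (cost, words) for the cheapest grouping of fragments[:i]
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_fragments), i):
            candidate = "".join(fragments[j:i])
            if candidate in vocabulary:
                cost = 1      # a known word
            elif i - j == 1:
                cost = 5      # unknown fragment kept as-is, but penalised
            else:
                continue      # never glue fragments into an unknown word
            total = best[j][0] + cost
            if best[i] is None or total < best[i][0]:
                best[i] = (total, best[j][1] + [candidate])
    return " ".join(best[n][1])

# Hypothetical vocabulary, just enough for the examples.
vocab = {"S", "NORTH", "AVE", "BOOK", "MARK", "BOOKMARK"}
print(rejoin_fragments("1234 S N O RT H AV E", vocab))  # 1234 S NORTH AVE
print(rejoin_fragments("B OO K M AR K", vocab))         # BOOKMARK
```

Note that the second result is a guess: the fewest-words preference picks "BOOKMARK", and nothing in the text itself says "BOOK MARK" wasn't what the PDF meant.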
Upvotes: 3