user3723721

Reputation: 11

Removing a single space between two specific words

I'm dealing with some census data that has been transcribed into a txt file. The fields are separated by spaces, however, rather than commas or tabs. Here are a few fields from a typical line, which will help illustrate my problem:

18A 1   239 18A Coffey Street     165    125 331 McLocklan      Donald     New York

Some of the fields are separated by multiple spaces, but others are separated by only one. Some fields, however, contain more than one word (e.g. New York), and those words are also separated by a single space.

I think I can handle this by distinguishing a single space between a lowercase letter and an uppercase letter from a single space between two uppercase letters. I am still new to regex, though, and am not sure how to write it. How can I replace a single space with an underscore when it falls between a word ending in a lowercase letter and a word beginning with an uppercase letter?

I think this would allow me to group things like Coffey_Street and New_York, without also connecting fields like 18A_Coffey. Any suggestions or advice would be most welcome. Thanks!

-Connor

Upvotes: 1

Views: 736

Answers (1)

yate

Reputation: 804

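I would ask whoever sent you the file to send it again with a better delimiter. Adding an underscore between a lowercase and an uppercase letter will not work in all cases: it also joins two separate fields whenever they happen to be divided by a single space and fit the same lowercase/uppercase pattern, and it misses multi-word fields that don't fit it.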
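To make the caveat concrete, here is a small sketch using a made-up line (not taken from the asker's file) in which the surname, first name, and place fields happen to be separated by single spaces. It applies the same lowercase-space-uppercase substitution as the command given further down, written with -E, which assumes a sed that accepts that option. The variable names are only for illustration.

```shell
# Hypothetical line: three distinct fields separated by one space each.
bad='331 McLocklan Donald New York'

# The rule cannot tell a field boundary from a space inside a field,
# so all three fields are glued into one token.
glued=$(printf '%s\n' "$bad" | sed -E 's/([a-z]) ([A-Z])/\1_\2/g')
printf '%s\n' "$glued"
```

This prints 331 McLocklan_Donald_New_York, so the name and the place can no longer be told apart.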

That said, you can accomplish it with this command (-r turns on extended regular expressions in GNU sed; on BSD/macOS sed, use -E instead):

sed -r 's/([a-z]) ([A-Z])/\1_\2/g' file

Explanation

([a-z]) - matches a lowercase letter and captures it as group 1
the space in between - matches exactly one space character
([A-Z]) - matches an uppercase letter and captures it as group 2

When sed finds a match for that pattern, it replaces it like this:

\1 - puts back the lowercase letter
_ - puts an underscore where the space was
\2 - puts back the uppercase letter

The trailing g applies the replacement to every match on a line, not just the first.
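As a quick check, the substitution can be run against the sample line from the question. This is only a sketch: it uses -E (the spelling of -r that both current GNU and BSD sed accept), the variable names are made up for illustration, and the awk step is an extra to show that the fields then split cleanly on whitespace.

```shell
# Sample line from the question, stored in a variable for illustration.
line='18A 1   239 18A Coffey Street     165    125 331 McLocklan      Donald     New York'

# Join "lowercase letter, one space, uppercase letter" with an underscore.
joined=$(printf '%s\n' "$line" | sed -E 's/([a-z]) ([A-Z])/\1_\2/g')
printf '%s\n' "$joined"

# With the multi-word fields joined, any run of whitespace is a delimiter,
# so awk's default field splitting can count the fields.
nfields=$(printf '%s\n' "$joined" | awk '{ print NF }')
printf '%s\n' "$nfields"
```

On this line, Coffey Street and New York become Coffey_Street and New_York, while 18A Coffey is left alone because A is not lowercase, and awk reports 11 fields.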

Upvotes: 1
