Reputation: 11
I'm dealing with some census data that has been transcribed into a txt file. However, the fields are separated by spaces rather than commas or tabs. Here are a few fields from a typical line to illustrate my problem:
18A 1 239 18A Coffey Street 165 125 331 McLocklan Donald New York
Some of the fields are separated by multiple spaces, but others are separated by only one. Some fields, however, contain more than one word (e.g., New York), and those words are also separated by a single space.
I think I can handle this by distinguishing single spaces between a lowercase letter and an uppercase letter from single spaces between two uppercase letters. I am still new to regex, however, and am not sure how to express this. Can anyone tell me how to replace a single space with an underscore when it falls between a word ending in a lowercase letter and a word beginning with an uppercase letter?
I think this would allow me to group things like Coffey_Street and New_York, without also connecting fields like 18A_Coffey. Any suggestions or advice would be most welcome. Thanks!
-Connor
Upvotes: 1
Views: 736
Reputation: 804
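I would ask whoever sent you the file to send it again with a better delimiter. Adding an underscore between a lowercase and an uppercase letter will not work in all cases: any two adjacent fields where the first ends in a lowercase letter and the second begins with an uppercase letter (such as a surname followed by a first name) would also be joined.
That said, you can accomplish it with this command (-r enables extended regular expressions in GNU sed; on BSD/macOS sed, use -E instead).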
sed -r 's/([a-z]) ([A-Z])/\1_\2/g' file
Explanation
([a-z]) - match a lowercase character and group it
([A-Z]) - match an uppercase character and group it
the space in between - matches a single space character
When sed finds a match for that pattern, it replaces it like this:
\1 - puts back the lowercase character
_ - puts an _ where the space was
\2 - puts back the uppercase character
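For anyone who would rather test the pattern outside sed, here is a minimal sketch of the same substitution in Python's re module. The function name join_words is just an illustrative choice, and the input is the sample line exactly as quoted in the question, so the real file's spacing may differ.

```python
import re

# Same pattern as the sed command: a lowercase letter, one space, an uppercase letter.
PATTERN = re.compile(r"([a-z]) ([A-Z])")


def join_words(line: str) -> str:
    """Replace a single space between a lowercase and an uppercase letter with '_'."""
    # \1 and \2 put the captured letters back on either side of the underscore.
    return PATTERN.sub(r"\1_\2", line)


sample = "18A 1 239 18A Coffey Street 165 125 331 McLocklan Donald New York"
print(join_words(sample))
# prints: 18A 1 239 18A Coffey_Street 165 125 331 McLocklan_Donald_New_York
```

Coffey_Street is joined and 18A Coffey is left alone, as intended, but the surname, first name, and birthplace are also merged into one field wherever they are separated by a single space. That is the kind of case where this approach falls short.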
Upvotes: 1