moman822

Reputation: 1954

Splitting a string by more than one space

I am trying to load some data into R that is in the following format (as a text file)

Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43

The problem I have is that the "columns" are separated by spaces such that they all line up nicely, but one row may have 5 spaces between values and the next 10 spaces. So when I load it in using read.delim I get a one column data.frame with

"John,Smith            United Kingdom     20"

as the first observation and so on.

Is there any way I can either:

  1. Load the data into R into a usable format? or
  2. Split the character strings into separate columns once I have loaded them in the one-column format?

My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so, for example, "United Kingdom" stays together and doesn't become "United" "" "Kingdom"). But I don't know if that is possible.

I tried strsplit(data.frame[,1], split="\\s") but it returns a list of character strings like:

"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...

which I don't know what to do with.

Upvotes: 5

Views: 3071

Answers (2)

Colonel Beauvel

Reputation: 31181

You can do this in base R, provided that no value in your columns contains two or more consecutive spaces:

txt = "Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43"

conn = textConnection(txt)
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u,'\\s{2,}')[[1]]))
#     [,1]                [,2]             [,3] 
#[1,] "Name"              "Country"        "Age"
#[2,] "John,Smith"        "United Kingdom" "20" 
#[3,] "Washington,George" "USA"            "50" 
#[4,] "Martin,Joseph"     "Argentina"      "43" 
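
If a data frame is wanted rather than a character matrix, one possible follow-up is sketched below. It is not part of the original answer; the names parts, m and df are made up for illustration, and it assumes the first line is the header and that Age always holds whole numbers.

```r
txt <- "Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43"

# Read the lines and close the connection explicitly
conn <- textConnection(txt)
lines <- readLines(conn)
close(conn)

# Split every line on runs of two or more whitespace characters
parts <- strsplit(lines, "\\s{2,}")
m <- do.call(rbind, parts)

# The first row holds the header; the remaining rows are observations
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Everything is character after strsplit, so convert Age by hand
df$Age <- as.integer(df$Age)
```

Note that do.call(rbind, ...) only gives a clean matrix when every line splits into the same number of pieces, so an empty field in any row would need separate handling.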

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

Having columns that all "line up nicely" is a typical characteristic of fixed-width data.

For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file whose path is stored in "x". For your actual use, replace "x" with the file name/path, as you would normally use with read.delim.

Here's the sample data:

x <- tempfile()
cat("Name                  Country            Age\nJohn,Smith            United Kingdom     20\nWashington,George     USA                50\nMartin,Joseph         Argentina          43\n", file = x)

R has its own function for reading fixed-width data (read.fwf), but it is notoriously slow and you need to know the widths before you can get started. We can count those by hand if the file is small, and then use something like:

read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1, 
         col.names = c("Name", "Country", "Age"))
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43
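
Counting by hand gets tedious, so here is a rough sketch of deriving the widths from the header line instead. It is only an illustration, not part of the original answer: it assumes every header name is a single word and that each column is left-aligned under its header. The names lines, starts and widths are made up, and the sample file is rebuilt with sprintf so that the alignment is exact.

```r
# Rebuild the sample file with exact padding (22 and 19 characters wide)
x <- tempfile()
writeLines(sprintf("%-22s%-19s%s",
                   c("Name", "John,Smith", "Washington,George", "Martin,Joseph"),
                   c("Country", "United Kingdom", "USA", "Argentina"),
                   c("Age", "20", "50", "43")),
           x)

lines <- readLines(x)

# Start position of each header word (assumes no spaces inside a header name)
starts <- as.integer(gregexpr("\\S+", lines[1])[[1]])

# A column runs from its start to the next start; the last one runs to the
# end of the longest line
widths <- diff(c(starts, max(nchar(lines)) + 1L))

df <- read.fwf(x, widths, strip.white = TRUE, skip = 1,
               col.names = c("Name", "Country", "Age"))
```

This breaks down if a header contains a space or if a column is right-aligned, in which case the widths still have to be supplied manually.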

Alternatively, you can let fwf_empty from the "readr" package guess the widths for you (it looks for columns that are empty in every row), and then use read_fwf:

library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43

Upvotes: 4
