Reputation: 1954
I am trying to load some data into R that is in the following format (as a text file)
Name                  Country           Age
John,Smith            United Kingdom    20
Washington,George     USA               50
Martin,Joseph         Argentina         43
The problem I have is that the "columns" are separated by spaces so that they all line up nicely, but one row may have 5 spaces between values and the next 10. So when I load it using read.delim, I get a one-column data.frame with "John,Smith            United Kingdom    20" as the first observation, and so on.
Is there any way I can either:
My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so that, for example, "United Kingdom" stays together and doesn't become "United" "" "Kingdom"). But I don't know if that is possible.
I tried strsplit(data.frame[,1], split="\\s"), but it returns a list of character vectors like:
"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...
which I don't know what to do with.
Upvotes: 5
Views: 3071
Reputation: 31181
You can do this in base R, provided no value contains two or more consecutive spaces:
txt = "Name                  Country           Age
John,Smith            United Kingdom    20
Washington,George     USA               50
Martin,Joseph         Argentina         43"
conn = textConnection(txt)
# Split each line on runs of two or more whitespace characters, then stack the pieces row-wise
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u, '\\s{2,}')[[1]]))
#      [,1]                [,2]             [,3]
# [1,] "Name"              "Country"        "Age"
# [2,] "John,Smith"        "United Kingdom" "20"
# [3,] "Washington,George" "USA"            "50"
# [4,] "Martin,Joseph"     "Argentina"      "43"
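If a data.frame is wanted rather than a character matrix, the split rows can be promoted in one more step. This is only a minimal sketch, assuming the first line is the header, every line splits into the same number of pieces, and the type guesses made by type.convert are acceptable; the names `lines`, `parts`, `m` and `df` are illustrative only:

```r
# Sample input: columns separated by runs of two or more spaces
lines <- c("Name                  Country           Age",
           "John,Smith            United Kingdom    20",
           "Washington,George     USA               50",
           "Martin,Joseph         Argentina         43")

# Split every line on runs of 2+ whitespace characters and stack into a matrix
parts <- strsplit(lines, "\\s{2,}")
m <- do.call(rbind, parts)

# Use the first row as column names and the remaining rows as data
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Let R guess column types; as.is = TRUE keeps text columns as character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

If some line splits into a different number of pieces (for example, a missing value), rbind will recycle values silently, so it may be worth checking `lengths(parts)` first.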
Upvotes: 2
Reputation: 193687
Having columns that all "line up nicely" is a typical characteristic of fixed-width data.
For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file called "x". For your actual use, replace "x" with the file name/path, as you would normally use with read.delim.
Here's the sample data:
x <- tempfile()
cat("Name                  Country           Age\nJohn,Smith            United Kingdom    20\nWashington,George     USA               50\nMartin,Joseph         Argentina         43\n", file = x)
R has its own function for reading fixed-width data (read.fwf), but it is notoriously slow and you need to know the widths before you can get started. We can count those if the file is small, and then use something like:
read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1,
col.names = c("Name", "Country", "Age"))
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43
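Counting widths by hand gets tedious, so here is a rough sketch of deriving them from the header line instead. It assumes that each header name is a single word with no spaces and that every column's values start at the same character position as its header; neither is guaranteed for real files. The names `fwf_file`, `starts` and `widths` are made up for this example:

```r
# Build the sample fixed-width file (a stand-in for the real file path)
fwf_file <- tempfile()
writeLines(sprintf("%-22s%-18s%s",
                   c("Name", "John,Smith", "Washington,George", "Martin,Joseph"),
                   c("Country", "United Kingdom", "USA", "Argentina"),
                   c("Age", "20", "50", "43")),
           fwf_file)

all_lines <- readLines(fwf_file)
header <- all_lines[1]

# Start position and text of each header word
hits <- gregexpr("\\S+", header)
starts <- as.integer(hits[[1]])
col_names <- regmatches(header, hits)[[1]]

# Each width runs to the next start; the last column runs to the longest line's end
widths <- diff(c(starts, max(nchar(all_lines)) + 1))

df <- read.fwf(fwf_file, widths, strip.white = TRUE, skip = 1,
               col.names = col_names)
df
```

For the sample data this yields widths of 22, 18 and 3. The last width is taken from the longest line, so it adapts if a value in the final column is longer than its header.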
Alternatively, you can let fwf_empty from the "readr" package guess the widths for you, and then use read_fwf:
library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43
Upvotes: 4