Reputation: 9763
I am trying to parse a 2 column list that is separated using multiple spaces for columns and single spaces for words within a column. Nothing I have tried has successfully split the data into two columns. How do I do this?
library(data.table)
item.ids<-fread("http://eve-files.com/chribba/typeid.txt",sep2=" ")
Example of the dataset:
typeID typeName
----------- ----------------------------------------
0 #System
2 Corporation
3 Region
4 Constellation
5 Solar System
Upvotes: 0
Views: 634
Reputation: 13334
Read in your text file line-by-line:
l <- list()
fileName <- "http://eve-files.com/chribba/typeid.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
l[i] <- list(linn[i])
}
close(conn)
Create a list of all entries:
l_new <- list()
for(p in 1:length(l)) {
new_vec <- unlist(strsplit(gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", l[[p]], perl=TRUE), " "))
if(!is.na(new_vec[4])) {
new_vec_t <- paste(new_vec[2], new_vec[3], new_vec[4])
}
else if (!is.na(new_vec[3])) {
new_vec_t <- paste(new_vec[2], new_vec[3])
}
else {
new_vec_t <- paste(new_vec[2])
}
l_new[p] <- list(c(new_vec[1], new_vec_t))
}
Convert your list to a dataframe:
l_new_frame <- data.frame(do.call('rbind', l_new))
l_new_frame <- l_new_frame[-c(1,2),]
names(l_new_frame) <- c('typeID', 'typeName')
Check results:
print(l_new_frame[1:100,], row.names = FALSE)
Upvotes: 0
Reputation: 193517
Here's one approach that uses extract
from "tidyr" that should be pretty easy to follow.
First, we read the data in, and inspect the first few lines and last few lines. After inspection, we find that the data values are from lines 3 to 22384.
x <- readLines("http://eve-files.com/chribba/typeid.txt")
# Check out the data
head(x) # Let's get rid of the first two lines...
tail(x) # ... and the last 3
In the extraction stage, we're basically looking for:
([0-9]+)
). It's in ()
, so capture it and extract it to a new column.[ ]{2,}
). That's not in ()
, so we don't need to extract that into a new column.(.*)
). That's in ()
, so capture that and extract it into a new column.I've also used the first value of "x" to extract the original column names.
Here's what it looks like:
library(tidyverse)
data_frame(V1 = x[3:(length(x)-3)]) %>%
extract(V1, into = scan(text = x[1], what = ""), regex = "([0-9]+)[ ]{2,}(.*)")
# # A tibble: 22,382 x 2
# typeID typeName
# * <chr> <chr>
# 1 0 #System
# 2 2 Corporation
# 3 3 Region
# 4 4 Constellation
# 5 5 Solar System
# 6 6 Sun G5 (Yellow)
# 7 7 Sun K7 (Orange)
# 8 8 Sun K5 (Red Giant)
# 9 9 Sun B0 (Blue)
# 10 10 Sun F0 (White)
# # ... with 22,372 more rows
Or
data_frame(V1 = x[3:(length(x)-3)]) %>%
separate(V1, into = scan(text = x[1], what = ""), sep = "[ ]{2,}",
extra = "merge", convert = TRUE)
Another approach might be to use strsplit
with [ ]{2, }
as the split value. do.call(rbind, ...)
would be the idiom to follow after that, but you might want to filter only for cases where the split resulted in two values.
do.call(rbind, Filter(function(z) length(z) == 2, strsplit(x, "[ ]{2, }")))
Upvotes: 1
Reputation: 145775
This seems to work:
library(readr)
url = "http://eve-files.com/chribba/typeid.txt"
df = read_fwf(url, fwf_empty(url), skip = 2)
colnames = read_table(url, n_max = 1)
names(df) = names(colnames)
df = na.omit(df)
dim(df)
# [1] 22382 2
summary(df)
# typeID typeName
# Min. : 0 Length:22382
# 1st Qu.: 13986 Class :character
# Median : 22938 Mode :character
# Mean : 53827
# 3rd Qu.: 30209
# Max. :368620
Upvotes: 1