Rilcon42
Rilcon42

Reputation: 9763

fread specifying separator within column

I am trying to parse a 2 column list that is separated using multiple spaces for columns and single spaces for words within a column. Nothing I have tried has successfully split the data into two columns. How do I do this?

library(data.table)
item.ids<-fread("http://eve-files.com/chribba/typeid.txt",sep2=" ")

Example of the dataset:

typeID      typeName
----------- ----------------------------------------
0           #System
2           Corporation
3           Region
4           Constellation
5           Solar System

Upvotes: 0

Views: 634

Answers (3)

Cybernetic
Cybernetic

Reputation: 13334

Read in your text file line-by-line:

l <- list()
fileName <- "http://eve-files.com/chribba/typeid.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in 1:length(linn)){
   l[i] <- list(linn[i])
}
close(conn)

Create a list of all entries:

l_new <- list()
for(p in 1:length(l)) {

    new_vec <- unlist(strsplit(gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", l[[p]], perl=TRUE), " "))

    if(!is.na(new_vec[4])) { 
        new_vec_t <- paste(new_vec[2], new_vec[3], new_vec[4])
    } 
    else if (!is.na(new_vec[3])) {
        new_vec_t <- paste(new_vec[2], new_vec[3])
    } 
    else {
        new_vec_t <- paste(new_vec[2])
    }

    l_new[p] <- list(c(new_vec[1], new_vec_t))

}

Convert your list to a dataframe:

l_new_frame <- data.frame(do.call('rbind', l_new))

l_new_frame <- l_new_frame[-c(1,2),]
names(l_new_frame) <- c('typeID', 'typeName')

Check results:

print(l_new_frame[1:100,], row.names = FALSE)

enter image description here

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here's one approach that uses extract from "tidyr" that should be pretty easy to follow.

First, we read the data in, and inspect the first few lines and last few lines. After inspection, we find that the data values are from lines 3 to 22384.

x <- readLines("http://eve-files.com/chribba/typeid.txt")
# Check out the data
head(x) # Let's get rid of the first two lines...
tail(x) # ... and the last 3

In the extraction stage, we're basically looking for:

  • A set of numbers--can be of varying lengths (([0-9]+)). It's in (), so capture it and extract it to a new column.
  • The numbers should be followed by 2 or more spaces ([ ]{2,}). That's not in (), so we don't need to extract that into a new column.
  • The set of spaces can be followed by anything else ((.*)). That's in (), so capture that and extract it into a new column.

I've also used the first value of "x" to extract the original column names.

Here's what it looks like:

library(tidyverse)
data_frame(V1 = x[3:(length(x)-3)]) %>%
  extract(V1, into = scan(text = x[1], what = ""), regex = "([0-9]+)[ ]{2,}(.*)")
# # A tibble: 22,382 x 2
#    typeID           typeName
#  *  <chr>              <chr>
#  1      0            #System
#  2      2        Corporation
#  3      3             Region
#  4      4      Constellation
#  5      5       Solar System
#  6      6    Sun G5 (Yellow)
#  7      7    Sun K7 (Orange)
#  8      8 Sun K5 (Red Giant)
#  9      9      Sun B0 (Blue)
# 10     10     Sun F0 (White)
# # ... with 22,372 more rows

Or

data_frame(V1 = x[3:(length(x)-3)]) %>%
  separate(V1, into = scan(text = x[1], what = ""), sep = "[ ]{2,}", 
           extra = "merge", convert = TRUE)

Another approach might be to use strsplit with [ ]{2, } as the split value. do.call(rbind, ...) would be the idiom to follow after that, but you might want to filter only for cases where the split resulted in two values.

do.call(rbind, Filter(function(z) length(z) == 2, strsplit(x, "[ ]{2, }")))

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 145775

This seems to work:

library(readr)
url = "http://eve-files.com/chribba/typeid.txt"
df = read_fwf(url, fwf_empty(url), skip = 2)
colnames = read_table(url, n_max = 1)
names(df) = names(colnames)
df = na.omit(df)

dim(df)
# [1] 22382     2
summary(df)
 #    typeID         typeName        
 # Min.   :     0   Length:22382      
 # 1st Qu.: 13986   Class :character  
 # Median : 22938   Mode  :character  
 # Mean   : 53827                     
 # 3rd Qu.: 30209                     
 # Max.   :368620    

Upvotes: 1

Related Questions