sacchh
sacchh

Reputation: 37

Importing .txt file in R

I want to import a txt file that has following pattern:

"X1"\"ID_T35.x"\"Produktionsdatum.x"\"Herstellernummer.x"\"Werksnummer.x"\"Fehlerhaft.x"\"Fehlerhaft_Datum.x"\"Fehlerhaft_Fahrleistung.x"\"ID_T35.y"\"Produktionsdatum.y"\"Herstellernummer.y"\"Werksnummer.y"\"Fehlerhaft.y"\"Fehlerhaft_Datum.y"\"Fehlerhaft_Fahrleistung.y""1"\2\"35-217-2173-92"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"2"\9\"35-217-2173-23"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"3"\10\"35-217-2173-24"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"4"\11\"35-217-2173-52"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"5"\12\"35-217-2173-104"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"6"\19\"35-217-2173-50"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"7"\20\"35-217-2173-93"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"8"\21\"35-217-2173-111"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"9"\22\"35-217-2173-113"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"10"\23\"35-217-2173-138"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"11"\24\"35-217-2173-150"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"12"\25\"35-217-2173-158"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"13"\33\"35-217-2173-8"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"14"\34\"35-217-2173-43"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"15"\35\"35-217-2173-79"\2008-11-07\"217"\2173\1\2010-04-28\41562.3534246575\NA\NA\NA\NA\NA\NA\NA"16"\36\"35-217-2173-102"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"17"\37\"35-217-2173-117"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"18"\38\"35-217-2173-139"\2008-11-07\"217"\2173\1\2010-04-28\41562.3534246575\NA\NA\NA\NA\NA\NA\NA"19"\40\"35-217-2173-55"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"20"\41\"35-217-2173-182"\2008-11-08\"217"\2173\1\2010-04-29\41648.7616438356\NA\NA\NA\NA\NA\NA\NA"21"\42\"35-217-2173-212"\2008-11-08\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"22"\43\"35-217-2173-227"\2008-11-08\"217"\2173\1\2010-04-29\41648.7616438356\NA\NA\NA\NA\NA\NA\NA"23"\44\"35-217-2173-283"\2008-11-08\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"24"\49\"35-217-2173-41"\2008-11-07\"217"\2173\0\NA\0\NA\NA\NA\NA\NA\NA\NA"25"

The backslash indicates a new column up until the string "1". That should be a new row. The dataset should have 818844 rows. The dataset is called Einzelteil_T35. My approach has been this one:

library(readr)
library(data.table)
library(tidyverse)

Einzelteil_T35 <- read_file("Einzelteil_T35.txt") %>%
  str_replace_all(. , "\"\"", ";") %>%
  str_replace_all(. ,"____", "\n" ) %>%    #"_____" means i haven't find a pattern to identify the new row
  fread(text = .) %>%
  as_tibble()

I've been thinking about a way of counting backslashes and creating a new row everytime I find the 15th backslash because the 15th backslash is the begining of a new row but I don't find a way. Maybe you can help me with this approach or with something completely different.

Thank you very much in advance

Upvotes: 0

Views: 461

Answers (2)

Martin Wettstein
Martin Wettstein

Reputation: 2894

This code reads the file, converts it to an Nx15 matrix, removes the line number from the last column, uses the first line as header and then converts it to a data.frame.

The only remaining problem is that the conversion to a matrix converted all cell contents to characters. You'll have to manually convert the columns back to numeric.

data = paste(scan("data_line.txt",what="character"),collapse='') ## Read the file
dmat = matrix(strsplit(data,"\\\\")[[1]],ncol=15,byrow=T) ## Convert it to a matrix
dmat[,15] = gsub("\".*[0-9]\"","",dmat[,15]) ## Remove the next line number from the values of the last column
colnames(dmat)=dmat[1,] ## Take first line as names
dmat = dmat[-1,] ## Remove first line (as it contained the names)
df = as.data.frame(dmat)

EDIT: Fixed the regular expression.

Upvotes: 1

flafont11
flafont11

Reputation: 137

If you can be sure that the pattern "1" only ever appears to mark the new row, maybe you can try this?

Edit: I meant without the first pipe.

library(readr)
library(data.table)
library(tidyverse)

Einzelteil_T35 <- read_file("Einzelteil_T35.txt") %>%
  str_replace_all(. ,'\\"1\\"', "\n" ) %>%   
  fread(text = .) %>%
  as_tibble()

If not maybe there is a way to make the pattern a bit more specific, for instance if the "" before the new row is always preceded by ".y" or something.

Upvotes: 1

Related Questions