Sofie
Sofie

Reputation: 5

Reading raw data file dlm

I have problem reading in a raw data file. The problem is that some of inputs gets cut of because of the delimiter. Since one of the title has "\" in front of the real title, the Book_Title output is only "\". I was wondering if there is a way of ignoring those symbols.

Input:

0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"
085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

The code:

data rating.books;
infile "&path\BX-Books.csv" dlm=';' missover dsd firstobs=2;
input   ISBN: $12.
            Book_Title: $quote150.
            Book_Author: $quote60.
            Year_Of_Publication: $quote8.
            Publisher: $quote60.;
run;

Output:

ISBN | Book-Title | Book-Author | Publisher | Publication-Year 
0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002 
085409878X | \ | being dust from the law courts,"|  1973 | Missing value 

Desired output:

     ISBN | Book-Title | Book-Author | Publisher | Publication-Year 
    0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002 
    085409878X | Pie-powder being dust from the law courts |John Alderson Foote | EP Publishing | 1973 

Upvotes: 0

Views: 157

Answers (2)

Tom
Tom

Reputation: 51611

It does not look like your source data is following any known pattern.

If you read it without the DSD option then it will treat the second line as having 6 fields.

085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

v1=085409878X
v2="\"Pie-powder\"
v3=being dust from the law courts
v4=John Alderson Foote"
v5="1973"
v6="EP Publishing"

If you try to "fix" the escaped quotes

_infile_=tranwrd(_infile_,'\"','""');

then you will end up with only 4 fields.

085409878X;"""Pie-powder""; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

v1=085409878X
v2="Pie-powder"; being dust from the law courts;John Alderson Foote
v3=1973
v4=EP Publishing
v5=
v6=

To get your desired output you could try removing the \"; and the "\" strings.

_infile_=tranwrd(_infile_,'\";',' ');
_infile_=tranwrd(_infile_,'"\"','');

Which does make it read as you want.

085409878X; Pie-powder  being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"

v1=085409878X
v2=Pie-powder  being dust from the law courts
v3=John Alderson Foote"
v4=1973
v5=EP Publishing
v6=

Not sure if that will generalize to other lines with extra quotes or extra semi-colons.

Upvotes: 1

Thogerar
Thogerar

Reputation: 339

You have to change a bit your code to put the missing column into a string $150. like that :

data work.books;
infile "h:\desktop\test.csv" dlm=';' missover dsd firstobs=1;
input   ISBN: $12.
            Book_Title: $150.
            Book_Author: $quote60.
            Year_Of_Publication: $quote8.
            Publisher: $quote60.;
run;

Then, you have to clean the column from special characters " and \ with this macro function :

%macro cleaningColumn(col);
    compress(strip(&col),'\"',' ')
%mend cleaningColumn;

You can include the macro function into a proc sql statement like this :

proc sql;
create table want as
    select 
        ISBN,
        %cleaningColumn(Book_Title) as Book_Title,
        Book_Author,
        Year_Of_Publication,
        Publisher
    from books;
run;

The column Book_Title will be like this :

Classical Mythology
Pie-powder

Regards,

Upvotes: 0

Related Questions