Reputation: 5
I have problem reading in a raw data file. The problem is that some of inputs gets cut of because of the delimiter. Since one of the title has "\" in front of the real title, the Book_Title output is only "\". I was wondering if there is a way of ignoring those symbols.
Input:
0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"
085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
The code:
data rating.books;
infile "&path\BX-Books.csv" dlm=';' missover dsd firstobs=2;
input ISBN: $12.
Book_Title: $quote150.
Book_Author: $quote60.
Year_Of_Publication: $quote8.
Publisher: $quote60.;
run;
Output:
ISBN | Book-Title | Book-Author | Publisher | Publication-Year
0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002
085409878X | \ | being dust from the law courts,"| 1973 | Missing value
Desired output:
ISBN | Book-Title | Book-Author | Publisher | Publication-Year
0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002
085409878X | Pie-powder being dust from the law courts |John Alderson Foote | EP Publishing | 1973
Upvotes: 0
Views: 157
Reputation: 51611
It does not look like your source data is following any known pattern.
If you read it without the DSD
option then it will treat the second line as having 6 fields.
085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
v1=085409878X
v2="\"Pie-powder\"
v3=being dust from the law courts
v4=John Alderson Foote"
v5="1973"
v6="EP Publishing"
If you try to "fix" the escaped quotes
_infile_=tranwrd(_infile_,'\"','""');
then you will end up with only 4 fields.
085409878X;"""Pie-powder""; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
v1=085409878X
v2="Pie-powder"; being dust from the law courts;John Alderson Foote
v3=1973
v4=EP Publishing
v5=
v6=
To get your desired output you could try removing the \";
and the "\"
strings.
_infile_=tranwrd(_infile_,'\";',' ');
_infile_=tranwrd(_infile_,'"\"','');
Which does make it read as you want.
085409878X; Pie-powder being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
v1=085409878X
v2=Pie-powder being dust from the law courts
v3=John Alderson Foote"
v4=1973
v5=EP Publishing
v6=
Not sure if that will generalize to other lines with extra quotes or extra semi-colons.
Upvotes: 1
Reputation: 339
You have to change a bit your code to put the missing column into a string $150. like that :
data work.books;
infile "h:\desktop\test.csv" dlm=';' missover dsd firstobs=1;
input ISBN: $12.
Book_Title: $150.
Book_Author: $quote60.
Year_Of_Publication: $quote8.
Publisher: $quote60.;
run;
Then, you have to clean the column from special characters " and \ with this macro function :
%macro cleaningColumn(col);
compress(strip(&col),'\"',' ')
%mend cleaningColumn;
You can include the macro function into a proc sql statement like this :
proc sql;
create table want as
select
ISBN,
%cleaningColumn(Book_Title) as Book_Title,
Book_Author,
Year_Of_Publication,
Publisher
from books;
run;
The column Book_Title will be like this :
Classical Mythology
Pie-powder
Regards,
Upvotes: 0