Reputation: 1271
I am running into some strange behavior when importing a .csv
into R. Certain entries don't seem to load correctly. Looking at the file in a text editor, I cannot see the problem, and the file loads correctly in Excel.
Here is the file: search.worldbank.org/api/projects/all.csv (from the Data & Resources
tab here: https://datacatalog.worldbank.org/dataset/world-bank-projects-operations)
This file is a bit of a mess, but just focusing on the id
variable I am running into strange behavior.
Here is some code:
#download the file as 'api.csv' to the working directory
#load the file into R
wb <- read.csv('api.csv')
I expect wb$id
to be a vector of project IDs, each starting with P. However, this is not the case. Calling wb$id
gives a messy output. To identify an offending entry:
#search for strings with a space in them
grep(' ', wb$id)
[1] 5212 5214 5216 5241 5248 5253 5254 5255 5263 5288 5291 5293 5295 5296 5298 5303
#look at the id variable for one of these entries
wb[5212 , 1]
[1] "000,EA PE,EA PE,http://projects.worldbank.org/P125388/gpoba-w3-kenya-electricity?lang=en,,,Other Energy and Extractives!$!100!$!LZ,,,,,Other Energy and Extractives;Other Energy and Extractives,,,,,,Energy and Extractives;Energy and Extractives,Other urban development!$!100!$!74,,,,,,Corporate Advocacy Priorities;Corporate Advocacy Priorities;CAP,,,,,,,0000178145!$!West Pokot District!$!1.75!$!35.25!$!;0000178440!$!Wajir District!$!1.75!$!40.01667!$!;0000178837!$!Uasin Gishu!$!0.5!$!35.316669!$!;0000178914!$!Turkana District!$!3!$!35.5!$!;0000179068!$!Trans Nzoia District!$!1.045!$!34.979!$!;0000179380!$!Tharaka District!$!-0.10094!$!38.028831!$!;0000179585!$!Tana River District!$!-1.53333!$!39.416672!$!;0000180320!$!Siaya District!$!0.105!$!34.301998!$!;0000180782!$!Samburu District!$!1.33333!$!37.116669!$!;0000184742!$!Nairobi Province!$!-1.28333!$!36.833328!$!;0000185578!$!Murang'a District!$!-0.68400002!$!36.993999!$!;0000186298!$!Mombasa District!$!-4.02!$!39.666672!$!;0000186824!$!Meru Central District!$!0!$!37.518002!$!;0000187583!$!Marsabit District!$!2.96667!$!37.599998!$!;0000187895!$!Mandera District!$!3.3666699!$!40.700001!$!;0000189794!$!Laikipia District!$!0.33333001!$!36.76667!$!;0000190106!$!Kwale District!$!-4.1333299!$!39.200001!$!;0000191037!$!Kitui District!$!-1.835!$!38.451!$!;0000191242!$!Kisumu!$!-0.069!$!34.64!$!;0000191298!$!Kisii District!$!-0.75!$!34.833328!$!;0000191420!$!Kirinyaga District!$!-0.5!$!37.333328!$!;0000192064!$!Kilifi District!$!-3.563!$!39.644001!$!;0000192709!$!Kiambu District!$!-1.09!$!36.699001!$!;0000192898!$!Kericho District!$!-0.273!$!35.382999!$!;0000195271!$!Kakamega District!$!0.33399999!$!34.797001!$!;0000196228!$!Isiolo District!$!0.98333001!$!38.533329!$!;0000197744!$!Garissa District!$!-0.17200001!$!40.041!$!;0000198474!$!Embu District!$!-0.42500001!$!37.530998!$!;0000199987!$!Busia District!$!0.34999999!$!34.169998!$!;0000200066!$!Bungoma District!$!0.66000003!$!34.639!$!;0000200573!$!Baringo 
District!$!0.66667002!$!36!$!;0007603036!$!Nyandarua District!$!-0.34400001!$!36.5!$!;0007667638!$!Vihiga District!$!0.072!$!34.712!$!;0007667643!$!Lamu!$!-2.28!$!40.900002!$!;0007667644!$!Machakos District!$!-1.2819999!$!37.408001!$!;0007667645!$!Makueni District!$!-2.2219999!$!37.872002!$!;0007667646!$!Marakwet District!$!0.99000001!$!35.549999!$!;0007667652!$!Taita Taveta District!$!-3.4000001!$!38.369999!$!;0007667657!$!Kajiado District!$!-2.131!$!36.877998!$!;0007667661!$!Nyeri District!$!-0.41999999!$!36.950001!$!;0007667665!$!Homa Bay District!$!-0.66600001!$!34.480999!$!;0007667666!$!Bomet District!$!-0.79000002!$!35.349998!$!;0007667678!$!Migori District!$!-0.98199999!$!34.409!$!;0007668899!$!Keiyo!$!0.47753999!$!35.558891!$!;0007668902!$!Nakuru District!$!-0.48401999!$!36.171261!$!;0007668904!$!Narok District!$!-1.24076!$!35.7356!$!;0007806857!$!Nyamira District!$!-0.75!$!35!$!;0008051212!$!Nandi South District!$!0.055!$!35.193001!$!,0000178145;0000178440;0000178837;0000178914;0000179068;0000179380;0000179585;0000180320;0000180782;0000184742;0000185578;0000186298;0000186824;0000187583;0000187895;0000189794;0000190106;0000191037;0000191242;0000191298;0000191420;0000192064;0000192709;0000192898;0000195271;0000196228;0000197744;0000198474;0000199987;0000200066;0000200573;0007603036;0007667638;0007667643;0007667644;0007667645;0007667646;0007667652;0007667657;0007667661;0007667665;0007667666;0007667678;0007668899;0007668902;0007668904;0007806857;0008051212,West Pokot District;Wajir District;Uasin Gishu;Turkana District;Trans Nzoia District;Tharaka District;Tana River District;Siaya District;Samburu District;Nairobi Province;Murang'a District;Mombasa District;Meru Central District;Marsabit District;Mandera District;Laikipia District;Kwale District;Kitui District;Kisumu;Kisii District;Kirinyaga District;Kilifi District;Kiambu District;Kericho District;Kakamega District;Isiolo District;Garissa District;Embu District;Busia District;Bungoma District;Baringo 
District;Nyandarua District;Vihiga District;Lamu;Machakos District;Makueni District;Marakwet District;Taita Taveta District;Kajiado District;Nyeri District;Homa Bay District;Bomet District;Migori District;Keiyo;Nakuru District;Narok District;Nyamira District;Nandi South District,1.75;1.75;0.5;3;1.045;-0.10094;-1.53333;0.105;1.33333;-1.28333;-0.68400002;-4.02;0;2.96667;3.3666699;0.33333001;-4.1333299;-1.835;-0.069;-0.75;-0.5;-3.563;-1.09;-0.273;0.33399999;0.98333001;-0.17200001;-0.42500001;0.34999999;0.66000003;0.66667002;-0.34400001;0.072;-2.28;-1.2819999;-2.2219999;0.99000001;-3.4000001;-2.131;-0.41999999;-0.66600001;-0.79000002;-0.98199999;0.47753999;-0.48401999;-1.24076;-0.75;0.055,35.25;40.01667;35.316669;35.5;34.979;38.028831;39.416672;34.301998;37.116669;36.833328;36.993999;39.666672;37.518002;37.599998;40.700001;36.76667;39.200001;38.451;34.64;34.833328;37.333328;39.644001;36.699001;35.382999;34.797001;38.533329;40.041;37.530998;34.169998;34.639;36;36.5;34.712;40.900002;37.408001;37.872002;35.549999;38.369999;36.877998;36.950001;34.480999;35.349998;34.409;35.558891;36.171261;35.7356;35;35.193001,,\nP120112,South Asia,Republic of India;Republic of India,RE,Technical Assistance Loan,IN,C,N,L,Closed,Closed,Citywide Slum Upgrading Plan for the Heritage City of Agra,2010-02-24T00:00:00Z,February,2014-04-30T00:00:00Z,500"
If I call View(wb)
and scroll to line 5211
, the problem shows up there in the id column.
This is confusing on multiple fronts:
First, this looks like two entries (one referencing Kenya and one referencing India). They appear to be separated by a newline character (\n) that has not been treated as a record break.
Notably, if I open api.csv
in a text editor, line 5213
(because the first line is the header) looks fine:
P116774,Europe and Central Asia,Bosnia and Herzegovina;Bosnia and Herzegovina,PE,Specific Investment Loan,IN,C,N,L,Closed,Closed,Social Safety Nets & Employment Support Project,2010-02-25T00:00:00Z,February,2015-10-31T00:00:00Z,"15,000,000","0","15,000,000","15,000,000","0",SIA A,IISTIES,http://projects.worldbank.org/P116774/social-safety-nets-employment-support-project?lang=en,,,Social Protection!$!66!$!SA,Public Administration - Social Protection!$!34!$!SG,,,,Social Protection;Social Protection;Public Administration - Social Protection,,,,,,Social Protection;Social Protection;Social Protection,Social Safety Nets/Social Assistance & Social Care Services!$!67!$!54,Improving labor markets!$!33!$!51,,,,,Corporate Advocacy Priorities;Corporate Advocacy Priorities;CAP;Corporate Advocacy Priorities;CAP,IDA47040;IDA47040,,,,,,0003277605!$!Bosnia and Herzegovina!$!44.25!$!17.83333!$!BA,0003277605,Bosnia and Herzegovina,44.25,17.83333,BA,
The correct delimiter appears after P116774
and it's the same delimiter used in the rest of the file.
That said, this entry clearly refers to a project in Bosnia and Herzegovina - not Kenya or India.
I was wondering if the problem lies with the end of the previous line (5212
when read in a text editor). That line notably includes a string of commas (,,,,,,,,,,,,,).
This seems like it could be a problem, but these are just meant to be empty entries and so should be read correctly. Notably, line 5209
(in the text editor) ends in the same way, but the following entry renders correctly in R. Further, none of the entries adjacent to the problematic entry refer to projects in Kenya or India.
So I cannot work out what is going on.
Notably, if I open this with Excel, the id column renders as expected - see row 5213 (first row is the header).
The whole file is a mess to deal with, but this particular issue seems to lie with the way R is importing the .csv.
Any ideas how to fix this?
Upvotes: 1
Views: 100
Reputation: 146110
With irregular CSVs, it's often worth trying alternative read functions. data.table::fread
and readr::read_csv
are both good candidates.
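Before switching readers, it can help to find where records are being merged. The sketch below is only an illustration: it builds a tiny hypothetical file in which a quote opened on one line is not closed until the next, which is an assumed cause and not a confirmed diagnosis of the World Bank file. It then uses base R's count.fields() to flag the affected line and compares row counts across readers. The names tmp, n_fields, suspect_lines, wb_base and count_rows are made up for this example.

```r
# Hypothetical stand-in for 'api.csv': the quote opened on line 3 is not closed
# until line 4, so two records run together. This is an assumed cause, not a
# confirmed diagnosis of the real file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'id,region,amount',
  'P001,Africa,"15,000,000"',
  'P002,"East Africa,100',
  'P003,South Asia,500"',
  'P004,Europe,"0"'
), tmp)

# Base R diagnostic: count.fields() returns NA for a line whose quoted field
# continues onto the next line, which points at where records merge.
n_fields <- count.fields(tmp, sep = ",", quote = "\"")
suspect_lines <- which(is.na(n_fields))
print(suspect_lines)

# read.csv() keeps the line break inside the quotes as part of the field,
# so the data frame ends up with fewer rows than the file has data lines.
wb_base <- read.csv(tmp)
print(nrow(wb_base))

# Try the alternative readers only if they are installed; a reader that
# fails on this input yields NA instead of stopping the script.
count_rows <- function(reader) {
  tryCatch(nrow(suppressMessages(reader(tmp))), error = function(e) NA_integer_)
}
if (requireNamespace("data.table", quietly = TRUE)) {
  print(count_rows(data.table::fread))
}
if (requireNamespace("readr", quietly = TRUE)) {
  print(count_rows(readr::read_csv))
}
```

To apply the same check to the real file, point count.fields() and the readers at 'api.csv' instead of tmp and inspect the lines listed in suspect_lines in a text editor. If the row counts differ between readers, that suggests they disagree about quoting somewhere in the file.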
Upvotes: 1