Reputation: 519
I received a dataset to analyze, the first rows are:
day;lang;nil;longitude;latitude
0;it;5;918.926.528;4.545.545.373
0;ar;28;92.235.078;4.545.206.021
0;es;65;91.225.254;45.497.917
0;it;21;921.481.788;4.547.046.575
0;it;50;915.995;45.450.124
0;ar;28;922.355.903;4.545.202.558
0;ar;28;922.362.504;454.523.043
0;ar;28;922.318.987;4.545.211.746
The problem is the following: the columns of longitude and latitude are not well formatted in the sense that, since I know where these data are taken from, Longitude should be "9,..." and Latitude should be "45,...."
As you can see also the number of digits in each record of longitude and latitude is different!
Any idea on how to solve this (possibly in R)?
Upvotes: 3
Views: 101
Reputation: 12937
This is what you can do using base
R:
sep <- "."
# remove dots from latitude and longitude
dt$latitude <- gsub("\\.","",dt$latitude)
dt$longitude <- gsub("\\.","",dt$longitude)
# insert your sep at your desired position
dt$longitude <- as.numeric(paste0(substring(dt$longitude,1,1), sep, substring(dt$longitude,2)))
dt$latitude <- as.numeric(paste0(substring(dt$latitude,1,2), sep, substring(dt$latitude,3)))
# day lang nil longitude latitude
# 1 0 it 5 9.189265 45.45545
# 2 0 ar 28 9.223508 45.45206
# 3 0 es 65 9.122525 45.49792
# 4 0 it 21 9.214818 45.47047
# 5 0 it 50 9.159950 45.45012
# 6 0 ar 28 9.223559 45.45203
# 7 0 ar 28 9.223625 45.45230
# 8 0 ar 28 9.223190 45.45212
DATA
dt <- read.table(text = "day;lang;nil;longitude;latitude
0;it;5;918.926.528;4.545.545.373
0;ar;28;92.235.078;4.545.206.021
0;es;65;91.225.254;45.497.917
0;it;21;921.481.788;4.547.046.575
0;it;50;915.995;45.450.124
0;ar;28;922.355.903;4.545.202.558
0;ar;28;922.362.504;454.523.043
0;ar;28;922.318.987;4.545.211.746", header=T,sep=";")
Upvotes: 0
Reputation: 78792
If you are "stuck" with this data, then this might help:
library(dplyr)
df <- read.table(text="day;lang;nil;longitude;latitude
0;it;5;918.926.528;4.545.545.373
0;ar;28;92.235.078;4.545.206.021
0;es;65;91.225.254;45.497.917
0;it;21;921.481.788;4.547.046.575
0;it;50;915.995;45.450.124
0;ar;28;922.355.903;4.545.202.558
0;ar;28;922.362.504;454.523.043
0;ar;28;922.318.987;4.545.211.746",
header=TRUE, sep=";", stringsAsFactors=FALSE)
cleanup <- function(x, n) {
x <- gsub("\\.", "", x)
x <- sub(sprintf("^%s", n), sprintf("%s\\.", n), x)
as.numeric(x)
}
df %>%
mutate(longitude=cleanup(longitude, 9),
latitude=cleanup(latitude, 45))
## day lang nil longitude latitude
## 1 0 it 5 9.189265 45.45545
## 2 0 ar 28 9.223508 45.45206
## 3 0 es 65 9.122525 45.49792
## 4 0 it 21 9.214818 45.47047
## 5 0 it 50 9.159950 45.45012
## 6 0 ar 28 9.223559 45.45203
## 7 0 ar 28 9.223625 45.45230
## 8 0 ar 28 9.223190 45.45212
Upvotes: 3