mariob6
mariob6

Reputation: 519

Change numeric format, wrong format in dataset

I received a dataset to analyze, the first rows are:

day;lang;nil;longitude;latitude
0;it;5;918.926.528;4.545.545.373
0;ar;28;92.235.078;4.545.206.021
0;es;65;91.225.254;45.497.917
0;it;21;921.481.788;4.547.046.575
0;it;50;915.995;45.450.124
0;ar;28;922.355.903;4.545.202.558
0;ar;28;922.362.504;454.523.043
0;ar;28;922.318.987;4.545.211.746

The problem is the following: the columns of longitude and latitude are not well formatted in the sense that, since I know where these data are taken from, Longitude should be "9,..." and Latitude should be "45,...."

As you can see also the number of digits in each record of longitude and latitude is different!

Any idea on how to solve this (possibly in R)?

Upvotes: 3

Views: 101

Answers (2)

989
989

Reputation: 12937

This is what you can do using base R:

sep <- "."

# remove dots from latitude and longitude
dt$latitude <- gsub("\\.","",dt$latitude)
dt$longitude <- gsub("\\.","",dt$longitude)

# insert your sep at your desired position
dt$longitude <- as.numeric(paste0(substring(dt$longitude,1,1), sep, substring(dt$longitude,2)))
dt$latitude <- as.numeric(paste0(substring(dt$latitude,1,2), sep, substring(dt$latitude,3)))

  # day lang nil longitude latitude
# 1   0   it   5  9.189265 45.45545
# 2   0   ar  28  9.223508 45.45206
# 3   0   es  65  9.122525 45.49792
# 4   0   it  21  9.214818 45.47047
# 5   0   it  50  9.159950 45.45012
# 6   0   ar  28  9.223559 45.45203
# 7   0   ar  28  9.223625 45.45230
# 8   0   ar  28  9.223190 45.45212

DATA

dt <- read.table(text = "day;lang;nil;longitude;latitude
    0;it;5;918.926.528;4.545.545.373
                  0;ar;28;92.235.078;4.545.206.021
                  0;es;65;91.225.254;45.497.917
                  0;it;21;921.481.788;4.547.046.575
                  0;it;50;915.995;45.450.124
                  0;ar;28;922.355.903;4.545.202.558
                  0;ar;28;922.362.504;454.523.043
                  0;ar;28;922.318.987;4.545.211.746", header=T,sep=";")

Upvotes: 0

hrbrmstr
hrbrmstr

Reputation: 78792

If you are "stuck" with this data, then this might help:

library(dplyr)

df <- read.table(text="day;lang;nil;longitude;latitude
0;it;5;918.926.528;4.545.545.373
0;ar;28;92.235.078;4.545.206.021
0;es;65;91.225.254;45.497.917
0;it;21;921.481.788;4.547.046.575
0;it;50;915.995;45.450.124
0;ar;28;922.355.903;4.545.202.558
0;ar;28;922.362.504;454.523.043
0;ar;28;922.318.987;4.545.211.746", 
           header=TRUE, sep=";", stringsAsFactors=FALSE)

cleanup <- function(x, n) {
  x <- gsub("\\.", "", x)
  x <- sub(sprintf("^%s", n), sprintf("%s\\.", n), x)
  as.numeric(x)
}

df %>% 
  mutate(longitude=cleanup(longitude, 9),
         latitude=cleanup(latitude, 45))

##   day lang nil longitude latitude
## 1   0   it   5  9.189265 45.45545
## 2   0   ar  28  9.223508 45.45206
## 3   0   es  65  9.122525 45.49792
## 4   0   it  21  9.214818 45.47047
## 5   0   it  50  9.159950 45.45012
## 6   0   ar  28  9.223559 45.45203
## 7   0   ar  28  9.223625 45.45230
## 8   0   ar  28  9.223190 45.45212

Upvotes: 3

Related Questions