Maxl Gemeinderat

Reputation: 555

Data Cleaning in R - just get the numbers out of column

I have crawled some car data and now I want to clean it so I can work with it. The data frame looks like this:

> head(cars_clean)
         car_name   car_prize                 ps             km kraftstoff     baujahr
1 Volkswagen Lupo \n€ 399,-\n  \n37 kW (50 PS)\n \n215.000 km\n \nBenzin\n \n06/2004\n
2      Opel Corsa \n€ 450,-\n  \n40 kW (54 PS)\n \n163.799 km\n \nBenzin\n \n01/2001\n
3  Renault Megane \n€ 490,-\n  \n72 kW (98 PS)\n \n184.400 km\n \nBenzin\n \n07/2004\n
4         Audi A3 \n€ 490,-\n \n92 kW (125 PS)\n \n222.000 km\n \nBenzin\n \n10/1999\n
5      Opel Corsa \n€ 499,-\n  \n55 kW (75 PS)\n \n370.000 km\n \nDiesel\n \n03/2003\n
6     Ford Fiesta \n€ 499,-\n  \n55 kW (75 PS)\n \n189.137 km\n \nBenzin\n \n07/2000\n

Now I want to clean, for example, the ps column:

> cars_clean$ps
 [1] "\n37 kW (50 PS)\n"  "\n40 kW (54 PS)\n"  "\n72 kW (98 PS)\n"
 [4] "\n92 kW (125 PS)\n" "\n55 kW (75 PS)\n"  "\n55 kW (75 PS)\n"
 [7] "\n96 kW (131 PS)\n" "\n55 kW (75 PS)\n"  "\n90 kW (122 PS)\n"
[10] "\n98 kW (133 PS)\n" "\n74 kW (101 PS)\n" "\n75 kW (102 PS)\n"

From each entry I only want the PS value inside the brackets, so "50" for the first value. How can I do this?

For the "car_prize" column I tried the following, which worked. But this approach doesn't work for the "ps" column:

clean_prize <- parse_number(cars_clean$car_prize)

This line got me just the digits in the "car_prize" column.

Thanks for your help! :)

Edit: I also want to convert the column "baujahr" (which represents the year the car was built) to a date format. Just "as.Date(cars_clean$baujahr)" didn't work.

Upvotes: 0

Views: 69

Answers (3)

akrun

Reputation: 887108

We can remove everything before the opening bracket with str_remove and then parse the number:

library(stringr)
readr::parse_number(str_remove(x, '[^(]*'))
#[1] 50 54 98

data

x <- c("\n37 kW (50 PS)\n","\n40 kW (54 PS)\n","\n72 kW (98 PS)\n")

Upvotes: 0

Ronak Shah

Reputation: 388982

Base R method using sub:

x <- c("\n37 kW (50 PS)\n","\n40 kW (54 PS)\n","\n72 kW (98 PS)\n")
# The closing bracket must be escaped; an unescaped ")" is an unmatched parenthesis
as.numeric(sub('.*?(\\d+)\\sPS\\).*', '\\1', x))
#[1] 50 54 98

Upvotes: 3

Chris Ruehlemann

Reputation: 21400

You can extract the value with str_extract and lookarounds:

library(stringr)
str_extract(x, "(?<=\\()\\d+(?= PS)")
#[1] "50" "54" "98"

This picks out a run of one or more digits that is immediately preceded by "(" and immediately followed by " PS". Note that the result is a character vector.
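Since the question asks for numbers in the data frame, here is a minimal sketch of applying the same lookaround pattern to a column and converting the result to integers. It uses base R (regexpr with perl = TRUE) in case stringr is not available. The small cars_clean below is a stand-in for the real data, the last row is a made-up non-matching value added only to show the NA handling, and the column name ps_value is hypothetical.

```r
# Stand-in for the real cars_clean; the last row is invented to show NA handling
cars_clean <- data.frame(
  ps = c("\n37 kW (50 PS)\n", "\n40 kW (54 PS)\n",
         "\n92 kW (125 PS)\n", "\nk.A.\n"),
  stringsAsFactors = FALSE
)

# Same lookaround pattern as above; perl = TRUE enables lookbehind/lookahead
pattern <- "(?<=\\()\\d+(?= PS)"
m <- regexpr(pattern, cars_clean$ps, perl = TRUE)

# regmatches() drops non-matching rows, so start from NA and fill in the hits
ps_value <- rep(NA_integer_, nrow(cars_clean))
ps_value[m != -1] <- as.integer(regmatches(cars_clean$ps, m))

# ps_value is a hypothetical name for the new numeric column
cars_clean$ps_value <- ps_value
```

With stringr loaded, as.integer(str_extract(cars_clean$ps, pattern)) should give the same column, because str_extract returns NA where nothing matches.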

Data:

x <- c("\n37 kW (50 PS)\n","\n40 kW (54 PS)\n","\n72 kW (98 PS)\n")
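The edit in the question about "baujahr" is not covered above. as.Date needs a full date including a day, so a month/year string cannot be parsed directly. A minimal sketch, assuming every value has the form MM/YYYY wrapped in newlines and that pinning each date to the first of the month is acceptable (the vector baujahr below is copied from the question's sample rows):

```r
# Sample values taken from the question's baujahr column
baujahr <- c("\n06/2004\n", "\n01/2001\n", "\n10/1999\n")

# Strip the surrounding newlines, then prepend an assumed day of "01"
# so that as.Date() receives a complete day/month/year string
baujahr_date <- as.Date(paste0("01/", trimws(baujahr)), format = "%d/%m/%Y")

# Year as a plain number, if only the year is needed
baujahr_year <- as.integer(format(baujahr_date, "%Y"))
```

For the real data this would be applied to cars_clean$baujahr. Values that do not fit the assumed pattern come back as NA rather than raising an error, so it is worth checking anyNA(baujahr_date) afterwards.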

Upvotes: 3
