Reputation: 555
I have crawled some car data and now I want to clean it so that I can work with it. The data frame looks like this:
> head(cars_clean)
car_name car_prize ps km kraftstoff baujahr
1 Volkswagen Lupo \n€ 399,-\n \n37 kW (50 PS)\n \n215.000 km\n \nBenzin\n \n06/2004\n
2 Opel Corsa \n€ 450,-\n \n40 kW (54 PS)\n \n163.799 km\n \nBenzin\n \n01/2001\n
3 Renault Megane \n€ 490,-\n \n72 kW (98 PS)\n \n184.400 km\n \nBenzin\n \n07/2004\n
4 Audi A3 \n€ 490,-\n \n92 kW (125 PS)\n \n222.000 km\n \nBenzin\n \n10/1999\n
5 Opel Corsa \n€ 499,-\n \n55 kW (75 PS)\n \n370.000 km\n \nDiesel\n \n03/2003\n
6 Ford Fiesta \n€ 499,-\n \n55 kW (75 PS)\n \n189.137 km\n \nBenzin\n \n07/2000\n
Now I want to clean, for example, the ps column:
> cars_clean$ps
[1] "\n37 kW (50 PS)\n" "\n40 kW (54 PS)\n" "\n72 kW (98 PS)\n"
[4] "\n92 kW (125 PS)\n" "\n55 kW (75 PS)\n" "\n55 kW (75 PS)\n"
[7] "\n96 kW (131 PS)\n" "\n55 kW (75 PS)\n" "\n90 kW (122 PS)\n"
[10] "\n98 kW (133 PS)\n" "\n74 kW (101 PS)\n" "\n75 kW (102 PS)\n"
From this I only want the PS value inside the brackets, so "50" for the first value. How can I do this?
For the "car_prize" column I tried the following, which worked. But this solution doesn't work for the "ps" column:
clean_prize <- parse_number(cars_clean$car_prize)
This line gave me just the number in the "car_prize" column.
Thanks for your help! :)
Edit: I also want to convert the column "baujahr" (which represents the month and year the car was built) to a date format. Just "as.Date(cars_clean$baujahr)" didn't work...
Upvotes: 0
Views: 69
Reputation: 887108
We can remove everything before the opening bracket and then use parse_number:
library(stringr)
x <- c("\n37 kW (50 PS)\n","\n40 kW (54 PS)\n","\n72 kW (98 PS)\n")
readr::parse_number(str_remove(x, '[^(]*'))
#[1] 50 54 98
Upvotes: 0
Reputation: 388982
Base R method using sub:
x <- c("\n37 kW (50 PS)\n","\n40 kW (54 PS)\n","\n72 kW (98 PS)\n")
as.numeric(sub('.*?(\\d+)\\sPS\\).*', '\\1', x))
#[1] 50 54 98
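For the baujahr column mentioned in the question's edit, a similar base R sketch might work. It assumes the values look like "\n06/2004\n" (month/year wrapped in whitespace) and that placing each car on the first day of its month is acceptable, since as.Date needs a complete date. The vector name baujahr and the fixed day "01" are illustrative assumptions, not part of the original data:

```r
# Sample values in the shape shown in the question (assumed format: "\nMM/YYYY\n")
baujahr <- c("\n06/2004\n", "\n01/2001\n", "\n07/2004\n")

# Strip the surrounding whitespace, then prepend an assumed day "01"
# so that as.Date() receives a complete day/month/year string
baujahr_date <- as.Date(paste0("01/", trimws(baujahr)), format = "%d/%m/%Y")
baujahr_date
#[1] "2004-06-01" "2001-01-01" "2004-07-01"
```

as.Date(cars_clean$baujahr) on its own most likely fails because the strings contain no day and do not match a default date format.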
Upvotes: 3
Reputation: 21400
You can extract via str_extract and lookaround:
library(stringr)
str_extract(x, "(?<=\\()\\d+(?= PS)")
[1] "50" "54" "98"
This picks out one or more digits that are preceded on the left by ( and followed on the right by " PS". Note that the result is a character vector.
Data:
x <- c("\n37 kW (50 PS)\n","\n40 kW (54 PS)\n","\n72 kW (98 PS)\n")
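If a numeric result is needed, or stringr is not available, the same lookaround pattern should also work in base R with perl = TRUE. This is only a sketch of an alternative, not the stringr approach itself; the variable names m and ps are illustrative:

```r
x <- c("\n37 kW (50 PS)\n", "\n40 kW (54 PS)\n", "\n72 kW (98 PS)\n")

# regexpr() locates the first match per element; perl = TRUE enables lookarounds
m <- regmatches(x, regexpr("(?<=\\()\\d+(?= PS)", x, perl = TRUE))

# Convert the extracted character values to numbers
ps <- as.numeric(m)
ps
#[1] 50 54 98
```

One caveat: regmatches drops elements without a match, so the result can be shorter than the input if some rows lack a "(.. PS)" part. str_extract returns NA for those rows instead, which is usually safer when assigning back to a data frame column.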
Upvotes: 3