Reputation: 21
I have a pyspark dataframe read from a CSV file that has a value column which contains hexadecimal values.
| date | part | feature | value" |
|----------|-------|---------|--------------|
| 20190503 | par1 | feat2 | 0x0 |
| 20190503 | par1 | feat3 | 0x01 |
| 20190501 | par2 | feat4 | 0x0f32 |
| 20190501 | par5 | feat9 | 0x00 |
| 20190506 | par8 | feat2 | 0x00f45 |
| 20190507 | par1 | feat6 | 0x0e62300000 |
| 20190501 | par11 | feat3 | 0x000000000 |
| 20190501 | par21 | feat5 | 0x03efff |
| 20190501 | par3 | feat9 | 0x000 |
| 20190501 | par6 | feat5 | 0x000000 |
| 20190506 | par5 | feat8 | 0x034edc45 |
| 20190506 | par8 | feat1 | 0x00000 |
| 20190508 | par3 | feat6 | 0x00000000 |
| 20190503 | par4 | feat3 | 0x0c0deffe21 |
| 20190503 | par6 | feat4 | 0x0000000000 |
| 20190501 | par3 | feat6 | 0x0123fe |
| 20190501 | par7 | feat4 | 0x00000d0 |
The requirement is to remove rows whose value column contains values like 0x0, 0x00, 0x000, etc. that evaluate to decimal 0 (zero); the number of 0's after '0x' varies across the dataframe. I tried removing them through pattern matching, but wasn't successful.
myFile = sc.textFile("file.txt")
header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
myFile_header = myFile.filter(lambda l: "date" in l)
myFile_NoHeader = myFile.subtract(myFile_header)
myFile_df = myFile_NoHeader.map(lambda line: line.split(",")).toDF(schema)
## this is the pattern match I tried
result = myFile_df.withColumn('Test', regexp_extract(col('value'), '(0x)(0\1*\1*)',2 ))
result.show()
The other approach I used was using udf:
def convert_value(x):
    return int(x, 16)
Using this udf in pyspark gives me
ValueError: invalid literal for int() with base 16: value
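A minimal sketch of how such a helper would typically be registered and applied (assuming the header row has already been removed and the values fit into a long; hex_to_int is only an illustrative name):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# sketch: register the helper as a udf and keep only rows whose hex value is nonzero
hex_to_int = udf(lambda x: int(x, 16), LongType())
result = myFile_df.filter(hex_to_int(col('value')) != 0)
result.show()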
Upvotes: 2
Views: 1843
Reputation: 19445
I don't really understand your regular expression, but if you want to match all strings that consist of 0x followed by one or more zeros, you can use ^0x0+$. Filtering with a regular expression can be done with rlike, and the tilde (~) negates the match.
l = [('20190503', 'par1', 'feat2', '0x0'),
('20190503', 'par1', 'feat3', '0x01'),
('20190501', 'par2', 'feat4', '0x0f32'),
('20190501', 'par5', 'feat9', '0x00'),
('20190506', 'par8', 'feat2', '0x00f45'),
('20190507', 'par1', 'feat6', '0x0e62300000'),
('20190501', 'par11', 'feat3', '0x000000000'),
('20190501', 'par21', 'feat5', '0x03efff'),
('20190501', 'par3', 'feat9', '0x000'),
('20190501', 'par6', 'feat5', '0x000000'),
('20190506', 'par5', 'feat8', '0x034edc45'),
('20190506', 'par8', 'feat1', '0x00000'),
('20190508', 'par3', 'feat6', '0x00000000'),
('20190503', 'par4', 'feat3', '0x0c0deffe21'),
('20190503', 'par6', 'feat4', '0x0000000000'),
('20190501', 'par3', 'feat6', '0x0123fe'),
('20190501', 'par7', 'feat4', '0x00000d0')]
columns = ['date', 'part', 'feature', 'value']
df=spark.createDataFrame(l, columns)
expr = "^0x0+$"
df.filter(~ df["value"].rlike(expr)).show()
Output:
+--------+-----+-------+------------+
| date| part|feature| value|
+--------+-----+-------+------------+
|20190503| par1| feat3| 0x01|
|20190501| par2| feat4| 0x0f32|
|20190506| par8| feat2| 0x00f45|
|20190507| par1| feat6|0x0e62300000|
|20190501|par21| feat5| 0x03efff|
|20190506| par5| feat8| 0x034edc45|
|20190503| par4| feat3|0x0c0deffe21|
|20190501| par3| feat6| 0x0123fe|
|20190501| par7| feat4| 0x00000d0|
+--------+-----+-------+------------+
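If you would rather compare the numeric value than the string pattern, the same filter can be expressed with the built-in conv function (a sketch, assuming every value carries the 0x prefix and conv's precision covers your hex widths):

from pyspark.sql.functions import conv, regexp_replace

# strip the '0x' prefix, convert from base 16 to base 10 (as a string), and keep nonzero rows
df.filter(conv(regexp_replace(df["value"], "^0x", ""), 16, 10) != "0").show()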
Upvotes: 5