Reputation: 181
I have a PySpark dataframe as below and need to create a new dataframe with a single column made up of all the 7-digit numbers in the original dataframe. The values are all strings. COLUMN1 should be ignored. Ignoring non-numbers and picking up single 7-digit numbers in COLUMN2 is fairly straightforward, but for the values that contain two separate 7-digit numbers, I'm having difficulty pulling them out individually. This needs to be automated and able to run on other similar dataframes. The numbers are always 7 digits and always begin with a '1'. Any tips?
+-------+-------------------+
|COLUMN1|            COLUMN2|
+-------+-------------------+
| Value1|          Something|
| Value2|    1057873 1057887|
| Value3|Something Something|
| Value4|               null|
| Value5|            1312039|
| Value6|    1463451 1463485|
| Value7|    Not In Database|
| Value8|    1617275 1617288|
+-------+-------------------+
The resulting dataframe should be as below:
+-------+
|Column1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
The responses are great, but unfortunately I'm using an older version of Spark that doesn't support them. I used the code below to solve the problem; it's a bit clunky, but it works.
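from pyspark.sql import functions as F

# Keep only COLUMN2, split each value on spaces, and explode to one token per row
new_df = df.select(df.COLUMN2)
new_df = new_df.withColumn('splits', F.split(new_df.COLUMN2, ' '))
new_df = new_df.select(F.explode(new_df.splits).alias('column1'))
# Raw string avoids an invalid-escape warning; anchors keep only tokens that are exactly '1' plus six digits
new_df = new_df.filter(new_df.column1.rlike(r'^1\d{6}$'))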
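One detail worth checking in a filter like this: as far as I know, rlike succeeds when the pattern is found anywhere in the string, so whether the pattern is anchored changes which tokens survive. Below is a minimal plain-Python sketch of the difference, using re.search as a stand-in for rlike. The token list is made up for illustration and is not taken from the dataframe above.

```python
import re

# Hypothetical tokens, as they might look after split + explode
tokens = ["1057873", "10578731", "Something", "1312039"]

# Unanchored: passes any token that contains 7 consecutive digits somewhere
unanchored = [t for t in tokens if re.search(r"\d{7}", t)]

# Anchored: passes only tokens that are exactly '1' followed by six digits
anchored = [t for t in tokens if re.search(r"^1\d{6}$", t)]

print(unanchored)  # ['1057873', '10578731', '1312039']
print(anchored)    # ['1057873', '1312039']
```

With data that really only contains 7-digit numbers the two behave the same, so the anchors mainly guard against unexpected longer tokens in other dataframes.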
Upvotes: 2
Views: 379
Reputation: 26676
I like @nky's answer and voted for it. As an alternative, you can also use the exists higher-order function, available in Spark 3.0+:
from pyspark.sql import functions as F
new = df.selectExpr("explode(split(COLUMN2,' ')) as COLUMN1").where(F.expr("exists(array(COLUMN1), element -> element rlike '([0-9]{7})')"))
new.show()
+-------+
|COLUMN1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
Upvotes: 3
Reputation: 75080
Here is an approach with higher-order functions for Spark 2.4+: split the column on spaces, filter the words that start with digits (0-9) and have length n (7), then explode:
n = 7
df.selectExpr(f"""explode(filter(split(COLUMN2,' '),x->
x rlike '^[0-9]+' and length(x)={n})) as COLUMN1""").show(truncate=False)
+-------+
|COLUMN1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
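For readers who want to see the split, filter, explode logic outside Spark, here is a rough plain-Python equivalent run on the question's COLUMN2 values. It is only a sketch: `extract_tokens` is a hypothetical helper name, and its predicate is deliberately stricter than the one above (all n characters must be digits, rather than a digit prefix plus a length check), so a token such as '1abcdef' would be rejected.

```python
import re

# COLUMN2 values from the question; None stands in for null
column2 = [
    "Something", "1057873 1057887", "Something Something", None,
    "1312039", "1463451 1463485", "Not In Database", "1617275 1617288",
]

def extract_tokens(value, n=7):
    """Split on spaces and keep tokens made of exactly n digits."""
    if value is None:  # a null row contributes no tokens
        return []
    return [tok for tok in value.split(" ") if re.fullmatch(rf"[0-9]{{{n}}}", tok)]

# Flattening the per-row lists plays the role of explode
result = [tok for value in column2 for tok in extract_tokens(value)]
print(result)
```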
Upvotes: 3
Reputation: 260455
IIUC, you could use a regex and pandas' str.extractall (this assumes df is a pandas DataFrame):
df2 = (df['COLUMN2'].str.extractall(r'(\b\d{7}\b)')[0]
.reset_index(drop=True).to_frame(name='COLUMN1')
)
output:
COLUMN1
0 1057873
1 1057887
2 1312039
3 1463451
4 1463485
5 1617275
6 1617288
regex:
( start capturing
\b word boundary
\d{7} 7 digits # or 1\d{6} for "1" + 6 digits
\b word boundary
) end capture
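The same pattern can be tried with Python's built-in re module, which helps show what the word boundaries do. This is a small sketch on made-up strings (not the question's data), using the stricter 1\d{6} variant mentioned in the comment above.

```python
import re

# '1' followed by six digits, delimited by word boundaries
pattern = re.compile(r"\b1\d{6}\b")

# Hypothetical inputs chosen to exercise the boundaries
samples = ["1057873 1057887", "1057873,1057887", "ID1057873", "10578731", None]

# findall returns every non-overlapping match; None is treated as "no matches"
found = [pattern.findall(s) if s is not None else [] for s in samples]

for s, f in zip(samples, found):
    print(repr(s), "->", f)
```

Because the match is on word boundaries rather than on spaces, numbers separated by commas are still picked up, while digits embedded in a longer word or a longer number are not.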
Upvotes: 2