Reputation: 25
This is the portion of my result :
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
return:
1995
2006
2013
2009
1952
2017
1957
1453<--
1871<--
<--this part for last title is empty and supposed to be empty too
As you can see from the above,last 2 title is given wrong result.
This is my code:
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract,col
bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF=movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
I am trying to get the year portion of the title string.
Upvotes: 0
Views: 928
Reputation: 10406
Your regex can only work for the first line. \(\d{4}\)
tries to match a (
, 4 digits and a )
. For the first line you have (1995)
which is alright. The other lines do not contain that pattern.
In your situation, we can use lookbehind and lookahead patterns to detect dates within brackets. (?<=\()
means an open bracket before. (?=–|(–)|\))
means a closing bracket after, or –
or –
which is the original character that was misencoded. Once you have covered the date in between brackets, you can cover dates that are at the end of the string without brackets: \d{4}$
.
import pyspark.sql.functions as F
bracket_regexp = "((?<=\()\d{4}(?=–|(–)|\)))"
movies_DF\
.withColumn('uu', regexp_extract("title", bracket_regex + "|(\d{4}$)", 0))\
.show(truncate=False)
+------------------------------------------------------+-------------+
|title |yearOfRelease|
+------------------------------------------------------+-------------+
|Grumpier Old Men (1995) |1995 |
|Death Note: Desu nôto (2006–2007) |2006 |
|Irwin & Fran 2013 |2013 |
|9500 Liberty (2009) |2009 |
|test 1234 test 4567 |4567 |
|Captive Women (1000 Years from Now) (3000 A.D.) (1952)|1952 |
|The Garden of Afflictions 2017 |2017 |
|The Naked Truth (1957) (Your Past Is Showing) |1957 |
|Conquest 1453 (Fetih 1453) (2012) |2012 |
|Commune, La (Paris, 1871) (2000) |2000 |
|1013 Briar Lane | |
+------------------------------------------------------+-------------+
Also you do not need to prefix the string with r
when you pass a regex to a spark function.
Upvotes: 0
Reputation: 521457
Empirically, the following regex pattern seems to be working:
(?<=[( ])\d{4}(?=\S*\)|$)
Here is a working regex demo.
Updated PySpark code:
bracket_regexp = "((?<=[( ])\d{4}(?=\S*\)|$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
The regex pattern works by matching:
(?<=[( ])
assert that what precedes is (
or a space\d{4}
match a 4 digit year(?=\S*\)|$)
assert that )
, possibly prefaced by non whitespace, follows
OR the end of the string followsUpvotes: 1
Reputation: 2696
Here is a regexp that would work:
df = df.withColumn("year", F.regexp_extract("title", "(?:[\s\(])(\d{4})(?:[–\)])?", 1))
Definitely overkill for the examples you provide, but I want to avoid capturing e.g. other numbers in the titles. Also, your regexp does not work because not all years are surrounding by brackets in your examples and sometimes you have non-numeric characters inside the brackets,.
Upvotes: 0
Reputation: 19590
You can try using the following regex: r'(?<=\()(\d+)(?=\))'
, which is inspired by this excellent answer.
For example:
movies_DF = movies_DF.withColumn('uu', regexp_extract(col("title"), r'(?<=\()(\d+)(?=\))',1))
+------------------------------------------------------------+----+
|title |uu |
+------------------------------------------------------------+----+
|Grumpier Old Men (1995) |1995|
|Happy Anniversary (1959) |1959|
|Paths (2017) |2017|
|The Three Amigos - Outrageous! (2003) |2003|
|L'obsession de l'or (1906) |1906|
|Babe Ruth Story, The (1948) |1948|
|11'0901 - September 11 (2002) |2002|
|Blood Trails (2006) |2006|
|Return to the 36th Chamber (Shao Lin da peng da shi) (1980) |1980|
|Off and Running (2009) |2009|
+------------------------------------------------------------+----+
Upvotes: 1