Reputation: 2759
Can someone let me know why the regular expression
df = df2.withColumn("extracted", F.regexp_extract("title", "[Pp]ython", 0))
can find the pattern 'Python' or 'python' in the following column called title
title
A fast PostgreSQL client library for Python: 3x faster than psycopg2
A project template for data science in Python
A simple python framework to build/train LUIS models
An Introduction to Stock Market Data Analysis with Python (Part 1)
Asynchronous Python
Cubr A Rubiks Cube Solver Written in Python and using Webcam Input (2013)
Python 4 Kids: Python for Kids: Python 3 Project 10
But the same regular expression can't find the pattern 'Python' or 'python' in the following
title
Python Core Development Sprint 2016: 3.6 and beyond
Hypothesis.works articles: 3.5.0 and 3.5.1 Releases of Hypothesis for Python
Total pip packages downloaded, separated by Python versions (June August 2016)
PEP 530: Asynchronous Comprehensions in Python 3.6
Python 2.7 still reigns supreme in pip installs
CheckiO games for Python and JavaScript coders. ClassRoom support is included
VR Zero, Virtual Reality on the RaspberryPi, in Python
Thanks
Upvotes: 3
Views: 240
Reputation: 26676
Use the case-insensitive regex flag:
(?i)
Placed at the start of the pattern, it turns ignore-case (case-insensitive) mode ON for everything that follows.
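To illustrate the difference between a character class and the (?i) flag, here is a minimal sketch using Python's built-in re module rather than Spark. Spark's regexp_extract uses Java regular expressions, but a leading (?i) means the same thing in both flavours. The helper name extract and the all-caps third title are hypothetical, added only for illustration; the helper returns an empty string on no match, which is also what regexp_extract does.

```python
import re

titles = [
    "Python Core Development Sprint 2016: 3.6 and beyond",
    "A simple python framework to build/train LUIS models",
    # Hypothetical all-caps variant, not from the original data
    "CheckiO games for PYTHON and JavaScript coders",
]


def extract(pattern, text):
    """Return the first full match, or '' when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


# [Pp]ython only allows the first letter to vary in case
char_class = [extract(r"[Pp]ython", t) for t in titles]

# (?i) makes every letter of the pattern case-insensitive
ignore_case = [extract(r"(?i)python", t) for t in titles]

print(char_class)   # ['Python', 'python', '']
print(ignore_case)  # ['Python', 'python', 'PYTHON']
```

With (?i) in place the character class is no longer needed, so (?i)python and (?i)[P]ython behave the same.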
Data
data=[
(1,"Python Core Development Sprint 2016: 3.6 and beyond"),
(2,"Hypothesis.works articles: 3.5.0 and 3.5.1 Releases of Hypothesis for Python"),
(3,"CheckiO games for python and JavaScript coders. ClassRoom support is included")
]
df=spark.createDataFrame(data, ['id','title'])
df.show(truncate=False)
Solution
df.withColumn('extract', F.regexp_extract(F.col('title'),'(?i)[P]ython',0)).show()
Outcome
+---+--------------------+-------+
| id|               title|extract|
+---+--------------------+-------+
|  1|Python Core Devel...| Python|
|  2|Hypothesis.works ...| Python|
|  3|CheckiO games for...| python|
+---+--------------------+-------+
Upvotes: 3