Reputation: 25
Imagine that I have that dataframe:
data = {'Script': ["create table table_name ( col_1 string , col_2 string , col_3 string ) row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties ( 'field.delim' ='\t' , 'serialization.format' ='\t' , 'serialization.null.format'='' ) stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' );"]}
df = pd.DataFrame(data)
Basically, the content of the column is DDL:
create table table_name
(
col_1 string
, col_2 string
, col_3 string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
(
'field.delim' ='\t'
, 'serialization.format' ='\t'
, 'serialization.null.format'=''
)
stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties
(
'parquet.compress'='snappy'
)
What I need to do is to remove all the content between the fist "(" and the word "location". Basically my expected output is the following one:
create table table_name
(
col_1 string
, col_2 string
, col_3 string
)
location 'hdfs://nameservice1/table_name'tblproperties
(
'parquet.compress'='snappy'
)
For that I am trying to use a regex approach:
df['DDL'] = df.Script.str.replace(r")", " } ").str.replace(r'<}^>location+>', "")
However, the result was not the desired one:
create table table_name
(
col_1 string
, col_2 string
, col_3 string
}
row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
(
'field.delim' ='\t'
, 'serialization.format' ='\t'
, 'serialization.null.format'='' } stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' }
;
What I am doing wrong? With my approach I am trying to extract between { and location...
Upvotes: 1
Views: 105
Reputation: 626794
You may use
df['DDL'] = df['Script'].str.replace(r"(?s)^([^)]*)\).*?\b(location)\b", r"\1\2")
See the regex demo
Details
(?s)
- an inline re.DOTALL
option making .
match line break chars^
- start of string([^)]*)
- Group 1 (\1
in the replacement pattern): any 0+ chars other than )
\)
- a )
char.*?
- any 0+ chars, as few as possible (*?
is a non-greedy quantifier)\b(location)\b
- Group 2 (\2
in the replacement pattern) capturing a whole word location
(\b
stand for word boundaries)Upvotes: 1