Python - Pandas - Remove content between the first occurrence of a character and a fix string

Question

Imagine that I have that dataframe:

data = {'Script': ["create table table_name ( col_1 string , col_2 string , col_3 string ) row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties ( 'field.delim' ='	' , 'serialization.format' ='	' , 'serialization.null.format'='' ) stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' );"]}
df = pd.DataFrame(data)

Basically, the content of the column is DDL:

create table table_name
  (
    col_1 string
  , col_2 string
  , col_3 string
  )
  row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
  (
    'field.delim'              ='	'
  , 'serialization.format'     ='	'
  , 'serialization.null.format'=''
  )
  stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties
  (
    'parquet.compress'='snappy'
  )

What I need to do is to remove all the content between the fist "(" and the word "location". Basically my expected output is the following one:

create table table_name
  (
    col_1 string
  , col_2 string
  , col_3 string
  )
  location 'hdfs://nameservice1/table_name'tblproperties
  (
    'parquet.compress'='snappy'
  )

For that I am trying to use a regex approach:

df['DDL'] = df.Script.str.replace(r")", " } ").str.replace(r'<}^>location+>', "")

However, the result was not the desired one:

create table table_name
  (
    col_1 string
  , col_2 string
  , col_3 string
  }
  row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
  (
    'field.delim'              ='	'
  , 'serialization.format'     ='	'
  , 'serialization.null.format'='' } stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' }
;

What I am doing wrong? With my approach I am trying to extract between { and location...

Wiktor Stribiżew · Accepted Answer

You may use

df['DDL'] = df['Script'].str.replace(r"(?s)^([^)]*)\).*?\b(location)\b", r"\1\2")

See the regex demo

Details

(?s) - an inline re.DOTALL option making . match line break chars
^ - start of string
([^)]*) - Group 1 (\1 in the replacement pattern): any 0+ chars other than )
\) - a ) char
.*? - any 0+ chars, as few as possible (*? is a non-greedy quantifier)
\b(location)\b - Group 2 (\2 in the replacement pattern) capturing a whole word location (\b stand for word boundaries)

Python - Pandas - Remove content between the first occurrence of a character and a fix string

Answers (1)

Related Questions