uchiha itachi

Reputation: 195

Regex to capture only a certain part of a string

Is there a universal regex to capture only the company names?

Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp

The output should be:

American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc

Note: The name of the company may contain symbols or numbers.

Upvotes: 0

Views: 185

Answers (4)

Pushpesh Kumar Rajwanshi

Reputation: 18357

You can use this regex:

[a-zA-Z]+(?:_[a-zA-Z]+)*$

The company names in your expected output consist of alphabetic words separated by underscores and running to the end of the string, so the regex above works for them.

Here, [a-zA-Z]+ matches the first alphabetic word of the company name, (?:_[a-zA-Z]+)* matches any further alphabetic words preceded by an underscore, and $ anchors the match at the end of the string.

Regex Demo

Python code:

import re

arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']

for s in arr:
    m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
    print(s, '-->', m.group())

Prints:

Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc

If all the names are in a single multi-line string, you can use re.findall with the multiline flag (?m) to list every company name:

import re

s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''

print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))

Prints:

['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']

Edit: As Chyngyz Akmatov pointed out, if the name can contain numbers or, in general, any symbol, then the following regex extracts the name correctly. It assumes the company name starts right after the four-digit year and an underscore.

(?<=\d{4}_).*$

Demo handling any character in company name
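As a minimal sketch of the lookbehind pattern above in Python: the `extract_company` helper, the `names` list, and the `no_year_here` input are illustrative and not part of the original answer. It assumes every valid input contains a four-digit year followed by an underscore.

```python
import re

# Fixed-width lookbehind: the match starts right after the first "<4 digits>_".
PATTERN = re.compile(r'(?<=\d{4}_).*$')

def extract_company(s):
    """Return the text after the year part, or None if no year is found."""
    m = PATTERN.search(s)
    return m.group() if m else None

# Hypothetical inputs: one with a digit in the name, one with the long
# "Quarter_3" prefix, and one with no year at all.
names = ['Q4_2018_Control4_Corp', 'Quarter_3_2018_Autodesk_Inc', 'no_year_here']
results = [extract_company(s) for s in names]
print(results)
```

Unlike the letters-only pattern, this keeps the digit in Control4_Corp, and the None check avoids an AttributeError on inputs that have no year part.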

Upvotes: 2

scrambler

Reputation: 771

Assuming the names contain only plain letters and sit at the end of each line:

grep -o '[A-Za-z][A-Za-z_]*$' names
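For comparison, a rough Python equivalent of the grep command, offered as a sketch only (the `lines` list is illustrative). It uses the same pattern and shows what happens when the letters-only assumption does not hold.

```python
import re

# Same pattern as the grep command: a letter, then letters or
# underscores up to the end of the line.
pattern = re.compile(r'[A-Za-z][A-Za-z_]*$')

lines = ['Q1_2016_Apple_Inc', 'Q4_2018_Control4_Corp']
matches = [pattern.search(line).group() for line in lines]
print(matches)
```

The second entry comes back as just Corp, because the digit in Control4 breaks the run of letters and underscores.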

Upvotes: 0

Austin

Reputation: 26039

You can also use this regex:

_\d+(?:_\d+)*_(.*)

Code:

import re

lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']

for x in lst:
    print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))

# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc

Upvotes: 0

Ajax1234

Reputation: 71451

You can use re.sub:

import re
# content is assumed to hold the newline-separated input strings
data = [re.sub(r'\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]

Output:

['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']

Upvotes: 0
