Reputation: 8584
I have a number of strings from a 3rd party data source that are in various lengths that contain both underscore and spaces. Each portion of the string is important and I am trying to break it apart into various fields via python
. The string does not have special characters (\n
, \t
, etc.) - should just be spaces, underscores, and parentheses are used to break the data parts.
String | Year | State | ID | Sub ID | Extra1 | Extra2 |
---|---|---|---|---|---|---|
2022_UT_T1000_100 (Classification1 Classification2) | 2022 | UT | T1000 | 100 | Classification1 | Classification2 |
2021_TX_V999_005 (Classification1) | 2021 | TX | V999 | 005 | Classification1 | |
1999_GA_123456_7890 | 1999 | GA | 123456 | 7890 |
I could split the string by the underscore, then split the last field by a space but that seems error-prone. Regex
is likely the best approach.
I can match the year using this: ^[1-9]\d{3,}$
. However, when trying to add an OR operator, it will only find the underscore.
Is there a way to extract this data when I know a pattern exists?
Upvotes: 2
Views: 794
Reputation: 521053
You could try using str.extract
with the regex pattern:
^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$
Note that this pattern assumes that there would only be three variants in the string column, namely no extras, one extra, or at most two extras. For arbitrary number of words in parentheses, we would need a different approach.
Python script:
df[["Year", "State", "ID", "Sub ID", "Extra1", "Extra2"]] = df["String"].str.extract(r'^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$')
Here is a regex demo showing that the pattern is working for all variants of your string column.
Upvotes: 2