Reputation: 135
I am a novice at coding. I generally use R (stringr) for this kind of task, but I have started learning Python's syntax as well.
I have a data frame with one column, generated from an imported Excel file. The values in this column contain uppercase and lowercase characters, symbols and numbers.
I would like to add a second column to the data frame containing only the words from the first column that match a regex pattern.
import pandas as pd
df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."], columns=['Test'])
df
Now, to extract what I want (words in uppercase), in R I would generally use:
df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)
which returns the matches of the regular expression for each data frame row:
* THIS IS A TEST
* THIS IS A
* TESTING T TEST
I couldn't find a similar solution for Python; the closest I've got is the following:
df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)
Unfortunately this does not work, as it returns only the first match in each row rather than all matches of the regex. I've tried several strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").
How can I extract the information I want with Python?
Thanks!
Upvotes: 1
Views: 577
Reputation: 12714
You are on the right track with the pattern. This solution uses a regular expression together with join and map.
import re
df['Name'] = df['Test'].map(lambda x: ' '.join(re.findall(r"\b[A-Z\s]+\b", x)))
Result:
                             Test            Name
0     THIS IS A TEST 123123. s.m.  THIS IS A TEST
1  THIS IS A Test test 123 .s.c.e       THIS IS A
2          TESTING T'TEST 123 da.  TESTING T TEST
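One thing to watch for: because the character class above includes `\s`, the matches can contain spaces, so the joined strings may carry trailing whitespace that the printed frame does not show. A minimal sketch of an alternative, assuming the question's original pattern (without `\s`) is what is wanted, uses the vectorised `str.findall` and `str.join` accessors, which is probably the closest analogue to `str_extract_all`:

```python
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# str.findall returns a list of all matches for each row;
# str.join then concatenates each list with a single space.
df["Name"] = df["Test"].str.findall(r"\b[A-Z]+\b").str.join(" ")

print(df["Name"].tolist())
```

Rows without any match end up as an empty string rather than NaN, which may or may not be what you want.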
Upvotes: 1
Reputation: 1570
If I understand correctly, you can try:
df["Name"] = (df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
              .unstack().fillna('').apply(' '.join, axis=1))
[EDIT]: Here is a shorter version I found by looking at the documentation:
df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, 1)
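For context, the TypeError in the question most likely arises because `str.extractall` returns a DataFrame indexed by (original row, match number), which cannot be assigned directly as a column. The `unstack` approach works around that, but filling the gaps with empty strings seems to leave trailing spaces on rows with fewer matches. A sketch of a variant that groups on the original row index instead (the intermediate names `matches` and `names` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# One row per match, with a MultiIndex of (original row, match number).
matches = df["Test"].str.extractall(r"(\b[A-Z]+\b)")

# Column 0 holds the single capture group; group by the original row
# index (level 0) and join each row's matches with a space.
names = matches[0].groupby(level=0).agg(" ".join)

# Assignment aligns on the index; rows with no match become NaN.
df["Name"] = names

print(df["Name"].tolist())
```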
Upvotes: 1