Anita Hb

Reputation: 126

How to split a string in a data frame column on all occurrences of a specific pattern in Python

I have the following data frame containing many authors along with their affiliations. [Image: Data Frame_Before]

In the affiliation column, there is a pattern 'Department of ...,' that I need to split out for each author. Note that this pattern can occur more than once in a row (author). I need to split out every "Department of ...," occurrence for each author and store each one in a separate column or row assigned to that author. (I need to do it in Python.) The image below shows the expected result. [Image: Expected result]

I would greatly appreciate any help.

Upvotes: 3

Views: 126

Answers (2)

n1colas.m

Reputation: 3989

To make the separation and the subsequent assignment to new columns easier, you can use Series.str.extractall, which returns one row per match under a MultiIndex. Those rows can then be rearranged into columns with unstack.

Input used as data.csv

Author_ID,Affiliation
6504356384,"Department of Cell and Developmental Biology, University of Michigan, Ann Arbor, Ml 48109, United States, Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Ml 48109, United States"
57194644787,"Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, United States, Texas Children's Microbiome Center, Texas Children's Hospital, Houston, TX, United States, Department of Pathology, Texas Children's Ho:"
57194687826,"Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, ON N6A 2C1, Canada, Department of Computer Science, Faculty of Science, Western University, London, ON N6A 2C1, Canada, Depart"
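123456789,"Department of RegexTest, Address and Numbers, Department of RegexTest, Faculty of Patterns, Department of RegexTest, Department of RegexTest, City and Place"

import pandas as pd

df = pd.read_csv("data.csv")
print(df)

# Capture every "Department of ..." up to the next comma (non-greedy).
# The result has one row per match, indexed by (original row, match number).
dept_names = df["Affiliation"].str.extractall(r"(Department of .*?),")

# Pivot the match number into columns: Affiliation0, Affiliation1, ...
affdf = dept_names.unstack()[0].add_prefix("Affiliation")
affdf = affdf.rename_axis("", axis="columns")
# Re-attach the author IDs; insert aligns them on the row index.
affdf.insert(0, "Author_ID", df["Author_ID"])

print(affdf)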

Output from affdf

     Author_ID              Affiliation0              Affiliation1             Affiliation2             Affiliation3
0   6504356384  Department of Cell an...  Department of Computa...                      NaN                      NaN
1  57194644787  Department of Patholo...   Department of Pathology                      NaN                      NaN
2  57194687826  Department of Biochem...  Department of Compute...                      NaN                      NaN
3    123456789   Department of RegexTest   Department of RegexTest  Department of RegexTest  Department of RegexTest
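
Since the question also allows one row per department instead of one column, here is a minimal sketch of that variant. It reuses the same extractall pattern; the inline sample frame and the "Department" column name are made up for illustration, and rows without any match are dropped from the result.

```python
import pandas as pd

# Illustrative sample standing in for data.csv (values are invented).
df = pd.DataFrame({
    "Author_ID": [1, 2, 3],
    "Affiliation": [
        "Department of A, Univ X, Department of B, Univ Y",
        "Department of C, Univ Z",
        "Institute of D, Univ W",  # contains no "Department of ...," match
    ],
})

# One row per match, indexed by (original row, match number); single column 0.
matches = df["Affiliation"].str.extractall(r"(Department of .*?),")

long_df = (
    matches.rename(columns={0: "Department"})
    .reset_index(level="match", drop=True)  # keep only the original row index
    .join(df["Author_ID"])                  # align author IDs on that index
    [["Author_ID", "Department"]]
    .reset_index(drop=True)
)

print(long_df)
```

Each author then appears once per extracted department, which may be easier to group or count than the wide layout.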

Upvotes: 1

Prashanth Mariswamy

Reputation: 352

This can be done with the "re" module by searching for the pattern "(Department of .*?),".

Suggested snippet:

import re

# findall returns the captured group for every non-overlapping match.
print(re.findall(r"(Department of .*?),", "Department of Oncology, aadsf, afasdf, Department of Computer science, asf asfa, asfas, "))

Output : ['Department of Oncology', 'Department of Computer science']
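
To apply the same pattern to a data frame column rather than a single string, one option is to map the compiled pattern's findall over the column. This is only a sketch: the sample frame and the "Departments" column name are assumptions for illustration, not part of the question's data.

```python
import re

import pandas as pd

# Same pattern as above, compiled once for reuse.
DEPT_PATTERN = re.compile(r"(Department of .*?),")

# Illustrative sample; the first string is the one used in the snippet above.
df = pd.DataFrame({
    "Author_ID": [1, 2],
    "Affiliation": [
        "Department of Oncology, aadsf, afasdf, Department of Computer science, asf asfa, asfas, ",
        "No department here, only an address, ",
    ],
})

# Each cell becomes a list of matched departments (an empty list if none match).
df["Departments"] = df["Affiliation"].apply(DEPT_PATTERN.findall)

print(df[["Author_ID", "Departments"]])
```

The list-valued column can then be expanded into rows or columns as needed.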

Upvotes: 0
