Reputation: 126
I have the following data frame containing many authors along with their affiliations. [Image: Data Frame_Before]
In the Affiliation column there is a recurring pattern, 'Department of ...,'. This pattern can occur more than once in each row (author). I need to extract every 'Department of ...,' occurrence for each author and store it in a separate column or row assigned to that author, using Python. The image below shows the expected result. [Image: Expected result]
I would greatly appreciate any help.
Upvotes: 3
Views: 126
Reputation: 3989
To simplify the separation and the subsequent assignment to new columns, you can use str.extractall, which returns one row per match with a MultiIndex that can easily be rearranged into columns with unstack.
Input used as data.csv
Author_ID,Affiliation
6504356384,"Department of Cell and Developmental Biology, University of Michigan, Ann Arbor, Ml 48109, United States, Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Ml 48109, United States"
57194644787,"Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX, United States, Texas Children's Microbiome Center, Texas Children's Hospital, Houston, TX, United States, Department of Pathology, Texas Children's Ho:"
57194687826,"Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, ON N6A 2C1, Canada, Department of Computer Science, Faculty of Science, Western University, London, ON N6A 2C1, Canada, Depart"
123456789,"Department of RegexTest, Address and Numbers, Department of RegexTest, Faculty of Patterns, Department of RegexTest, Department of RegexTest, City and Place"
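import pandas as pd

df = pd.read_csv("data.csv")
print(df)

# One row per match, indexed by (original row, match number)
dept_names = df["Affiliation"].str.extractall(r"(Department of .*?),")
# Pivot the match number into columns: Affiliation0, Affiliation1, ...
affdf = dept_names.unstack()[0].add_prefix("Affiliation")
# Clear the "match" label left on the column axis
affdf = affdf.rename_axis("", axis="columns")
# Put the author IDs back as the first column (aligned on the row index)
affdf.insert(0, "Author_ID", df["Author_ID"])
print(affdf)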
Output from affdf:
   Author_ID    Affiliation0              Affiliation1              Affiliation2             Affiliation3
0  6504356384   Department of Cell an...  Department of Computa...  NaN                      NaN
1  57194644787  Department of Patholo...  Department of Pathology   NaN                      NaN
2  57194687826  Department of Biochem...  Department of Compute...  NaN                      NaN
3  123456789    Department of RegexTest   Department of RegexTest   Department of RegexTest  Department of RegexTest
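The question also allows one row per department instead of one column per department. A minimal sketch of that long layout follows, assuming the same regular expression; the two-row sample frame and the names long_df and Department are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Author_ID": [1, 2],
    "Affiliation": [
        "Department of A, Univ X, Department of B, Univ Y",
        "Department of C, Univ Z",
    ],
})

# extractall returns one row per match; the index is (original row, match)
matches = df["Affiliation"].str.extractall(r"(Department of .*?),")

long_df = (
    matches.rename(columns={0: "Department"})  # unnamed group -> column 0
    .reset_index(level="match")                # keep the original row as index
    .join(df["Author_ID"])                     # look up the author by row index
    [["Author_ID", "match", "Department"]]
    .reset_index(drop=True)
)
print(long_df)
```

Each author then appears once per matched department, which avoids the NaN padding of the wide layout when authors have different numbers of departments.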
Upvotes: 1
Reputation: 352
This can be done with the "re" module by searching for the pattern "(Department of .*?),".
Suggested snippet:
import re
re.findall(r"(Department of .*?),", "Department of Oncology, aadsf, afasdf, Department of Computer science, asf asfa, asfas, ")
Output: ['Department of Oncology', 'Department of Computer science']
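One caveat with that pattern is that it only matches when a comma follows the department name, so a department at the very end of the string is skipped. A possible variant, sketched here as an assumption rather than as part of the answer above, matches everything up to the next comma or the end of the string; the helper name find_departments and the sample string are hypothetical.

```python
import re

# Variant pattern (an assumption, not the answer's regex): take every
# character after "Department of " up to the next comma or end of string.
PATTERN = re.compile(r"Department of [^,]+")

def find_departments(affiliation):
    """Return all 'Department of ...' fragments found in one affiliation string."""
    return [match.strip() for match in PATTERN.findall(affiliation)]

sample = "Department of Oncology, aadsf, Department of Computer science"
departments = find_departments(sample)
print(departments)
# ['Department of Oncology', 'Department of Computer science']
```

With a data frame, the same helper could be applied per row, for example with df["Affiliation"].apply(find_departments).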
Upvotes: 0