Reputation: 11
I am working with Hugging Face transformers (summarizers) and have become somewhat familiar with them. I am using the facebook/bart-large-cnn model for text summarization and am running the code below:
from transformers import pipeline
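# Name the model explicitly; without it the pipeline loads its default summarization checkpoint
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")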
text= "Good Morning team, I need a help in terms of one of the functions that needs to be written on the servers.. please let me know wen are you available.. Thanks , hgjhghjgjh, 193-6757-568"
# Note: min_length and max_length are counted in tokens, so these character-based fractions are only a rough guide
print(summarizer(text, min_length=int(0.1 * len(text)), max_length=int(0.2 * len(text)), do_sample=False))
My question is: how can I apply the same pretrained model to a column of my dataframe? My dataframe looks like this:
ID Text
1 some long text here...
2 some long text here...
3 some long text here...
.... and so on for 100K rows
Now I want to apply the pretrained model to the Text column to generate a new column, df['Summary_Text'], so that the resulting dataframe looks like this:
ID Text Summary_Text
1 some long text here... Text summary goes here...
2 some long text here... Text summary goes here...
3 some long text here... Text summary goes here...
How can I do this? Any quick help would be highly appreciated.
Upvotes: 1
Views: 3432
Reputation: 1
This is my code to iterate through the rows of column X in an Excel sheet and write each summary to another column Y. I hope it helps.
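from transformers import pipeline
import openpyxl

path = "workbook.xlsx"  # placeholder: path to your Excel file
wb = openpyxl.load_workbook(path, read_only=False)
ws = wb["sheet"]
bart_summarizer = pipeline("summarization")
# Read column 8 (H), rows 2 to 5, and write each summary to column 10 (J)
for row in ws.iter_rows(min_col=8, min_row=2, max_col=8, max_row=5):
    for cell in row:
        text_to_summarize = cell.value
        summary = bart_summarizer(text_to_summarize, min_length=10, max_length=100)
        # The pipeline returns a list of dicts; keep only the summary string
        ws.cell(row=cell.row, column=10).value = summary[0]["summary_text"]
wb.save(path)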
Upvotes: 0
Reputation: 432
I am working along the same lines, trying to summarize news articles. The model accepts either a single string or a list of strings. First convert your dataframe's 'Text' column to a list:
input_col = df['Text'].to_list()
Then feed it to your model:
from transformers import pipeline
summarizer = pipeline("summarization")
# min_length and max_length are token counts applied to every item; the values here are illustrative
res = summarizer(input_col, min_length=30, max_length=130, do_sample=False)
print(res[0]['summary_text'])
This returns a list, and the print call shows only its first summary. You can iterate over the list (res[0]['summary_text'], res[1]['summary_text'], and so on), collect the summaries, and add them back as a dataframe column:
df_res = []
for i in range(len(res)):
    df_res.append(res[i]['summary_text'])
df['Summary_Text'] = df_res
If your articles are long, pass truncation=True to the summarizer call (alongside min_length and the other parameters) so that inputs longer than the model's maximum input length are truncated.
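With 100K rows it may be more practical to send the column through the summarizer in chunks rather than as one giant list, so that progress can be tracked and a failure does not lose everything. The following is only a minimal sketch of that idea: summarize_in_chunks, chunk_size and fake_summarizer are hypothetical names introduced here, and the only assumption about the summarizer is the one used above, namely that it takes a list of strings and returns a list of dicts with a 'summary_text' key.

```python
from typing import Callable, Dict, List, Sequence


def summarize_in_chunks(
    texts: Sequence[str],
    summarizer: Callable[..., List[Dict[str, str]]],
    chunk_size: int = 16,
    **summarizer_kwargs,
) -> List[str]:
    """Summarize texts chunk by chunk and return the summaries in input order.

    `summarizer` is any callable that accepts a list of strings and returns
    a list of dicts with a 'summary_text' key (hypothetical helper, not part
    of the transformers library).
    """
    summaries: List[str] = []
    for start in range(0, len(texts), chunk_size):
        # Cast to str so non-string cells (e.g. numbers) do not break the call
        chunk = [str(t) for t in texts[start:start + chunk_size]]
        results = summarizer(chunk, **summarizer_kwargs)
        summaries.extend(r["summary_text"] for r in results)
    return summaries


def fake_summarizer(batch: List[str], **kwargs) -> List[Dict[str, str]]:
    """Stand-in for the real pipeline: keeps the first five words of each text."""
    return [{"summary_text": " ".join(t.split()[:5])} for t in batch]


if __name__ == "__main__":
    rows = [
        "some long text here that goes on and on",
        "another long text that needs a summary",
        "a third row of text",
    ]
    print(summarize_in_chunks(rows, fake_summarizer, chunk_size=2))
```

With the real pipeline, the call would look something like df['Summary_Text'] = summarize_in_chunks(df['Text'].to_list(), summarizer, truncation=True, min_length=30, max_length=130, do_sample=False), where the length values are again just illustrative.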
This will take a long time on a CPU. I am still looking for faster alternatives myself; for now, XLNet is a usable option for me. Hope this helps!
Upvotes: 2