Varun S
Varun S

Reputation: 25

Problem in storing dataframe to csv format

I am currently using the below code to web scrape data and then store it in a CSV file.

from bs4 import BeautifulSoup 
import requests

url='https://www.business-standard.com/rss/companies-101.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')

news_items = []

for item in soup.findAll('item'):
    news_item = {}
    news_item['title'] = item.title.text
    news_item['excerpt'] = item.description.text

    print(item.link.text)
    s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')

    news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_item['Category'] = 'Company'
    news_items.append(news_item)

import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('company_data.csv',index = False)

When displaying the data frame, the results look fine as attached.enter image description here But while opening the csv file, the columns are not as expected. enter image description hereCan anyone tell me the reason.

Upvotes: 1

Views: 2321

Answers (2)

Serge Ballesta
Serge Ballesta

Reputation: 148870

A csv file is not an Excel file despite what Microsoft pretends but is a text file containing records and fields. The records are separated with sequences of \r\n and the fields are separated with a delimiter, normally the comma. Fields can contain new lines or delimiters provided they are enclosed in quotation marks.

But Excel is known to have a very poor csv handling module. More exactly it can read what it has written, or what is formatted the way it would have written it. To be locale friendly, MS folks decided that they will use the locale decimal separator (which is the comma , in some West European locales) and will use another separator (the semicolon ; when the comma is the decimal separator). As a result, using Excel to read CSV files produced by other tools is a nightmare with possible workarounds like changing the separator when writing the CSV file, but no clean way. LibreOffice is far behind MS Office for most features except CSV handling. So my advice is to avoid using Excel for CSV files but use LibreOffice Calc instead.

Upvotes: 0

RamWill
RamWill

Reputation: 318

The issue is that your data contains commas and the default seperator for to_csv is "," So each comma in your data set is treated as a seperate column.

If you perform df.to_excel('company_data.xlsx', index=False) you won't have this issue since it is not comma seperated.

Upvotes: 2

Related Questions