yudhiesh
yudhiesh

Reputation: 6809

How to fix broken CSV file where column values are not formatted properly?

I have a dataframe that has a weird format that I am having difficulty formatting it to a desired format. I just need the columns first_name, last_name, domain, Email, Verification and status but am not sure how to remove it when it is in this format.

,first_name,last_name,domain,Email,Score,Verification status,Source 1,Source 2,Source 3,Source 4,Source 5,Source 6,Source 7,Source 8,Source 9,Source 10,Source 11,Source 12,Source 13,Source 14,Source 15,Source 16,Source 17,Source 18,Source 19,Source 20,Source 21,Source 22,Source 23,Source 24,Source 25,Source 26,Source 27,Source 28,Source 29,Source 30
25   138,Ethan,Gach,kotaku.com,[email protected]...                                                                                                                                                                                                                                                                                                             
79   535,Carla,Mozée,businessinsider.com,cmozee@bus...                                                                                                                                                                                                                                                                                                             
114  814,Neil,Irwin,nytimes.com,neil.irwin@nytimes....                     

Expected output:

  first_name last_name               domain                       Email Score Verification status
0      Carla     Mozée  businessinsider.com  [email protected]    99               valid

Sample of the dataframe as a dictionary:

{',first_name,last_name,domain,Email,Score,Verification status,Source 1,Source 2,Source 3,Source 4,Source 5,Source 6,Source 7,Source 8,Source 9,Source 10,Source 11,Source 12,Source 13,Source 14,Source 15,Source 16,Source 17,Source 18,Source 19,Source 20,Source 21,Source 22,Source 23,Source 24,Source 25,Source 26,Source 27,Source 28,Source 29,Source 30': {25: '138,Ethan,Gach,kotaku.com,[email protected],99,valid,http://kotaku.com/new-mod-lets-you-play-game-boy-advance-games-in-hd-on-a-1830834944,http://compete.kotaku.com/the-weekend-in-esports-boston-counter-strike-party-1822479309,http://kotaku.com/madden-nfl-21-will-be-the-first-in-the-series-on-steam-1844050894,http://kotaku.com/gogs-mysterious-curation-process-rejected-opus-magnum-1821843050,http://kotaku.com/a-dramatic-podcast-fills-in-the-gaps-of-fallout-76s-bro-1842064555,http://kotaku.com/electronic-arts-says-its-customers-just-arent-that-inte-1836862047,http://compete.kotaku.com/the-weekend-in-esports-counter-strike-pokemon-and-su-1821353758,http://compete.kotaku.com/brazilians-prove-unstoppable-at-australian-counter-stri-1794995886,http://compete.kotaku.com/street-fighter-pro-goes-looking-for-an-energy-drink-mid-1818489146,http://dustinhawkfarmily.blogspot.com/2016/09,http://compete.kotaku.com/virtus-pro-seizes-top-spot-in-dota-2-pro-circuit-1823310035,http://deniseuncensored.blogspot.com/2016/05,http://destinyisbae.com/deaf-destiny-player-petitions-bungie-for-captioning,http://destinyisbae.com/destiny-player-destroys-latest-raid-boss-all-by-himself,http://kinja.com/ethangach/following,http://kinja.com/ethangach/followers,http://cosplay.kotaku.com/police-mistake-fallout-cosplay-for-bomb-1794353299,http://cosplay.kotaku.com/tracer-and-emily-cosplay-celebrates-romantic-overwatch-1790505557,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  79: '535,Carla,Mozée,businessinsider.com,[email protected],99,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  114: '814,Neil,Irwin,nytimes.com,[email protected],96,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  435: '5308,Alyssa,Therrien,dailyhive.com,[email protected],96,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  478: '6007,Bruce,Kamich,thestreet.com,[email protected],97,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  494: '6255,Zach,Brendza,triblive.com,[email protected],96,valid,http://triblive.com/news/all-stories,http://triblive.com/author/zach-brendza,http://triblive.com/aande/movies-tv/mighty-ducks-fly-together-for-reunion-at-nhl-game-secret-project,http://triblive.com/business/technology/gillette-ad-brings-mixed-response-400k-youtube-dislikes,http://triblive.com/aande/movies-tv/taylor-swift-confirms-cats-role-on-instagram,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  518: '6719,Bernhard,Warner,fortune.com,[email protected],96,valid,http://asherradio.com/disease-recession-war-games-why-none-of-it-seems-to-count-as-bad-news-for-todays-investors,http://mimicnews.com/could-hong-kong-tensions-sink-the-global-stocks-rally,http://mainstreetalerts.com/2020/01/31/the-coronavirus-is-already-disrupting-the-global-supply-chain-starting-with-these-commodities,http://mainstreetalerts.com/2020/03/16/global-markets-tumble-despite-the-feds-best-efforts-to-restore-confidence,http://mainstreetalerts.com/2020/03/17/global-markets-remain-choppy-after-mondays-historic-plunge,http://isreally.com/2020/02/21/looking-more-probable-analysts-up-the-odds-of-a-markets-correction-as-coronavirus-fears-surge,http://isreally.com/tag/bear-market,http://isreally.com/tag/bull-sheet,http://isreally.com/tag/nasdaq,http://isreally.com/tag/sp-500,http://fortunemediakit.com/editorial-bios,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  534: '6965,Clarence,Page,chicagotribune.com,[email protected],98,valid,http://chicagotribune.com/columns/clarence-page/ct-page-police-trump-speech-brutality-perspec-0802-jm-20170801-story.html,http://latimes.com:443/newsletters/politics/ct-page-portland-attack-hate-trump-perspec-0531-jm-20170530-story.html,http://chicagotribune.com/columns/clarence-page/ct-column-truth-fake-news-literacy-page-20201126-rpzogqv2fvdnnmzh6x4py7oblu-story.html,http://chicagotribune.com/columns/clarence-page/ct-column-trump-hawley-twitter-first-amendment-page-20210112-qdt6ehyk3fecnbha5lw4ginwa4-story.html,http://chicagotribune.com/columns/clarence-page/ct-perspec-page-caravan-donald-trump-judge-jon-tigar-1125-20181121-story.html,http://chicagotribune.com/news/columnists/chinews-clarence-page-20130507-staff.html,http://businessandcommunitypartners.blogspot.com/2015/08/i-dont-believe-arc-of-justice-bends-our.html,http://businessandcommunitypartners.blogspot.com/2015/08,http://arcamax.com/politics/fromtheleft/clarencepage/s-2361828-p2,http://arcamax.com/politics/fromtheleft/clarencepage/s-2368961,http://arcamax.com/politics/fromtheleft/clarencepage/s-2314283,http://arcamax.com/politics/fromtheleft/clarencepage/s-2357041,http://destee.com/threads/rachel-dolezals-naturalhair-politics.87634,http://arcamax.com/politics/fromtheleft/clarencepage/s-2302572,http://newsday.com/opinion/commentary/dems-lawsuit-not-as-nutty-as-it-sounds-1.18269124,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  558: '7730,Brenden,Rearick,investorplace.com,[email protected],95,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',
  592: '8639,Ian,Chee,lowyat.net,[email protected],98,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'}}

Sample from the original CSV file:

,first_name,last_name,domain,Email,Score,Verification status,Source 1,Source 2,Source 3,Source 4,Source 5,Source 6,Source 7,Source 8,Source 9,Source 10,Source 11,Source 12,Source 13,Source 14,Source 15,Source 16,Source 17,Source 18,Source 19,Source 20,Source 21,Source 22,Source 23,Source 24,Source 25,Source 26,Source 27,Source 28,Source 29,Source 30
0,Jon,Levine,nypost.com,[email protected],87,accept_all,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Sujata,Rao,reuters.com,[email protected],94,accept_all,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Tony,Dokoupil,cbsnews.com,[email protected],87,accept_all,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Pippa,Stevens,cnbc.com,[email protected],94,unknown,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,Clarissa-jan,Lim,buzzfeednews.com,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,Thomas,Franck,cnbc.com,[email protected],92,unknown,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7,Matt,Phillips,nytimes.com,[email protected],97,valid,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
8,Jake,Johnson,commondreams.org,[email protected],90,accept_all,http://portside.org/node/19308/printable/print,http://beatnikmalcontent.blogspot.com/2019,http://beatnikmalcontent.blogspot.com/2019/02,http://beatnikmalcontent.blogspot.com/2019/02/socialism-is-more-popular-than-you.html,http://portside.org/2019-02-07/socialism-more-popular-you-think-mr-president-ocasio-cortez-says-trump-attack-shows,http://commondreams.org/key-staff,http://24ahead.com/u/johnsonjakep,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11,Graig,Graziosi,independent.co.uk,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Upvotes: 1

Views: 1357

Answers (2)

Andrej Kesely
Andrej Kesely

Reputation: 195468

You can use usecols=:

df = pd.read_csv("your_data.txt", usecols=[1, 2, 3, 4, 6])
print(df)

Prints:

     first_name last_name             domain                      Email Verification status
0           Jon    Levine         nypost.com         [email protected]          accept_all
1        Sujata       Rao        reuters.com     [email protected]          accept_all
2          Tony  Dokoupil        cbsnews.com      [email protected]          accept_all
3         Pippa   Stevens           cnbc.com  [email protected]             unknown
4  Clarissa-jan       Lim   buzzfeednews.com                        NaN                 NaN
5        Thomas    Franck           cnbc.com     [email protected]             unknown
6          Matt  Phillips        nytimes.com  [email protected]               valid
7          Jake   Johnson   commondreams.org      [email protected]          accept_all
8         Graig  Graziosi  independent.co.uk                        NaN                 NaN

Upvotes: 2

RJ Adriaansen
RJ Adriaansen

Reputation: 9629

You can read the file with pandas.read_csv() with error_bad_lines=False:

import pandas as pd
df = pd.read_csv('filename.csv', error_bad_lines=False)
df = df[['first_name', 'last_name', 'domain', 'Email', 'Verification status']]

This will output:

first_name last_name domain Email Verification status
0 Jon Levine nypost.com [email protected] accept_all
1 Sujata Rao reuters.com [email protected] accept_all
2 Tony Dokoupil cbsnews.com [email protected] accept_all
3 Pippa Stevens cnbc.com [email protected] unknown
4 Clarissa-jan Lim buzzfeednews.com nan nan
5 Thomas Franck cnbc.com [email protected] unknown
6 Matt Phillips nytimes.com [email protected] valid
7 Graig Graziosi independent.co.uk nan nan

You can save it with df.to_csv('filename.csv')

Upvotes: 1

Related Questions