vkk07

Reputation: 79

loading data into pandas

I am trying to extract license information for my installed pip packages from PyPI and load it into a pandas DataFrame. I have previously loaded the result of a list comprehension into a DataFrame, but I cannot figure out how to do it in this case.

So far, I have written:

from requests import get

import pandas as pd

import pip

url = 'https://pypi.python.org/pypi'

# packages_list = ['numpy','twisted']

installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
     for i in installed_packages])

packages = []
licenses = []
summarys = []

for index, package in enumerate(installed_packages_list):
    package = package.split("==")[0]
    full_url = url+'/'+ package +'/json'
    #print 'url is ' + full_url
    page = get(url+'/'+package+'/json').json()


    #print 'Package: ' + package + ', license is:' + page['info']['license'] + '. ' + page['info']['summary']
    packages.append(package)
    licenses.append(page['info']['license'])
    summarys.append(page['info']['summary'])


print packages


pd_packages = pd.DataFrame(
    {
    "packages":[packages],
    "licenses":[licenses],
    "summarys":[summarys]
    })

print pd_packages

Upvotes: 1

Views: 134

Answers (2)

MaxU - stand with Ukraine

Reputation: 210832

Try this:

import pandas as pd
import requests

def get_pkg_info(pkg, url_pat='https://pypi.python.org/pypi/{}/json'):
    # Return [name, license, summary]; fields that are unavailable are None.
    r = requests.get(url_pat.format(pkg))
    if r.status_code != requests.codes.ok:
        return [pkg, None, None]
    d = r.json()
    if d and 'info' in d:
        return [pkg, d['info'].get('license'), d['info'].get('summary')]
    else:
        return [pkg, None, None]

# installed_packages_list holds the "name==version" strings built in the question
data = [get_pkg_info(x.split('==')[0]) for x in installed_packages_list]

df = pd.DataFrame(data, columns=['package','license','summary'])

Demo:

In [166]: pd.options.display.max_rows = 15

In [167]: df = pd.DataFrame(data, columns=['package','license','summary'])

In [168]: df
Out[168]:
                package       license                                            summary
0             alabaster          None        A configurable sidebar-enabled Sphinx theme
1       anaconda-client       UNKNOWN         Anaconda Cloud command line client library
2    anaconda-navigator   Proprietary
3      anaconda-project          None                                               None
4            asn1crypto           MIT  Fast ASN.1 parser and serializer with definiti...
5               astroid          LGPL  A abstract syntax tree for Python with inferen...
6               astropy           BSD         Community-developed python astronomy tools
..                  ...           ...                                                ...
216              xarray        Apache          N-D labeled arrays and datasets in Python
217                xlrd           BSD  Library for developers to extract data from Mi...
218          xlsxwriter           BSD     A Python module for creating Excel XLSX files.
219             xlwings  BSD 3-clause  Make Excel fly: Interact with Excel from Pytho...
220                xlwt           BSD  Library to create spreadsheet files compatible...
221           xmltodict           MIT  Makes working with XML feel like you are worki...
222               yapsy           BSD                          Yet another plugin system

[223 rows x 3 columns]
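
A side note on building the package list: in pip 10 and later, `pip.get_installed_distributions()` is no longer importable from the top-level `pip` module, so the question's setup code may fail on a current install. A minimal sketch of one alternative, assuming Python 3.8+ and using the standard library's `importlib.metadata` (the variable name `installed_names` is just illustrative):

```python
from importlib.metadata import distributions

# Collect the names of all installed distributions.
# Entries without a usable Name field are skipped; the set removes duplicates.
installed_names = sorted(
    {dist.metadata.get("Name") for dist in distributions() if dist.metadata.get("Name")},
    key=str.lower,
)

print(len(installed_names), "packages found")
```

Since these are bare names rather than "name==version" strings, they could be passed to `get_pkg_info` directly, without the `split('==')` step.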

Upvotes: 2

Bob Haffner

Reputation: 8483

I think the issue stems from how you create your DataFrame (pd_packages). packages, licenses and summarys are already lists, so wrapping each one as [packages] turns it into a list of lists. The resulting DataFrame has a single row whose cells each hold an entire list, which explains the output in your comment below.

So instead of this:

pd_packages = pd.DataFrame(
    {
    "packages":[packages],
    "licenses":[licenses],
    "summarys":[summarys]
    })

Try this:

pd.DataFrame(
    {
    "packages":packages,
    "licenses":licenses,
    "summarys":summarys
    })
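
To see the difference concretely, here is a minimal sketch. The three short lists are made-up sample values standing in for the real PyPI results, not actual output from the question's loop:

```python
import pandas as pd

# Made-up sample values standing in for the lists built in the question's loop.
packages = ["numpy", "twisted"]
licenses = ["BSD", "MIT"]
summarys = ["array library", "networking engine"]

# Wrapping each list in another list gives ONE row whose cells hold whole lists.
wrapped = pd.DataFrame(
    {"packages": [packages], "licenses": [licenses], "summarys": [summarys]}
)

# Passing the lists directly gives one row per package.
flat = pd.DataFrame(
    {"packages": packages, "licenses": licenses, "summarys": summarys}
)

print(wrapped.shape)  # (1, 3)
print(flat.shape)     # (2, 3)
```

The column count is the same in both cases; only the number of rows changes, which is why the wrapped version prints as a single wide row.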

Upvotes: 0
