Reputation: 1518
I am trying to put three things into a data frame: (1) a parent attribute, (2) a child attribute, and (3) the grandchild's text. I can print the child attribute and the grandchild text to the screen, but when I try to load them into a data frame, pandas raises a MemoryError.
Here is the setup code:
import requests
from lxml import etree, objectify
r = requests.get(' security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<channel channel="97925" name="blah">
<Time Time="2013-05-01 00:00:00">
<Time Time="2013-05-01 00:01:00">
<Time Time="2013-05-01 00:02:00">
<Time Time="2013-05-01 00:03:00">
This is how I parse the XML to print the child attribute and the grandchild text:
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        print '@' + attrib + '=' + df.attrib[attrib]
    ## value is a child of time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        print 'subfield=' + subfield.text
It yields a very long print out with the information as requested:
@Time=2013-05-01 23:01:00
@Time=2013-05-01 23:02:00
@Time=2013-05-01 23:03:00
@Time=2013-05-01 23:04:00
However, when I try to put it into a data frame, I get a memory error. I tried with both of them, and also with only the child attribute.
data = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        el_data = {}
        el_data[attrib] = df.attrib[attrib]
from pandas import *
perf = DataFrame(data)
MemoryError Traceback (most recent call last)
<ipython-input-6-08c8c74f7192> in <module>()
1 from pandas import *
----> 2 perf = DataFrame(data)
3 perf
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
418 if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
--> 419 arrays, columns = _to_arrays(data, columns, dtype=dtype)
420 columns = _ensure_index(columns)
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
5457 return _list_of_dict_to_arrays(data, columns,
5458 coerce_float=coerce_float,
-> 5459 dtype=dtype)
5460 elif isinstance(data[0], Series):
5461 return _list_of_series_to_arrays(data, columns,
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
5521 for d in data]
-> 5523 content = list(lib.dicts_to_array(data, list(columns)).T)
5524 return _convert_object_array(content, columns, dtype=dtype,
5525 coerce_float=coerce_float)
/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/ in pandas.lib.dicts_to_array (pandas/lib.c:7657)()
I have 12960 values of "value" in my XML file. I assumed the error meant that the values in the file are not in the form pandas expects, but that does not fit a memory error, and I could not work it out from other SO questions about memory errors or from the pandas documentation.
An attempt to get the data types yields no information. Maybe there are no types, perhaps because they are elements in an element tree? (I tried to print .pyval, but was only told that the attribute does not exist.) el_data is of type "dict".
print(objectify.dump(root))[700:1000] #print a subset of types
name = 'zone'
Time = None [_Element]
* Time = '2013-05-01 00:00:00'
value = '258' [_Element]
Time = None [_Element]
* Time = '2013-05-01 00:01:00'
value = '259' [_Element]
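For what it's worth, the dump above suggests that both the Time attribute and the value text are plain strings, which would explain why no numeric types show up. Below is a minimal Python 3 sketch of converting them explicitly. It swaps in the standard library's xml.etree.ElementTree for lxml, and SAMPLE is a made-up two-row document shaped like the output above, not the real feed.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample shaped like the feed in the question (not the real data).
SAMPLE = """
<root>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
    <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
  </channel>
</root>
"""

root = ET.fromstring(SAMPLE)
first = root.find('./channel/Time')

# Attribute values and element text come back as str, not typed values.
raw_time = first.get('Time')
raw_value = first.find('value').text

# Convert explicitly once the strings have been extracted.
parsed_time = datetime.strptime(raw_time, '%Y-%m-%d %H:%M:%S')
parsed_value = int(raw_value)

print(type(raw_time).__name__, type(raw_value).__name__)
print(parsed_time, parsed_value)
```

As far as I know, .pyval is provided by elements parsed through lxml.objectify, so a tree built with etree.fromstring would not be expected to have it.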
I built this code based on the book Python for Data Analysis and on other examples found on SO for parsing XML. I am still new to Python.
Running Python 2.7.2 on Mac OS 10.7.5
Upvotes: 2
Views: 2947
Reputation: 1518
Answer based on help from Jeff and JoeKington. The data needed to be collected into separate lists before being pushed into the data frame. The memory error was caused by the multiple "elements", which could not be put into a data frame directly. Instead, each element's data needs to be put into a list, which can then go into a data frame.
This works:
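times, values = [], []  # reconstructed: the list names are assumed, the original lines were lost
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        times.append(df.attrib[attrib])  # reconstructed loop body
    ## value is a child of Time, and iterate
    subfields = df.getchildren()
    for subfield in subfields:
        values.append(subfield.text)  # reconstructed loop body
## Assumes `from pandas import *` as in the question; column names follow the output below
perf = DataFrame({'Time': times, 'value': values})
perf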
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time 12960 non-null values
value 12960 non-null values
dtypes: object(2)
                  Time value
0  2013-05-01 00:00:00   258
1  2013-05-01 00:01:00   259
2  2013-05-01 00:02:00   258
3  2013-05-01 00:03:00   257
4  2013-05-01 00:04:00   257
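The question also asked for the parent attribute. One way to carry it along is to build one flat dict per Time element, sketched below in Python 3. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, SAMPLE is a made-up document shaped like the feed, and time_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the feed in the question (not the real data).
SAMPLE = """
<root>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
    <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
  </channel>
</root>
"""


def time_records(root):
    """Return one dict per Time element: parent channel id, Time attribute, value text."""
    records = []
    for channel in root.iter('channel'):
        for time_el in channel.findall('Time'):
            records.append({
                'channel': channel.get('channel'),   # parent attribute
                'Time': time_el.get('Time'),         # child attribute
                'value': time_el.findtext('value'),  # grandchild text
            })
    return records


records = time_records(ET.fromstring(SAMPLE))
for row in records:
    print(row)
```

A list of flat dicts like this can be passed straight to pandas.DataFrame(records), which gives one column per key. With lxml, the parent should also be reachable from a Time element through getparent().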
Upvotes: 1