Reputation: 2717
I download a webpage and scrape some data out of it in TSV format. The TSV data is surrounded by HTML that I don't want.
I download the HTML for the page and extract the data I want using BeautifulSoup, so the TSV data is now in memory as a string.
How can I load this in-memory TSV data into pandas? Every method I can find seems to expect a file or URI rather than data I have already scraped.
I don't want to download the text, write it to a file, and then read it back in.
#!/usr/bin/env python2
import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string
    # LOAD_CSV is a placeholder: which pandas call accepts in-memory text?
    data = p.LOAD_CSV(tab_sepd_vals)
    process(data)
Upvotes: 0
Views: 440
Reputation: 68116
If you feed the text/string version of the data into a StringIO.StringIO (or io.StringIO in Python 3.X), you can pass that object to the pandas parser. So your code becomes:
#!/usr/bin/env python2
import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2
import StringIO

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string
    # make the StringIO object
    tsv = StringIO.StringIO(tab_sepd_vals)
    # something like this
    data = p.read_csv(tsv, sep='\t')
    # then what you had
    process(data)
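For Python 3, the same idea might look like the sketch below. It skips the download and uses a hard-coded string in place of the scraped text; the `sample` data and the `tsv_to_frame` helper name are made up for illustration and are not part of the original code.

```python
import io

import pandas as pd


def tsv_to_frame(tsv_text):
    """Parse tab-separated text that is already in memory into a DataFrame."""
    # io.StringIO wraps the string in a file-like object that read_csv accepts.
    return pd.read_csv(io.StringIO(tsv_text), sep="\t")


# Hypothetical sample standing in for the text taken from the <pre> tag.
sample = "name\tcount\na\t1\nb\t2\nc\t3\n"
df = tsv_to_frame(sample)
print(df)
```

Because read_csv does the parsing, the first row is used as the header and the numeric column is inferred as integers without extra work.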
Upvotes: 3
Reputation: 188004
Methods like read_csv do two things: they parse the CSV and they construct a DataFrame object. So in your case you might want to construct the DataFrame directly:
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]])
>>> print(df)
   0  1
0  a  1
1  b  2
2  c  3
The constructor accepts a variety of data structures, such as lists of lists, dicts of columns, and NumPy arrays.
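Applied to tab-separated text, building the DataFrame directly could look roughly like this. It is a minimal Python 3 sketch that assumes the first line is a header and that no field contains an embedded tab or newline; the `tsv_text` sample is invented for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped tab-separated text.
tsv_text = "name\tcount\na\t1\nb\t2\nc\t3\n"

# Split the text into lines, then split each non-empty line on tabs.
rows = [line.split("\t") for line in tsv_text.splitlines() if line]
header, records = rows[0], rows[1:]

# Every value is still a string at this point, so convert columns explicitly.
frame = pd.DataFrame(records, columns=header)
frame["count"] = frame["count"].astype(int)
print(frame)
```

The trade-off is that you handle splitting and type conversion yourself, which read_csv would otherwise do for you, so this mainly makes sense when the data is already in a list-like form.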
Upvotes: 1