Reputation: 4630
I am interested in extracting some tables from a website. I have defined a list of links where the tables live; each link contains several tables with the same number of columns. So I am extracting all the tables from the list of links into a single table with pandas' read_html() function as follows:
import multiprocessing
import pandas as pd

links = ['url1.com', 'url2.com', ..., 'urlN.com']

def process_url(link):
    # TODO: add a new column with the link the tables were extracted from
    return pd.concat(pd.read_html(link), ignore_index=False)

p = multiprocessing.Pool()
df = pd.concat(p.map(process_url, links), ignore_index=True)
I noticed that it would be helpful to carry along the provenance of each table (i.e. to save, in a new column, the link each row of the final table came from). So my question is: how can I carry the pandas read_html() source link into a new column?
For example:
Tables 1 and 2 are in url1.com:
table1:
fruit, color, season, price
apple, red, winter, 2$
watermelon, green, winter, 3$
orange, orange, spring, 1$
table2:
fruit, color, season, price
peppermint, green, fall, 3$
pear, yellow, fall, 4$
Table 3 lives in url2.com:
fruit, color, season, price
tomato, red, fall, 3$
pumpkin, orange, fall, 1$
I would like to save, in a new column, the place each table was extracted from (i.e. carry the table's source reference in a new column):
fruit, color, season, price, link
0 apple, red, winter, 2$, url1.com
1 watermelon, green, winter, 3$, url1.com
2 orange, orange, spring, 1$, url1.com
3 peppermint, green, fall, 3$, url1.com
4 pear, yellow, fall, 4$, url1.com
5 tomato, red, fall, 3$, url2.com
6 pumpkin, orange, fall, 1$, url2.com
Another example is this "diagram": note that table 1 and table 2 are in url1.com, while table 3 is in url2.com. With the above function I build a single table from tables that live at different links; my objective is to add a column holding the place each table was extracted from (just to save the reference):
source: url1.com
fruit, color, season, price
apple, red, winter, 2$
watermelon, green, winter, 3$
orange, orange, spring, 1$

source: url1.com
fruit, color, season, price
peppermint, green, fall, 3$
pear, yellow, fall, 4$

source: url2.com
fruit, color, season, price
tomato, red, fall, 3$
pumpkin, orange, fall, 1$

----> fruit, color, season, price, link
      apple, red, winter, 2$, url1.com
      watermelon, green, winter, 3$, url1.com
      orange, orange, spring, 1$, url1.com
      peppermint, green, fall, 3$, url1.com
      pear, yellow, fall, 4$, url1.com
      tomato, red, fall, 3$, url2.com
      pumpkin, orange, fall, 1$, url2.com
Any idea how to do it?
Upvotes: 2
Views: 725
Reputation: 210832
This should do the trick:
def process_url(link):
    return pd.concat(pd.read_html(link), ignore_index=False).assign(link=link)
Explanation: DataFrame.assign(new_column=expression) returns a copy of your DataFrame with the new column added.
Demo:
In [2]: d1
Out[2]:
a b
0 1 10
1 2 20
In [3]: d2
Out[3]:
a b
0 11 100
1 12 200
In [4]: link = 'http://url1.com'
In [5]: pd.concat([d1, d2], ignore_index=True).assign(link=link)
Out[5]:
a b link
0 1 10 http://url1.com
1 2 20 http://url1.com
2 11 100 http://url1.com
3 12 200 http://url1.com
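For completeness, here is how this function slots into the multiprocessing code from the question. This is a minimal sketch, assuming the placeholder URLs actually point at pages that contain HTML tables:

import multiprocessing
import pandas as pd

# Placeholder URLs -- replace with the real pages the tables live on.
links = ['http://url1.com', 'http://url2.com']

def process_url(link):
    # Read every table on the page, stack them, and record the source link in a new column.
    return pd.concat(pd.read_html(link), ignore_index=False).assign(link=link)

if __name__ == '__main__':
    with multiprocessing.Pool() as p:
        df = pd.concat(p.map(process_url, links), ignore_index=True)
    print(df)

Because .assign(link=link) runs inside process_url, every row is tagged with its source before the final concat, so the reference survives when all the per-URL frames are stacked together.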
Upvotes: 2