chandra sutrisno
chandra sutrisno

Reputation: 531

Merge two dataframes using pandas by text similarity

I run a query like below:

select * 
from sd_sms LEFT JOIN categories_phrases 
    on sd_sms.body like  concat('%',categories_phrases.phrase1,'%')
    and sd_sms.body like concat('%',categories_phrases.phrase2,'%')
    and sd_sms.body like concat('%',categories_phrases.phrase3,'%')
    and sd_sms.body like concat('%',categories_phrases.phrase4,'%')

Basically, it will join two tables if one field in table A contains several phrases in table B. But now I need to do this in Python.

Is there any easy way to merge both tables using pandas so it give me same result?

Please advise

Upvotes: 2

Views: 1480

Answers (2)

e.arbitrio
e.arbitrio

Reputation: 588

This code example works with text data and like condition in the join clause.

from pandasql import *
import pandas as pd

pysqldf = lambda q: sqldf(q, globals())

df1 = pd.DataFrame({"name": ['Antony', 'Mark', 'Jacob'], "age":
                                                         [11,12,13]})
df2 = pd.DataFrame({"name": ['Antony', 'Gill', 'John']})

q = """SELECT * FROM df1 LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'"""

df = pysqldf(q)

This is just a dummy DF with example data but I applied a similar condition to your question.

Hope it could be useful.

Upvotes: 1

e.arbitrio
e.arbitrio

Reputation: 588

I don't understand what is your data types because you miss some example data in your answer; but if you need to query pandas dataframe with SQL like sintax you could try to use pandasql package.v It's based on SQLAlchemy ORM tool.

from pandasql import *
import pandas as pd

pysqldf = lambda q: sqldf(q, globals())

q  = """
  SELECT
  m.date
  , m.beef
  , b.births
  FROM
  meat m
  LEFT JOIN
   births b
   ON m.date = b.date
  WHERE
  m.date > '1974-12-31';
  """

meat = load_meat()
births = load_births()

df = pysqldf(q)
df

date    beef    births
0   1975-01-01 00:00:00.000000  2106.0  265775.0
1   1975-02-01 00:00:00.000000  1845.0  241045.0
2   1975-03-01 00:00:00.000000  1891.0  268849.0
3   1975-04-01 00:00:00.000000  1895.0  247455.0
4   1975-05-01 00:00:00.000000  1849.0  254545.0
5   1975-06-01 00:00:00.000000  1849.0  254096.0
6   1975-07-01 00:00:00.000000  1916.0  275163.0
7   1975-08-01 00:00:00.000000  1961.0  281300.0
8   1975-09-01 00:00:00.000000  2065.0  270738.0
9   1975-10-01 00:00:00.000000  2270.0  265494.0
10  1975-11-01 00:00:00.000000  1970.0  251973.0
11  1975-12-01 00:00:00.000000  2055.0  260532.0
12  1976-01-01 00:00:00.000000  2208.0  257455.0
13  1976-01-01 00:00:00.000000  2208.0  259173.0
14  1976-02-01 00:00:00.000000  1966.0  236551.0
15  1976-02-01 00:00:00.000000  1966.0  238153.0
16  1976-03-01 00:00:00.000000  2318.0  257951.0
17  1976-03-01 00:00:00.000000  2318.0  261608.0
18  1976-04-01 00:00:00.000000  2015.0  246469.0
19  1976-04-01 00:00:00.000000  2015.0  250992.0
20  1976-05-01 00:00:00.000000  1969.0  256986.0
21  1976-05-01 00:00:00.000000  1969.0  261572.0
22  1976-06-01 00:00:00.000000  2161.0  250525.0
23  1976-06-01 00:00:00.000000  2161.0  255734.0
24  1976-07-01 00:00:00.000000  2111.0  279630.0
25  1976-07-01 00:00:00.000000  2111.0  279744.0
26  1976-08-01 00:00:00.000000  2233.0  279937.0
27  1976-08-01 00:00:00.000000  2233.0  286496.0
28  1976-09-01 00:00:00.000000  2274.0  273750.0
29  1976-09-01 00:00:00.000000  2274.0  283718.0
... ... ... ...
533 2010-06-01 00:00:00.000000  2320.0  NaN
534 2010-07-01 00:00:00.000000  2229.6  NaN
535 2010-08-01 00:00:00.000000  2286.6  NaN
536 2010-09-01 00:00:00.000000  2252.2  NaN
537 2010-10-01 00:00:00.000000  2234.9  NaN
538 2010-11-01 00:00:00.000000  2235.5  NaN
539 2010-12-01 00:00:00.000000  2270.9  NaN
540 2011-01-01 00:00:00.000000  2122.9  356457.0
541 2011-02-01 00:00:00.000000  2020.4  338521.0
542 2011-03-01 00:00:00.000000  2266.2  350630.0
543 2011-04-01 00:00:00.000000  2052.5  346397.0
544 2011-05-01 00:00:00.000000  2131.9  354886.0
545 2011-06-01 00:00:00.000000  2375.0  348587.0
546 2011-07-01 00:00:00.000000  2134.1  375384.0
547 2011-08-01 00:00:00.000000  2386.9  373333.0
548 2011-09-01 00:00:00.000000  2215.2  367965.0
549 2011-10-01 00:00:00.000000  2215.1  357875.0
550 2011-11-01 00:00:00.000000  2148.8  323788.0
551 2011-12-01 00:00:00.000000  2126.3  353871.0
552 2012-01-01 00:00:00.000000  2113.8  337980.0
553 2012-02-01 00:00:00.000000  2009.0  316641.0
554 2012-03-01 00:00:00.000000  2159.8  347803.0
555 2012-04-01 00:00:00.000000  1990.6  337272.0
556 2012-05-01 00:00:00.000000  2232.0  345257.0
557 2012-06-01 00:00:00.000000  2252.1  346971.0
558 2012-07-01 00:00:00.000000  2200.8  368450.0
559 2012-08-01 00:00:00.000000  2367.5  359554.0
560 2012-09-01 00:00:00.000000  2016.0  361922.0
561 2012-10-01 00:00:00.000000  2343.7  347625.0
562 2012-11-01 00:00:00.000000  2206.6  320195.0

Here the repo: https://github.com/yhat/pandasql and a nice quickstart tutorial http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html

Upvotes: 1

Related Questions