Tom Leung
Tom Leung

Reputation: 374

How to perform this complex calculation using SQL?

Sorry for the vague title, but I can't figure out a clearer one to describe my situation.

I have a MySQL table storing the referencing relationships between articles. The table structure is given below:

Table name: PaperReferences

PaperID ReferencedPaperID
1 2
1 3
1 4
5 1
5 4
6 1
7 1

I want to calculate an indicator for each paper, defined as the number of papers that cite it but not its references. For example, this indicator for paper 1 is 2 since it has 3 citations (5, 6, 7) but only 2 do not cite its references (6, 7). In other words, paper 5 is not counted because it also cites paper 4, which is one of the references of paper 1.

I am currently doing it with Python and iterations. However, it is too slow and I wonder if I can get the results using a single SQL query.

The most tricky part for me is to filter out invalid citing papers that also cite paper 1's references.

Upvotes: 0

Views: 229

Answers (3)

Chris Maurer
Chris Maurer

Reputation: 2539

First figure out what is disallowed, then Join it to your citation list.

Cycles that are disallowed:

Select DISTINCT X.PaperID as Citer, Y.PaperID as disallowed 
From PaperReferences X Inner Join PaperReferences Y
    On X.ReferencedPaperID=Y.ReferencedPaperID
Where X.PaperID<>Y.PaperID

Now left join the table to this disallowed set, filter out the matches, and count your total:

Select ReferencedPaperID as paper, Count(*) as citations
From PaperReferences A Left Outer Join (
    Select DISTINCT X.PaperID as Citer, Y.PaperID as disallowed 
    From PaperReferences X Inner Join PaperReferences Y
        On X.ReferencedPaperID=Y.ReferencedPaperID
    Where X.PaperID<>Y.PaperID
    ) Z on A.PaperID=Z.Citer And A.ReferencedPaperID=Z.Disallowed
Where Z.disallowed Is Null
Group By ReferencedPaperID

Upvotes: 1

The Impaler
The Impaler

Reputation: 48770

You can do:

select ReferencedPaperID, count(*) as indicator
from PaperReferences p
left join (
  select b.PaperID
  from PaperReferences a
  join PaperReferences b on b.PaperId = a.PaperId
  join PaperReferences c on c.ReferencedPaperID = b.ReferencedPaperID
  where a.ReferencedPaperID = c.PaperId
) f on f.PaperId = p.PaperId
where f.PaperId is null
group by ReferencedPaperID

Result:

 ReferencedPaperID  indicator 
 ------------------ --------- 
 2                  1         
 3                  1         
 4                  1         
 1                  2         

See running example at DB Fiddle.

Upvotes: 1

Tim Roberts
Tim Roberts

Reputation: 54698

I'm curious to know why your Python is so slow. Doesn't this solve the problem pretty quickly?

data = (
(1, 2),
(1, 3),
(1, 4),
(5, 1),
(5, 4),
(6, 1),
(7, 1)
)

import collections
cite = collections.defaultdict(set)
cited = collections.defaultdict(set)

for row in data:
    cite[row[0]].add(row[1])
    cited[row[1]].add(row[0])

# We remove 5 from cited[1] because cite[1] and cite[5] overlap.

for k,v in cited.items():
    toremove = set()
    for c in v:
        if cite[k] & cite[c]:
            print(f"removing {c} from {k}")
            toremove.add( c )
    v -= toremove

print(cite)
print(cited)

for k,v in cited.items():
    print( f"{k} has {len(v)} sole citations" )

Upvotes: 1

Related Questions