Reputation: 374
Sorry for the vague title, but I can't figure out a clearer one to describe my situation.
I have a MySQL table storing the referencing relationships between articles. The table structure is given below:
Table name: PaperReferences
PaperID | ReferencedPaperID |
---|---|
1 | 2 |
1 | 3 |
1 | 4 |
5 | 1 |
5 | 4 |
6 | 1 |
7 | 1 |
I want to calculate an indicator for each paper, defined as the number of papers that cite it but not its references. For example, this indicator for paper 1
is 2 since it has 3 citations (5, 6, 7) but only 2 do not cite its references (6, 7). In other words, paper 5
is not counted because it also cites paper 4
, which is one of the references of paper 1
.
I am currently doing it with Python and iterations. However, it is too slow and I wonder if I can get the results using a single SQL query.
The most tricky part for me is to filter out invalid citing papers that also cite paper 1
's references.
Upvotes: 0
Views: 229
Reputation: 2539
First figure out what is disallowed, then Join it to your citation list.
Cycles that are disallowed:
Select DISTINCT X.PaperID as Citer, Y.PaperID as disallowed
From PaperReferences X Inner Join PaperReferences Y
On X.ReferencedPaperID=Y.ReferencedPaperID
Where X.PaperID<>Y.PaperID
Now left join the table to this disallowed set, filter out the matches, and count your total:
Select ReferencedPaperID as paper, Count(*) as citations
From PaperReferences A Left Outer Join (
Select DISTINCT X.PaperID as Citer, Y.PaperID as disallowed
From PaperReferences X Inner Join PaperReferences Y
On X.ReferencedPaperID=Y.ReferencedPaperID
Where X.PaperID<>Y.PaperID
) Z on A.PaperID=Z.Citer And A.ReferencedPaperID=Z.Disallowed
Where Z.disallowed Is Null
Group By ReferencedPaperID
Upvotes: 1
Reputation: 48770
You can do:
select ReferencedPaperID, count(*) as indicator
from PaperReferences p
left join (
select b.PaperID
from PaperReferences a
join PaperReferences b on b.PaperId = a.PaperId
join PaperReferences c on c.ReferencedPaperID = b.ReferencedPaperID
where a.ReferencedPaperID = c.PaperId
) f on f.PaperId = p.PaperId
where f.PaperId is null
group by ReferencedPaperID
Result:
ReferencedPaperID indicator
------------------ ---------
2 1
3 1
4 1
1 2
See running example at DB Fiddle.
Upvotes: 1
Reputation: 54698
I'm curious to know why your Python is so slow. Doesn't this solve the problem pretty quickly?
data = (
(1, 2),
(1, 3),
(1, 4),
(5, 1),
(5, 4),
(6, 1),
(7, 1)
)
import collections
cite = collections.defaultdict(set)
cited = collections.defaultdict(set)
for row in data:
cite[row[0]].add(row[1])
cited[row[1]].add(row[0])
# We remove 5 from cited[1] because cite[1] and cite[5] overlap.
for k,v in cited.items():
toremove = set()
for c in v:
if cite[k] & cite[c]:
print(f"removing {c} from {k}")
toremove.add( c )
v -= toremove
print(cite)
print(cited)
for k,v in cited.items():
print( f"{k} has {len(v)} sole citations" )
Upvotes: 1