Reputation: 41
My data set a list of people either working together or alone.
I have have a row for each project and columns with names of all the people who worked on that project. If column 2 is the first empty column given a row it was a solo job, if column 4 is the first empty column given a row then there were 3 people working together.
My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.
From this a square N x N would be created with every actor labeling the column and row and in cell (A,B) and (B,A) would have how many times that pair worked together, and this would be done for every pair.
I know of a 'pretty' quick way to do it in Excel but I want it automated, hopefully in Stata or Python, just in case projects are added or removed I can just 1-click the re-run and not have to re-do it every time.
An example of the data, in a comma delimited fashion:
A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A
Hope that helps!
Brice. F,D B F F,X,C C,F,D
Upvotes: 1
Views: 1514
Reputation: 77434
I suggest using Python Pandas for this. It enables a slick solutions for formatting your adjacency matrix, and it will make any statistical calculations much easier too. You can also directly extract the matrix of values into a NumPy array, for doing eigenvalue decompositions or other graph-theoretical procedures on the group clusters if needed later.
I assume that the example data you listed is saved into a file called projects_data.csv
(it doesn't need to actually be a .csv file though). I also assume no blank lines between each observations, but this is all just file organization details.
Here's my code for this:
# File I/O part
import itertools, pandas, numpy as np
with open("projects_data.csv") as tmp:
lines = tmp.readlines()
lines = [line.split('\n')[0].split(',') for line in lines]
# Unique letters
s = set(list(itertools.chain(*lines)))
# Actual work.
df = pandas.DataFrame(
np.zeros((len(s),len(s))),
columns=sorted(list(s)),
index=sorted(list(s))
)
for line in lines:
if len(line) == 1:
df.ix[line[0],line[0]] += 1 # Single-person projects
elif len(line) > 1:
# Get all pairs in multi-person project.
tmp_pairs = list(itertools.combinations(line, 2))
# Append pair reversals to update (i,j) and (j,i) for each pair.
tmp_pairs = tmp_pairs + [pair[::-1] for pair in tmp_pairs]
for pair in tmp_pairs:
df.ix[pair[0], pair[1]] +=1
# Uncomment below if you don't want the list
# comprehension method for getting the reverals.
#df.ix[pair[1], pair[0]] +=1
# Final product
print df.to_string()
A B C D E F X
A 1 2 1 1 0 0 1
B 2 1 3 1 2 1 0
C 1 3 0 1 2 2 1
D 1 1 1 0 0 3 1
E 0 2 2 0 0 0 0
F 0 1 2 3 0 1 1
X 1 0 1 1 0 1 0
Now you can do a lot of stuff for free, like see the total number of project partners (repeats included) for each participant:
>>> df.sum()
A 6
B 10
C 10
D 7
E 4
F 8
X 4
Upvotes: 0
Reputation: 353179
Maybe something like this would get you started?
import csv
import collections
import itertools
grid = collections.Counter()
with open("connect.csv", "r", newline="") as fp:
reader = csv.reader(fp)
for line in reader:
# clean empty names
line = [name.strip() for name in line if name.strip()]
# count single works
if len(line) == 1:
grid[line[0], line[0]] += 1
# do pairwise counts
for pair in itertools.combinations(line, 2):
grid[pair] += 1
grid[pair[::-1]] += 1
actors = sorted(set(pair[0] for pair in grid))
with open("connection_grid.csv", "w", newline="") as fp:
writer = csv.writer(fp)
writer.writerow([''] + actors)
for actor in actors:
line = [actor,] + [grid[actor, other] for other in actors]
writer.writerow(line)
[edit: modified to work under Python 3.2]
The key modules are (1)csv
, which makes reading and writing csv files much simpler; (2) collections
, which provides an object called a Counter
-- like a defaultdict(int)
, which you could use if your Python doesn't have Counter
, it's a dictionary which automatically generates default values so you don't have to, and here the default count is 0; and (3) itertools
, which has a combinations
function to get all the pairs.
which produces
,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0
You could use itertools.product
to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.
Upvotes: 1
Reputation: 1138
I guess that you don't have thousands of people working together in these projects. This implementation is pretty simple.
fp = open('projects.cvs')
# counts how many times each pair worked together
pairs = {}
# each element of `project` is a person
for project in (p[:-1].split(',') for p in fp):
project.sort()
# someone is alone here
if len(project) == 1:
continue
# iterate over each pair
for i in range(len(project)):
for j in range(i+1, len(project)):
pair = (project[i], project[j])
# increase `pairs` counter
pairs[pair] = pairs.get(pair, 0) + 1
from pprint import pprint
pprint(pairs)
It outputs:
{('A', 'B'): 1,
('B', 'C'): 2,
('B', 'D'): 1,
('B', 'E'): 1,
('B', 'F'): 2,
('C', 'E'): 1,
('C', 'F'): 1,
('D', 'F'): 1}
Upvotes: 0
Reputation: 16007
If I were to keep this project around for a while, I'd implement a database and then create the matrix you're talking about from a query against that database.
You have a Project
table (let's say) with one record per project, an Actor
table with one row per person, and a Participant
table with a record per project for each actor that was in that project. (Each record would have an ID
, a ProjectID
, and an ActorID
.)
From your example, you'd have 14 Project
records, 7 Actor
records (A through F, and X), and 31 Participant
records.
Now, with this set up, each cell is a query against this database.
To reconstruct the matrix, first you'd add/update/remove the appropriate records in your database, and then rerun the query.
Upvotes: 0