Reputation: 9
I have a CSV file that contains states and their neighbors. In Python, I want to build a graph from this file. How can I convert this data to a 2D list that shows connections with 1's and 0's?
CSV
States | Neighbors |
---|---|
Florida | Alabama, Georgia |
Alabama | Florida, Georgia, Tennessee, Mississippi |
Tennessee | Alabama |
Georgia | Alabama, Florida |
Mississippi | Alabama |
2D List like this one but with only 1's and 0's
 | Florida | Alabama | Tennessee | Georgia | Mississippi |
---|---|---|---|---|---|
Florida | 1 | 1 | 0 | 1 | 0 |
Alabama | 1 | 1 | 1 | 1 | 1 |
Tennessee | 0 | 1 | 1 | 0 | 0 |
Georgia | 1 | 1 | 0 | 1 | 0 |
Mississippi | 0 | 1 | 0 | 0 | 1 |
Upvotes: 0
Views: 171
Reputation: 21
First, import pandas and read data:
import pandas as pd
with open('data.csv', 'r') as file:  # the file is closed automatically
    data = file.readlines()
'data' will look like this:
data
['Florida;Alabama,Georgia\n',
'Alabama;Florida,Georgia,Tennessee,Mississippi\n',
'Tennessee;Alabama\n',
'Georgia;Alabama,Florida\n',
'Mississippi;Alabama\n']
Convert your data to a friendlier format:
# split each line into a city and the list of its neighbors
for i in range(len(data)):
    data[i] = data[i].strip() # to remove '\n'
    data[i] = data[i].split(sep=';') # 'Florida;Alabama,Georgia' > ['Florida', 'Alabama,Georgia']
    data[i][1] = data[i][1].split(sep=',') # 'Alabama,Georgia' > ['Alabama', 'Georgia']
Your data will look like this (much better 😅):
data
[['Florida', ['Alabama', 'Georgia']],
['Alabama', ['Florida', 'Georgia', 'Tennessee', 'Mississippi']],
['Tennessee', ['Alabama']],
['Georgia', ['Alabama', 'Florida']],
['Mississippi', ['Alabama']]]
Then, create a list of cities and a list of neighbors. They will serve as the index and columns of the DataFrame:
# creating a list of cities and a list of neighbors
cities = []
neighboors = []
for d in data:
    cities.append(d[0])
    neighboors.extend(d[1])
neighboors = list(set(neighboors)) # to remove duplicates (order is not preserved)
Lists will look like this:
print('Cities List:', cities,'\nNeighboors List:', neighboors)
Cities List: ['Florida', 'Alabama', 'Tennessee', 'Georgia', 'Mississippi']
Neighboors List: ['Mississippi', 'Alabama', 'Georgia', 'Florida', 'Tennessee']
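Because set() does not guarantee any particular order, the columns can come out in a different order than the rows, and the order may change between runs. A minimal sketch of an order-preserving alternative, assuming the same parsed 'data' structure as above (the names seen and ordered_neighbors are made up for illustration):

```python
# Same structure as the parsed 'data' list shown above
data = [
    ['Florida', ['Alabama', 'Georgia']],
    ['Alabama', ['Florida', 'Georgia', 'Tennessee', 'Mississippi']],
    ['Tennessee', ['Alabama']],
    ['Georgia', ['Alabama', 'Florida']],
    ['Mississippi', ['Alabama']],
]

# Collect every neighbor in the order it appears
seen = []
for state, adjacent in data:
    seen.extend(adjacent)

# dict keys keep insertion order (Python 3.7+), so duplicates drop out
# while the first-seen order is preserved
ordered_neighbors = list(dict.fromkeys(seen))
print(ordered_neighbors)
```

If you want the columns to match the rows exactly, you could also simply reuse the cities list for both.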
Then, create a DataFrame and replace NaN with 1 or 0, according to 'data':
# creating a dataframe with columns and index
df = pd.DataFrame(index=cities, columns=neighboors)
# set 1 for every listed neighbor, then fill the remaining NaN with 0
for d in data:
    for n in d[1]:
        df.loc[d[0], n] = 1
df.fillna(0, inplace=True)
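'df' now holds 1 where the row state lists the column state as a neighbor and 0 elsewhere. The diagonal stays 0, because no state lists itself as a neighbor.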
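If you need a plain 2D list with 1's on the diagonal, as in the table in the question, here is a rough sketch in plain Python. It assumes the same 'State;Neighbor,Neighbor' line layout used in this answer; the names lines, neighbors_of, states and matrix are only illustrative.

```python
# Lines in the 'State;Neighbor,Neighbor' layout assumed in this answer
lines = [
    'Florida;Alabama,Georgia\n',
    'Alabama;Florida,Georgia,Tennessee,Mississippi\n',
    'Tennessee;Alabama\n',
    'Georgia;Alabama,Florida\n',
    'Mississippi;Alabama\n',
]

# Map each state to the set of its listed neighbors
neighbors_of = {}
for line in lines:
    state, adjacent = line.strip().split(';')
    neighbors_of[state] = set(adjacent.split(','))

# Use the same order for rows and columns
states = list(neighbors_of)

# 1 on the diagonal or where the column state is a listed neighbor, else 0
matrix = [
    [1 if col == row or col in neighbors_of[row] else 0 for col in states]
    for row in states
]

for row in matrix:
    print(row)
```

Using one list (states) for both rows and columns keeps the diagonal aligned with the self relationships.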
I hope it helped you 😀
Upvotes: 0
Reputation: 1728
If you are working with graphs in Python, I strongly recommend the NetworkX package (see its documentation).
It has many tools for manipulating graph representations as well as implementations of most common graph algorithms.
For example, suppose your graph is stored in CSV format where the first state on each line is followed by a list of its neighbors:
$ cat data.csv
Florida,Alabama,Georgia
Alabama,Florida,Georgia,Tennessee,Mississippi
Tennessee,Alabama
Georgia,Alabama,Florida
Mississippi,Alabama
Then you can read it in and view the adjacency matrix representation easily:
>>> import networkx as nx
>>> G = nx.read_adjlist("data.csv", delimiter=",")
>>> A = nx.linalg.graphmatrix.adjacency_matrix(G)
>>> A.todense()
matrix([[0, 1, 1, 0, 0],
[1, 0, 1, 1, 1],
[1, 1, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 1, 0, 0, 0]])
An alternative to the adjacency matrix representation is to use a dict of dicts, which is sparse, indexed in the same fashion, and a bit easier to read:
>>> nx.convert.to_dict_of_dicts(G, edge_data=1)
{'Florida': {'Alabama': 1, 'Georgia': 1}, 'Alabama': {'Florida': 1, 'Georgia': 1, 'Tennessee': 1, 'Mississippi': 1}, 'Georgia': {'Florida': 1, 'Alabama': 1}, 'Tennessee': {'Alabama': 1}, 'Mississippi': {'Alabama': 1}}
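If you want the rows and columns in a specific order, and 1's on the diagonal as in the question's table, something along these lines might work. This is only a sketch: it builds the graph from an in-memory dict instead of reading data.csv, and the names adjacency and states are placeholders.

```python
import networkx as nx
import numpy as np

# Neighbor lists from the question, keyed by state (stand-in for data.csv)
adjacency = {
    "Florida": ["Alabama", "Georgia"],
    "Alabama": ["Florida", "Georgia", "Tennessee", "Mississippi"],
    "Tennessee": ["Alabama"],
    "Georgia": ["Alabama", "Florida"],
    "Mississippi": ["Alabama"],
}

# Build an undirected graph from the dict of lists
G = nx.Graph(adjacency)

# Fix the row/column order explicitly instead of relying on node insertion order
states = list(adjacency)
A = nx.to_numpy_array(G, nodelist=states, dtype=int)

# Mark every state as connected to itself
np.fill_diagonal(A, 1)

print(states)
print(A.tolist())
```

Passing nodelist makes the ordering explicit, which helps when the matrix has to line up with labels stored elsewhere.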
Upvotes: 0
Reputation: 35626
Try str.split + explode + str.get_dummies + sum, then use fill_diagonal to add the self relationships in:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Cities': {0: 'Florida', 1: 'Alabama', 2: 'Tennessee', 3: 'Georgia',
4: 'Mississippi'},
'Neighbors': {0: 'Alabama, Georgia',
1: 'Florida, Georgia, Tennessee, Mississippi', 2: 'Alabama',
3: 'Alabama, Florida', 4: 'Alabama'}
})
# split and explode strings into rows, convert to dummies, then sum
# per city (groupby replaces .sum(level=0), which was removed in pandas 2.0)
df = (
    df.set_index('Cities')['Neighbors'].str.split(', ')
    .explode()
    .str.get_dummies()
    .groupby(level=0).sum()
)
# Fill Diagonal to include self relationship as shown in output
np.fill_diagonal(df.values, 1)
print(df)
df:
Alabama Florida Georgia Mississippi Tennessee
Alabama 1 1 1 1 1
Florida 1 1 1 0 0
Georgia 1 1 1 0 0
Mississippi 1 0 0 1 0
Tennessee 1 0 0 0 1
Or, starting again from the original DataFrame, split + explode + crosstab + fill_diagonal:
# split and explode strings into rows
df = df.set_index('Cities')['Neighbors'].str.split(', ').explode()
# Cross tab to calculate relationship
df = pd.crosstab(df.index, df).rename_axis(None).rename_axis(None, axis=1)
# Fill Diagonal to include self-relationship as shown in output
np.fill_diagonal(df.values, 1)
df:
Alabama Florida Georgia Mississippi Tennessee
Alabama 1 1 1 1 1
Florida 1 1 1 0 0
Georgia 1 1 1 0 0
Mississippi 1 0 0 1 0
Tennessee 1 0 0 0 1
To get a numpy array:
df.to_numpy()
[[1 1 1 1 1]
[1 1 1 0 0]
[1 1 1 0 0]
[1 0 0 1 0]
[1 0 0 0 1]]
or a list:
df.to_numpy().tolist()
[[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 1, 0, 0],
[1, 0, 0, 1, 0],
[1, 0, 0, 0, 1]]
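Note that fill_diagonal only marks the self relationships correctly when rows and columns are in the same order. A small sketch of one way to enforce that and to get the original state order back, assuming the same input DataFrame as above (the names order, long, table and matrix are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cities': ['Florida', 'Alabama', 'Tennessee', 'Georgia', 'Mississippi'],
    'Neighbors': ['Alabama, Georgia',
                  'Florida, Georgia, Tennessee, Mississippi',
                  'Alabama', 'Alabama, Florida', 'Alabama'],
})

# Desired order for both rows and columns
order = df['Cities'].tolist()

# One row per (city, neighbor) pair
long = df.set_index('Cities')['Neighbors'].str.split(', ').explode().reset_index()

# Count pairs, then force rows and columns into the same order
table = pd.crosstab(long['Cities'], long['Neighbors'])
table = table.reindex(index=order, columns=order, fill_value=0)

# Work on a copy so the diagonal is written to a plain writable array
matrix = table.to_numpy(copy=True)
np.fill_diagonal(matrix, 1)

print(matrix.tolist())
```

The reindex step also adds an all-zero row or column for any state that appears on only one side of the data.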
Upvotes: 1
Reputation: 195408
Another solution with pd.get_dummies and groupby(level=0).sum():
df["Neighbors"] = df["Neighbors"].str.split(", ")
df = pd.get_dummies(
    df.explode("Neighbors").set_index("Cities")["Neighbors"]
).groupby(level=0).sum()  # .sum(level=0) was removed in pandas 2.0
np.fill_diagonal(df.values, 1)
print(df)
Prints:
Alabama Florida Georgia Mississippi Tennessee
Cities
Alabama 1 1 1 1 1
Florida 1 1 1 0 0
Georgia 1 1 1 0 0
Mississippi 1 0 0 1 0
Tennessee 1 0 0 0 1
Upvotes: 1