Person X

Reputation: 9

CSV file to neighbor relation graph with 1's and 0's

I have a CSV file that contains states and their neighbors. In Python, I want to build a graph from this file. How can I convert the data into a 2D list that shows the connections as 1's and 0's?

CSV

States Neighbors
Florida Alabama, Georgia
Alabama Florida, Georgia, Tennessee, Mississippi
Tennessee Alabama
Georgia Alabama, Florida
Mississippi Alabama

I want a 2D list like this one, but containing only the 1's and 0's (without the labels):

             Florida  Alabama  Tennessee  Georgia  Mississippi
Florida            1        1          0        1            0
Alabama            1        1          1        1            1
Tennessee          0        1          1        0            0
Georgia            1        1          0        1            0
Mississippi        0        1          0        0            1

Upvotes: 0

Views: 171

Answers (4)

Caíque Filipini

Reputation: 21

First, import pandas and read data:

import pandas as pd

# the with block closes the file automatically
with open('data.csv', 'r') as file:
    data = file.readlines()

'data' will look like this:

data

['Florida;Alabama,Georgia\n',
 'Alabama;Florida,Georgia,Tennessee,Mississippi\n',
 'Tennessee;Alabama\n',
 'Georgia;Alabama,Florida\n',
 'Mississippi;Alabama\n']

Convert your data into a friendlier format:

# turning each line into [city, [neighbors]]
for i in range(len(data)):
    data[i] = data[i].strip() # to remove '\n'
    data[i] = data[i].split(sep=';') # 'Florida;Alabama,Georgia' > ['Florida', 'Alabama,Georgia']
    data[i][1] = data[i][1].split(sep=',') # 'Alabama,Georgia' > ['Alabama', 'Georgia']

Your data will look like this (much better 😅):

data

 [['Florida', ['Alabama', 'Georgia']],
 ['Alabama', ['Florida', 'Georgia', 'Tennessee', 'Mississippi']],
 ['Tennessee', ['Alabama']],
 ['Georgia', ['Alabama', 'Florida']],
 ['Mississippi', ['Alabama']]]
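
If you would rather not split the lines by hand, the same step can be sketched with the standard csv module. This is only a sketch: the ';' separator between a state and its neighbors is the assumption made in this answer (the question does not confirm it), and the StringIO object stands in for the real file.

```python
import csv
import io

# Stand-in for open('data.csv', newline=''). The ';' separator is an
# assumption carried over from this answer, not confirmed by the question.
sample = io.StringIO(
    "Florida;Alabama,Georgia\n"
    "Alabama;Florida,Georgia,Tennessee,Mississippi\n"
    "Tennessee;Alabama\n"
    "Georgia;Alabama,Florida\n"
    "Mississippi;Alabama\n"
)

data = []
for state, neighbors in csv.reader(sample, delimiter=";"):
    # Split the neighbor field and trim stray spaces around each name.
    data.append([state, [n.strip() for n in neighbors.split(",")]])

print(data[0])
```

The strip() call also copes with files that put a space after each comma, as in the question's sample.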

Then create a list of cities and a list of neighbors. They will help you create the DataFrame:

# creating a list of cities and a list of neighbors
cities = []
neighboors = []
for d in data:
    cities.append(d[0])
    neighboors.extend(d[1])
neighboors = list(set(neighboors)) # to remove duplicates

Lists will look like this:

print('Cities List:', cities,'\nNeighboors List:', neighboors)

Cities List: ['Florida', 'Alabama', 'Tennessee', 'Georgia', 'Mississippi'] 
Neighboors List: ['Mississippi', 'Alabama', 'Georgia', 'Florida', 'Tennessee']

Then create a DataFrame and replace NaN with 0 or 1, according to 'data':

# creating a dataframe with cities as index and neighbors as columns
df = pd.DataFrame(index=cities, columns=neighboors)

# set 1 for each listed neighbor, then replace the remaining NaN with 0
for d in data:
    for n in d[1]:
        df.loc[d[0], n] = 1
df = df.fillna(0).astype(int)

'df' will look like this:

result df
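
Two things to keep in mind: set() does not keep a fixed order, so the columns may come out in a different order than the rows, and the diagonal stays 0. If you need the square 2D list from the question, here is a minimal sketch in plain Python. It assumes every neighbor also has its own row in the file, and it treats each state as connected to itself because the expected output in the question does.

```python
data = [
    ["Florida", ["Alabama", "Georgia"]],
    ["Alabama", ["Florida", "Georgia", "Tennessee", "Mississippi"]],
    ["Tennessee", ["Alabama"]],
    ["Georgia", ["Alabama", "Florida"]],
    ["Mississippi", ["Alabama"]],
]

# One fixed order shared by rows and columns (the order of the file).
names = [state for state, _ in data]
position = {name: i for i, name in enumerate(names)}

matrix = [[0] * len(names) for _ in names]
for state, neighbors in data:
    row = position[state]
    matrix[row][row] = 1  # self-relationship, as in the question's table
    for n in neighbors:
        # Assumption: every neighbor appears as a state; otherwise KeyError.
        matrix[row][position[n]] = 1

for name, row in zip(names, matrix):
    print(name, row)
```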

I hope it helped you 😀

Upvotes: 0

Andrew Eckart

Reputation: 1728

If you are working with graphs in Python, I strongly recommend the NetworkX package (see its documentation).

It has many tools for manipulating graph representations as well as implementations of most common graph algorithms.

For example, suppose your graph is stored in CSV format where the first state on each line is followed by a list of its neighbors:

$ cat data.csv
Florida,Alabama,Georgia
Alabama,Florida,Georgia,Tennessee,Mississippi
Tennessee,Alabama
Georgia,Alabama,Florida
Mississippi,Alabama

Then you can read it in and view the adjacency matrix representation easily:

>>> import networkx as nx
>>> G = nx.read_adjlist("data.csv", delimiter=",")
>>> A = nx.adjacency_matrix(G)
>>> A.todense()
matrix([[0, 1, 1, 0, 0],
        [1, 0, 1, 1, 1],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0]])

An alternative to the adjacency matrix representation is to use a dict of dicts, which is sparse, indexed in the same fashion, and a bit easier to read:

>>> nx.convert.to_dict_of_dicts(G, edge_data=1)
{'Florida': {'Alabama': 1, 'Georgia': 1}, 'Alabama': {'Florida': 1, 'Georgia': 1, 'Tennessee': 1, 'Mississippi': 1}, 'Georgia': {'Florida': 1, 'Alabama': 1}, 'Tennessee': {'Alabama': 1}, 'Mississippi': {'Alabama': 1}}
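
If you start from a dict of dicts like the one above and still want the 1's and 0's as a nested list, a plain-Python sketch could look like this. The dict is copied from the output above; setting the diagonal to 1 is an assumption taken from the question's expected table (the adjacency matrix above has 0 there).

```python
adjacency = {
    "Florida": {"Alabama": 1, "Georgia": 1},
    "Alabama": {"Florida": 1, "Georgia": 1, "Tennessee": 1, "Mississippi": 1},
    "Georgia": {"Florida": 1, "Alabama": 1},
    "Tennessee": {"Alabama": 1},
    "Mississippi": {"Alabama": 1},
}

# The dict's key order defines both the row order and the column order.
nodes = list(adjacency)

matrix = [
    [1 if (u == v or v in adjacency[u]) else 0 for v in nodes]
    for u in nodes
]

for u, row in zip(nodes, matrix):
    print(u, row)
```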

Upvotes: 0

Henry Ecker

Reputation: 35626

Try str.split + explode + str.get_dummies + sum:

Then use fill_diagonal to add the self relationships in:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cities': {0: 'Florida', 1: 'Alabama', 2: 'Tennessee', 3: 'Georgia',
               4: 'Mississippi'},
    'Neighbors': {0: 'Alabama, Georgia',
                  1: 'Florida, Georgia, Tennessee, Mississippi', 2: 'Alabama',
                  3: 'Alabama, Florida', 4: 'Alabama'}
})

# split and explode strings into rows, convert to dummies, then sum
# per city (groupby(level=0) replaces sum(level=0), which pandas 2.0 removed)
df = (
    df.set_index('Cities')['Neighbors'].str.split(', ')
        .explode()
        .str.get_dummies()
        .groupby(level=0).sum()
)

# Fill Diagonal to include self relationship as shown in output
np.fill_diagonal(df.values, 1)

print(df)

df:

             Alabama  Florida  Georgia  Mississippi  Tennessee
Alabama            1        1        1            1          1
Florida            1        1        1            0          0
Georgia            1        1        1            0          0
Mississippi        1        0        0            1          0
Tennessee          1        0        0            0          1

Or split + explode + crosstab + fill_diagonal (starting again from the original df defined above):

# split and explode strings into rows
df = df.set_index('Cities')['Neighbors'].str.split(', ').explode()

# Cross tab to calculate relationship
df = pd.crosstab(df.index, df).rename_axis(None).rename_axis(None, axis=1)

# Fill Diagonal to include self-relationship as shown in output
np.fill_diagonal(df.values, 1)

df:

             Alabama  Florida  Georgia  Mississippi  Tennessee
Alabama            1        1        1            1          1
Florida            1        1        1            0          0
Georgia            1        1        1            0          0
Mississippi        1        0        0            1          0
Tennessee          1        0        0            0          1

To get a numpy array:

df.to_numpy()
[[1 1 1 1 1]
 [1 1 1 0 0]
 [1 1 1 0 0]
 [1 0 0 1 0]
 [1 0 0 0 1]]

or a list:

df.to_numpy().tolist()
[[1, 1, 1, 1, 1],
 [1, 1, 1, 0, 0],
 [1, 1, 1, 0, 0],
 [1, 0, 0, 1, 0],
 [1, 0, 0, 0, 1]]
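
Since a neighbor relation should be symmetric, a quick sanity check on the nested list can catch a file where one side of a pair is missing. The helper below is a hypothetical name introduced for illustration, and the matrix is typed in from the df printed above (rows and columns in alphabetical order).

```python
labels = ["Alabama", "Florida", "Georgia", "Mississippi", "Tennessee"]
matrix = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
]


def asymmetric_pairs(matrix, labels):
    """Return (row, column) label pairs where matrix[i][j] != matrix[j][i]."""
    n = len(matrix)
    return [
        (labels[i], labels[j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] != matrix[j][i]
    ]


# An empty list means every listed neighbor lists the state back.
problems = asymmetric_pairs(matrix, labels)
print(problems)
```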

Upvotes: 1

Andrej Kesely

Reputation: 195408

Another solution with pd.get_dummies and .groupby(level=0).sum() (the .sum(level=0) shortcut was removed in pandas 2.0):

df["Neighbors"] = df["Neighbors"].str.split(", ")
df = pd.get_dummies(
    df.explode("Neighbors").set_index("Cities")["Neighbors"]
).groupby(level=0).sum()
np.fill_diagonal(df.values, 1)
print(df)

Prints:

             Alabama  Florida  Georgia  Mississippi  Tennessee
Cities                                                        
Alabama            1        1        1            1          1
Florida            1        1        1            0          0
Georgia            1        1        1            0          0
Mississippi        1        0        0            1          0
Tennessee          1        0        0            0          1
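
Note that the result is sorted alphabetically, while the question lists the states in file order. If that order matters, one possible sketch is to reindex both axes to the same list before filling the diagonal. It uses pd.crosstab, as in the answer above; the names order, pairs, table, values and matrix are just illustrative. Filling the diagonal on a copy of the values is a cautious choice so the write never hits a read-only view.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cities": ["Florida", "Alabama", "Tennessee", "Georgia", "Mississippi"],
    "Neighbors": [
        "Alabama, Georgia",
        "Florida, Georgia, Tennessee, Mississippi",
        "Alabama",
        "Alabama, Florida",
        "Alabama",
    ],
})

order = df["Cities"].tolist()  # row order as it appears in the file

# One row per (city, neighbor) pair.
pairs = df.assign(Neighbors=df["Neighbors"].str.split(", ")).explode("Neighbors")
table = pd.crosstab(pairs["Cities"], pairs["Neighbors"])

# Put rows and columns in the same order so the diagonal lines up.
table = table.reindex(index=order, columns=order, fill_value=0)

# Fill the diagonal on a copy, then convert to a plain nested list.
values = table.to_numpy().copy()
np.fill_diagonal(values, 1)
matrix = values.tolist()

print(matrix)
```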

Upvotes: 1
