Raghavan vmvs

Reputation: 1265

Read multiple files in Python and combine filenames and content into a dataframe

I have the following lists in Python, created by reading files:

files_list = ["A", "B", "C", "D"]

The contents of the files are character vectors as follows

A = ["A1"]
B = ["A2", "B1"]
C = ["A3", "B3", "C3", "C3"]
D = []

I would like to create the following dataframe

Col1   Col2
A      A1
B      A2, B1
C      A3, B3, C3
D     

The filenames should be rendered as one column and the second column should contain the content of the files as a single line.

I tried the following code using a for loop. Note that this is a toy dataset; my real dataset is somewhat larger.

import pandas as pd


df3 = pd.DataFrame()
for i in list_name:
    for j in i:
        df3["Col1"] = j
        df3["Col2"] = i

How do I accomplish this using a for loop? The df3 object I generated was empty. I would appreciate it if someone could take a look.

Upvotes: 0

Views: 599

Answers (2)

Andre S.

Reputation: 518

Assuming your files are CSVs, you can do the following with a for loop:

import glob
import os
import pandas as pd

directory = "C:/your/path/to/all/files/*.csv"
rows = []

for file in glob.glob(directory):
    # File name without directory or extension, e.g. "A" for ".../A.csv"
    col = os.path.splitext(os.path.basename(file))[0]
    try:
        temp = pd.read_csv(file, header=None).values.flatten()
    except pd.errors.EmptyDataError:
        # An empty file has no columns to parse
        temp = None
    rows.append({"col": col, "contents": temp})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the rows
df3 = pd.DataFrame(rows, columns=["col", "contents"])

You get the following DataFrame:

    col contents
0   A   [A1]
1   B   [A2, B1]
2   C   [A3, B3, C3]
3   D   None
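
If the contents are already in memory as lists, as in the question, the same row-by-row pattern works without touching the disk. The following is only a sketch under that assumption; the name files_cont is a placeholder for however the lists are stored. It also hints at why the original attempt came out empty: assigning a scalar to a column of a DataFrame that has no rows creates the column but adds no rows.

```python
import pandas as pd

# Toy data from the question; files_cont is a hypothetical name
files_list = ["A", "B", "C", "D"]
files_cont = [["A1"], ["A2", "B1"], ["A3", "B3", "C3", "C3"], []]

rows = []
for name, items in zip(files_list, files_cont):
    # One dict per file becomes one row of the final frame
    rows.append({"col": name, "contents": items})

# Build the frame once, after the loop has collected every row
df3 = pd.DataFrame(rows, columns=["col", "contents"])
print(df3)
```

Collecting plain dicts and constructing the frame once at the end is usually cheaper than growing a DataFrame inside the loop.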

Upvotes: 2

Adirio

Reputation: 5286

import pandas as pd


files_list = ["A", "B", "C", "D"]
files_cont = [
    ["A1"],
    ["A2", "B1"],
    ["A3", "B3", "C3", "C3"],
    [],
]

df3 = pd.DataFrame({"contents": list(map(sorted, map(set, files_cont)))}, index=files_list)
print(df3)
       contents
A          [A1]
B      [A2, B1]
C  [A3, B3, C3]
D            []

We create a new pd.DataFrame from a dict, so that each key is used as a column name (I used "contents", but choose whatever you like), and pass the index keyword argument to label the rows.

Because the expected output in the question has the duplicates removed, each content list is first passed to set to eliminate duplicate elements, then to sorted to get back a list with sorted elements. If you don't need that, just use {"contents": files_cont} instead.
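
The question's expected output shows each file's contents as a single comma-separated line rather than a list. A minimal sketch of that variant, assuming the original order should be kept (dict.fromkeys drops duplicates while preserving first-seen order, unlike set); df4 and joined are illustrative names only:

```python
import pandas as pd

files_list = ["A", "B", "C", "D"]
files_cont = [["A1"], ["A2", "B1"], ["A3", "B3", "C3", "C3"], []]

# Deduplicate each list in first-seen order, then join it into one string
joined = [", ".join(dict.fromkeys(items)) for items in files_cont]

# Column names follow the table in the question
df4 = pd.DataFrame({"Col1": files_list, "Col2": joined})
print(df4)
```

An empty list joins to an empty string, which matches the blank cell for D in the question.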

Upvotes: 3
