Reputation: 1121
I have a folder, poster_folder, containing JPG files, for example 1.jpg, 2.jpg, 3.jpg.
The path to this folder is:
from pathlib import Path
from PIL import Image
images_dir = Path('C:\\Users\\HP\\Desktop\\PGDinML_AI_IIITB\\MS_LJMU\\Dissertation topics\\Project_2_Classification of Genre for Movies using Machine Leaning and Deep Learning\\Final_movieScraping_data_textclasification\\posters_final').expanduser()
I have a data frame with jpg image info as:
df_subset_cleaned_poster.head(3)
movie_name  movie_image
Lion_king   1.jpg
avengers    2.jpg
iron_man    3.jpg
I am trying to draw a scatter plot of the width and height of all JPG files in the folder (they have different resolutions), as below:
height, width = np.empty(len(df_subset_cleaned_poster)), np.empty(len(df_subset_cleaned_poster))
for i in range(len(df_subset_cleaned_poster.movie_image)):
    w, h = Image.open(images_dir.joinpath(df_subset_cleaned_poster['movie_image'][i])).size
    width[i], height[i] = w, h
plt.scatter(width, height, alpha=0.5)
plt.xlabel('Width'); plt.ylabel('Height'); plt.show()
This throws an error: KeyError: 208
df_subset_cleaned_poster.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10225 entries, 0 to 10986
Data columns (total 2 columns):
movie_name 10225 non-null object
movie_image 10225 non-null object
dtypes: object(2)
Upvotes: 0
Views: 193
Reputation: 2516
As discussed in the comments, the issue seems to lie in how the dataframe is created or in the CSV file itself.
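One way such a KeyError can arise, assuming rows were dropped while the dataframe was cleaned: the info() output in the question lists 10225 entries with index labels from 0 to 10986, so some labels are missing, and df['movie_image'][i] looks rows up by label rather than by position. A minimal sketch with made-up data (the four-row frame and the dropped row are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical data: four rows, then one row dropped, as a cleaning step might do.
df = pd.DataFrame({"movie_image": ["1.jpg", "2.jpg", "3.jpg", "4.jpg"]})
df = df.drop(index=1)  # remaining index labels are 0, 2, 3

try:
    df["movie_image"][1]  # label-based lookup; label 1 no longer exists
    raised = False
except KeyError:
    raised = True

# Position-based access works regardless of the index labels.
second_by_position = df["movie_image"].iloc[1]

# Rebuilding the index makes labels and positions agree again.
df_reset = df.reset_index(drop=True)
second_by_label = df_reset["movie_image"][1]
```

If this is the cause, calling reset_index(drop=True) on df_subset_cleaned_poster before the original loop, or switching the lookup to .iloc[i], may be enough.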
I was able to create a proper scatter plot with the following code:
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
from io import StringIO
if __name__ == '__main__':
    images_dir = Path("../data/images")
    infile = StringIO("""movie_name,movie_image
Lion_king,1.jpg
avengers,2.jpg
iron_man,3.jpg
""")
    df_subset_cleaned_poster = pd.read_csv(infile)
    n = len(df_subset_cleaned_poster)
    height, width = np.empty(n), np.empty(n)
    for i, filename in enumerate(df_subset_cleaned_poster.movie_image):
        w, h = Image.open(images_dir / filename).size
        width[i], height[i] = w, h
    plt.scatter(width, height, alpha=0.5)
    plt.xlabel('Width')
    plt.ylabel('Height')
    plt.show()
I suggest that you use this code as the starting point for further experiments. I am using enumerate to iterate over the values in df_subset_cleaned_poster.movie_image, so the loop never looks a row up by its index label. This should make it robust against lookup errors such as the KeyError in the question.
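With roughly ten thousand posters, it may also help to close each image after reading its size and to skip names that are listed in the dataframe but missing on disk. The following is only a sketch under those assumptions; collect_sizes is a hypothetical helper name, not part of PIL or pandas:

```python
from pathlib import Path
from PIL import Image


def collect_sizes(images_dir, filenames):
    """Return (widths, heights) for the listed files that exist on disk."""
    widths, heights = [], []
    for filename in filenames:
        path = Path(images_dir) / filename
        if not path.is_file():
            continue  # listed in the dataframe but not present in the folder
        with Image.open(path) as img:  # the context manager closes the file
            w, h = img.size
        widths.append(w)
        heights.append(h)
    return widths, heights
```

It could be called as collect_sizes(images_dir, df_subset_cleaned_poster.movie_image), and the two returned lists passed straight to plt.scatter.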
As you can see, I replaced the input file with a mock string wrapped in StringIO. Just replace it with infile = open("your_file.txt") to use the real data again (pd.read_csv also accepts the file path directly).
Upvotes: 1