The CSV file used to download the dataset contains some repeated URL values (possibly intentional, since a single picture may contain several faces), and the class labels assigned to a few celebrity names look inconsistent.
The following entries refer to two different celebrities, yet share the same class index:
- Kanchan - nm0437156
- Ilias_Kanchan - nm0437156
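To find every class index that is shared by more than one name, a sketch along these lines might help. It assumes the CSV has a `name` column next to `index` (only `index` and `image` are used in the reproduction script in this issue); the rows here are in-memory stand-ins, and the third one is hypothetical.

```python
from collections import defaultdict


def names_per_index(rows):
    """Return {class_index: sorted names} for indices used by 2+ names."""
    names = defaultdict(set)
    for row in rows:
        # "name" is an assumed column name; "index" is the class index.
        names[row["index"]].add(row["name"])
    # Keep only the class indices that map to more than one distinct name.
    return {idx: sorted(n) for idx, n in names.items() if len(n) > 1}


# Small in-memory sample mirroring the two entries listed above.
sample = [
    {"name": "Kanchan", "index": "nm0437156"},
    {"name": "Ilias_Kanchan", "index": "nm0437156"},
    {"name": "Someone_Else", "index": "nm0000001"},  # hypothetical row
]
print(names_per_index(sample))
```

On the real file, the same function could be fed the rows of `csv.DictReader(open('IMDb-Face.csv', newline=''))`, provided the column names match.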
Apart from that, a few entries in the dataset are exact duplicates: each repeated entry has the same class index, filename, and URL. (This assumes that the format {class_index}_{filename.jpg} should mark a unique entry.)
Hope this helps!
Alternatively, please let me know if I am mistaken and these repetitions are intentional.
Sample code to reproduce the problem.
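import csv

# Build one "{class_index}_{filename}" key per row of the CSV.
with open('IMDb-Face.csv', 'r', newline='') as file_a:
    spreadsheet = csv.DictReader(file_a)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]
print(len(entries), 'entries were found.')

# Identical keys collapse in a set, so a smaller count means duplicates exist.
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')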
+ 1662888 entries were found.
- 1632927 unique entries were found.
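The two counts differ by 29961 rows. To see which keys are repeated, not just how many, a rough sketch using `collections.Counter` could look like this. The helper name `duplicated_keys` and the sample filenames are made up for illustration; only the `index` and `image` columns come from the script above.

```python
from collections import Counter


def duplicated_keys(rows):
    """Return {key: count} for every "{index}_{image}" key seen more than once."""
    counts = Counter('%s_%s' % (row['index'], row['image']) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: the first two collide, the third is unique.
sample = [
    {'index': 'nm0437156', 'image': 'a.jpg'},
    {'index': 'nm0437156', 'image': 'a.jpg'},
    {'index': 'nm0437156', 'image': 'b.jpg'},
]
print(duplicated_keys(sample))
```

Passing the rows from `csv.DictReader` instead of `sample` should list the affected entries, which may make it easier to judge whether the repetitions are intentional.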