Repetitions in file names and class labels

The CSV file used to download the dataset contains repeated URL values (possibly intentional, since a single picture may contain many faces), and a few celebrity names share the same assigned class label.

The following two names refer to different celebrities, yet have the same class index:
- Kanchan - nm0437156
- Ilias_Kanchan - nm0437156
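
To look for other cases like this one, something along the following lines could group names by class index. It is only a sketch: the `index` column is taken from the reproduction script in this issue, while the `name` column is an assumption about the CSV header, and `names_per_index` and `shared_indices_in_file` are hypothetical helper names.

```python
import csv
from collections import defaultdict


def names_per_index(rows, index_col='index', name_col='name'):
    """Return {class_index: sorted names} for indices used by more than one name.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    `name_col` is an assumed column name; adjust it to the real CSV header.
    """
    names = defaultdict(set)
    for row in rows:
        names[row[index_col]].add(row[name_col])
    # Keep only the class indices that map to two or more distinct names.
    return {idx: sorted(found) for idx, found in names.items() if len(found) > 1}


def shared_indices_in_file(path='IMDb-Face.csv'):
    """Run the check against the CSV file on disk."""
    with open(path, 'r', newline='') as csv_file:
        return names_per_index(csv.DictReader(csv_file))
```

Calling `shared_indices_in_file()` should then list every class index that is attached to more than one name, which would include the pair above if the column assumption holds.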

Apart from that, a few entries in the dataset are exact repetitions: they share the same class index, filename, and URL. (This assumes that the format `{class_index}_{filename.jpg}` should identify a unique entry.)

Hope this helps!
Alternatively, please let me know if I am mistaken and these repetitions are intentional.

Sample code to reproduce the problem:
```python
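import csv

# Build one '{class_index}_{filename}' key per CSV row.
with open('IMDb-Face.csv', 'r', newline='') as file_a:
    spreadsheet = csv.DictReader(file_a)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]
print(len(entries), 'entries were found.')

# Identical keys collapse in a set, so a smaller count means repeated entries.
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')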
```
```diff
+ 1662888 entries were found.
- 1632927 unique entries were found.
```
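
The counts above show that repetitions exist but not which entries are affected. A minimal sketch for listing them, assuming the same `index` and `image` columns as the script above (`duplicated_keys` is a hypothetical helper name):

```python
import csv
from collections import Counter


def duplicated_keys(rows):
    """Return {key: count} for every '{class_index}_{filename}' key seen more than once."""
    counts = Counter('%s_%s' % (row['index'], row['image']) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def duplicated_keys_in_file(path='IMDb-Face.csv'):
    """Run the check against the CSV file on disk."""
    with open(path, 'r', newline='') as csv_file:
        return duplicated_keys(csv.DictReader(csv_file))
```

The result of `duplicated_keys_in_file()` maps each repeated key to the number of rows that use it, which makes it easier to inspect individual cases.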
