Open Data for Deep Learning
- Open Source Biometric Recognition Data
- 300 terabytes of high-quality data from the Large Hadron Collider (LHC) at CERN
- Data USA: The most comprehensive visualization of US public data
- EU Surveillance Atlas of Infectious Diseases
- EU Gender statistics database
- The Netherlands’ Nationaal Georegister (Dutch)
- United Nations Development Programme Projects
Symbolic Music Datasets
- Piano-midi.de: classical piano pieces
- Nottingham: over 1000 folk tunes
- MuseData: electronic library of classical music scores
- JSB Chorales: set of four-part harmonized chorales
Image Datasets
- MNIST: handwritten digits
- CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories
- Caltech 101: pictures of objects belonging to 101 categories
- Caltech 256: pictures of objects belonging to 256 categories
- STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Similar to CIFAR-10, with some modifications.
- The Street View House Numbers (SVHN) Dataset
- NORB: binocular images of toy figurines under various illumination and pose
- ImageNet: image database organized according to the WordNet hierarchy
- Pascal VOC: various object recognition challenges
- Labelme: A large dataset of annotated images
- COIL 20: different objects imaged at every angle in a 360-degree rotation
- COIL100: different objects imaged at every angle in a 360-degree rotation
Artificial Datasets
- Arcade Universe: an artificial dataset generator with images containing arcade game sprites such as Tetris pentomino/tetromino objects. This generator is based on O. Breleux’s bugland dataset generator.
- A collection of datasets inspired by the ideas from BabyAISchool:
  - BabyAIShapesDatasets: distinguishing between 3 simple shapes
  - BabyAIImageAndQuestionDatasets: a question-image-answer dataset
- Datasets generated for the purpose of an empirical evaluation of deep architectures (DeepVsShallowComparisonICML2007):
  - MnistVariations: introducing controlled variations in MNIST
  - RectanglesData: discriminating between wide and tall rectangles
  - ConvexNonConvex: discriminating between convex and nonconvex shapes
  - BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds
Face Datasets
- Labeled Faces in the Wild: 13,000 images of faces collected from the web, each labeled with the name of the person pictured.
- Olivetti: a few images of several different people
- Multi-Pie: The CMU Multi-PIE Face Database
- JACFEE: Japanese and Caucasian Facial Expressions of Emotion
- FERET: The Facial Recognition Technology Database
- mmifacedb: MMI Facial Expression Database
- The Yale Face Database and the Yale Face Database B
Text Datasets
- 20 newsgroups: classification task, mapping word occurrences to newsgroup ID
- Reuters (RCV*) Corpuses: text/topic prediction
- Penn Treebank: used for next word prediction or next character prediction
- Broadcast News: large text dataset, classically used for next word prediction
- Multidomain sentiment analysis dataset
Speech Datasets
- TIMIT Speech Corpus: phoneme classification
Recommendation System Datasets
- MovieLens: two datasets. The first has 100,000 ratings for 1,682 movies by 943 users, subdivided into five disjoint subsets. The second has about 1 million ratings for 3,900 movies by 6,040 users.
- Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
- Netflix Prize: Netflix released an anonymized version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
- Book-Crossing dataset: from the Book-Crossing community. Contains 1,149,780 ratings of 271,379 books from 278,858 users.
Miscellaneous Datasets
- CMU Motion Capture Database
- Brodatz dataset: texture modeling
- Million Song dataset
- Merck Molecular Activity Challenge
Thanks to deeplearning.net for many of these links and dataset descriptions. Any suggestions of open data sets we should include for the Deeplearning4j community are welcome!