Open Data for Deep Learning

Recent Additions

Symbolic Music Datasets

Natural-Image Datasets

Artificial Datasets

Facial Datasets

Text Datasets

Speech Datasets

  • TIMIT Speech Corpus: phoneme classification

Recommendation Datasets

  • MovieLens: Two datasets. The first has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second has about 1 million ratings for 3900 movies by 6040 users.
  • Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
  • Netflix Prize: Netflix released an anonymized version of their movie rating dataset. It consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
  • Book-Crossing dataset: From the Book-Crossing community. Contains 1,149,780 ratings of 271,379 books from 278,858 users.
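
Rating datasets like the ones above are usually handled as (user, item, rating) records. The following is a minimal Python sketch, not code from any of these datasets' distributions, of two common preparation steps: dealing records into five disjoint subsets (as the smaller MovieLens set is described) and rescaling continuous ratings from the Jester range of -10.00 to +10.00 onto 0 to 1. The function names `split_disjoint` and `rescale` and the toy records are assumptions made up for illustration.

```python
import random


def split_disjoint(ratings, k=5, seed=0):
    """Shuffle rating records and deal them into k disjoint subsets."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    # Record i goes to subset i mod k, so subset sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]


def rescale(rating, low=-10.0, high=10.0):
    """Map a rating from the interval [low, high] onto [0, 1]."""
    return (rating - low) / (high - low)


# Toy (user_id, item_id, rating) records, invented for illustration only.
toy = [(u, i, float((u * 7 + i * 3) % 21 - 10))
       for u in range(4) for i in range(5)]

folds = split_disjoint(toy, k=5)
scaled = [(u, i, rescale(r)) for (u, i, r) in toy]
```

Each subset can then serve in turn as the held-out test set while the other four are used for training.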

Miscellaneous Datasets

Thanks to the original source of many of these links and dataset descriptions. Suggestions of open datasets to include for the Deeplearning4j community are welcome!
