Big Data Derby 2022 by Kaggle (the world’s largest community of data scientists and machine learning enthusiasts that was acquired by Google/Alphabet in 2017) (USA)

The goal of this competition is to analyze horse racing tactics, drafting strategies, and path efficiency. You will develop a model using never-before-released coordinate data along with basic race information. Your work will help racehorse owners, trainers, and veterinarians better understand how equine performance and welfare fit together. With better data analysis, equine welfare could significantly improve.

Context: Injury prevention is a critical component of modern athletics, and sports that involve animals, such as horse racing, are no different from human sports. Typically, efficiency of movement correlates with both improved performance and injury prevention. A wealth of data is now collected, including measures of heart rate, EKG, longitudinal movement, dorsal/ventral movement, medial/lateral deviation, total power and total landing vibration. Your data science skills and analysis are needed to decipher what makes the most positive impact.

In this competition, you will create a model to interpret one aspect of this new data. You’ll be among the first to access X/Y coordinate mapping of horses during races. Using the data, you might analyze jockey decision making, compare race surfaces, or measure the relative importance of drafting. With considerable data, contestants can flex their creative problem-solving skills.

The New York Racing Association (NYRA) and the New York Thoroughbred Horsemen’s Association (NYTHA) conduct world-class thoroughbred racing at Aqueduct Racetrack, Belmont Park and Saratoga Race Course. With your help, NYRA and NYTHA will better understand their vast data set, which could lead to new ways of racing and training in a highly traditional industry. With improved use of horse tracking data, you could help improve equine welfare, performance and rider decision making.

Our sport is currently investing significant money in collecting far more precise tracking data in the hopes of improving equine welfare. Along with stride data, we can now collect measures for heart rate, EKG, longitudinal movement, dorsal/ventral movement, medial/lateral deviation, total power and total landing vibration. However, we do not have analysts with the appropriate expertise to help decipher these data sets.

We hope this competition allows us to interact with data scientists to help find solutions to equine safety issues, and to develop a roster of academics and motivated hobbyists who can lead us in analyzing the coming generations of data.


Your challenge is to generate actionable, practical, and novel insights from horse tracking data and to devise innovative, data-driven approaches to analyzing racing tactics, drafting strategies and path efficiency. There are several potential topics for participants to analyze.

These include, but are not limited to:

  • Create a horse rating measuring expected finish position versus actual finish position. How does a horse’s expected finish position change through the running of a race? Does this metric rely solely on a horse’s own position or is it influenced by the position of competitors?
  • What are the optimal racing strategies across different venues, surfaces and race distances? Create a jockey rating based upon path efficiency.
  • Create a surface measure model that rates the fairness of different paths on a racecourse, which may be beneficial or harmful to finish position. Differences between paths may be a result of unknown barometric, weather or maintenance factors.
  • Create a model measuring the existence (or not) and relevance of a drafting benefit.
  • Create a model to reveal optimal gait patterns. Does the model differ for such factors as age, distance, race section or surface?
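
As an illustration of the path efficiency idea mentioned in the list above, here is a minimal sketch in Python. It is not the organizers' method. It assumes each horse's track is available as an ordered list of (x, y) samples in metres, and it compares the distance actually travelled with the length of a reference path chosen by the analyst (for example, a path along the rail). The function names, the sample coordinates and the choice of reference path are all hypothetical.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive (x, y) samples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def path_efficiency(points, reference_points):
    """Ratio of the reference path length to the distance actually travelled.

    A value of 1.0 means the horse covered exactly the reference distance;
    lower values mean extra ground was covered.
    """
    travelled = path_length(points)
    if travelled == 0:
        raise ValueError("path has zero length")
    return path_length(reference_points) / travelled


# Hypothetical coordinates in metres, for illustration only.
reference_path = [(0.0, 0.0), (6.0, 8.0)]          # straight line, 10 m
horse_path = [(0.0, 0.0), (6.0, 0.0), (6.0, 8.0)]  # detour, 6 m + 8 m = 14 m

efficiency = path_efficiency(horse_path, reference_path)
print(f"path efficiency: {efficiency:.3f}")
```

A ratio like this could be aggregated per jockey or per race section, but whether it is a fair measure depends heavily on how the reference path is defined for each track.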

Contestants should not feel limited to these suggestions.

The above list is not comprehensive, nor is it meant to be a checklist that participants must cover.

Submissions that examine one idea thoroughly are preferred over those that examine several ideas less thoroughly.


An entry to the competition consists of a Notebook submission that is evaluated on the following four components, for a total score between 0 (lowest) and 100 (highest). Submissions will be judged based on how well they address:

Innovation (25 Points Total)

  • Is this a novel way of looking at tracking data? (10 Points)
  • Are the statistical/machine learning approaches using the most up-to-date standards? (5 Points)
  • Will the conclusions challenge the status quo of horse racing methods? (10 Points)

Relevance (30 Points Total)

  • Can the conclusions influence equine welfare, equine performance or rider decision making? (10 Points)
  • Can the conclusions be the basis of future research on future (larger, more granular) data sets? (10 Points)
  • Are the conclusions something that horse racing participants (e.g. owners, trainers, veterinarians) can understand, digest and debate? (10 Points)

Competence (25 Points Total)

  • Given the data, are the statistical models appropriate? (5 Points)
  • Are the conclusions supported by the data? (10 Points)
  • Is the analysis accurate? (10 Points)

Presentation (20 Points Total)

  • Is the writing clear and free of jargon? (5 Points)
  • Are the charts and tables provided interesting, visually appealing, and accurate? (5 Points)
  • Can the analysis thread be followed throughout the presentation? (10 Points)
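
The point allocations listed above can be tallied with a short Python sketch, which may be handy for self-assessing a draft notebook. The dictionary below simply transcribes the per-question maximums from the rubric; the variable names are illustrative and not part of the competition.

```python
# Maximum points per question, transcribed from the judging criteria above.
RUBRIC = {
    "Innovation": [10, 5, 10],
    "Relevance": [10, 10, 10],
    "Competence": [5, 10, 10],
    "Presentation": [5, 5, 10],
}

# Maximum achievable score per component and overall.
component_totals = {name: sum(points) for name, points in RUBRIC.items()}
overall_maximum = sum(component_totals.values())

for name, total in component_totals.items():
    print(f"{name}: {total}")
print(f"Overall maximum: {overall_maximum}")
```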


All notebooks submitted must be made public on or before the submission deadline to be eligible. If submitting as a team, all team members must be listed as collaborators on all notebooks submitted.

  • August 11, 2022 – Start Date.
  • November 10, 2022 – Final Submission Deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.


  • 1st Place – $20,000
  • 2nd Place – $10,000
  • 3rd Place – $10,000
  • 4th Place – $10,000


You can make as many submissions as you like, but we will only consider your latest submission before the deadline of November 10th.

If you are submitting as a team, you do NOT need to merge within the Kaggle platform (it’s actually disabled), but all team members must be listed as collaborators on the submitted Notebook, and all team members must accept the competition rules before the submission deadline.

A valid submission will include a Notebook Analysis.

Putting together this competition was a major undertaking, and we are very thankful to the many co-sponsors who helped to collect the data and funded the prize for this competition. Specifically, we would like to thank: The New York Racing Association (NYRA), The New York Thoroughbred Horsemen’s Association (NYTHA), Equibase, The Jockey Club, The Breeders’ Cup, the Kentucky Thoroughbred Association (KTA) and the Thoroughbred Owners and Breeders Association (TOBA).

The $41BN market for data, analytics and machine learning (ML) platforms is being disrupted. Deep neural networks make ML more capable, cloud computing makes it affordable to train more computationally intensive models and store more data, and there’s a move from proprietary platforms (SAS and Matlab) to open-source platforms (like Python, R and TensorFlow). While the race to win the next-generation data, analytics and machine learning market is crowded, Kaggle and Google are best placed to win. Kaggle is the world’s largest community of data scientists and machine learners and can differentiate by using its community to build a defensible ecosystem. Google is the world’s best machine learning company, with better tools, talent and experience with machine learning than any other company. Kaggle and Google make an unstoppable combination.