From the Ballpark to the Numbers: Lessons from Baseball on AI and the Synthetic Data Transforming Healthcare
Working with data has always felt a little like possessing a secret superpower. This past summer, I built a model to predict baseball games. Nothing fancy, just pitch velocity, launch angle, roster changes — the kind of statistics most people overlook in favor of more dramatic narratives. I started betting a dollar per game to make the outcomes matter.
And it worked. I could look at the numbers and see a story that hadn’t happened yet, and predict which team would win before the first pitch was thrown. Winning felt good, not because of the dollar, but because the logic held.
The most rewarding part, though, was watching the model work. After many hours of free time spent researching the logic behind it, it was fulfilling to see it produce useful information.
As artificial intelligence has advanced, that sense of power has grown. Tasks that once took days now take seconds. I used to spend hours cleaning messy datasets and building Monte Carlo simulations, which use repeated random sampling to model uncertainty and risk.
Now, a custom prompt in ChatGPT handles it. That manual labor has vanished, and with that extra time, I started looking for harder problems — bigger questions, more interesting challenges.
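For readers who have never built one, a Monte Carlo simulation can be sketched in a few lines of Python. This is a minimal illustration, not the model described above: the function name, the 55% per-game win probability, and the seven-game series are all made-up assumptions.

```python
import random

def simulate_series_win_prob(p_game, games=7, trials=100_000, seed=42):
    """Estimate the chance of winning a best-of-`games` series by random sampling.

    p_game is an assumed, fixed probability of winning any single game.
    """
    rng = random.Random(seed)       # seeded so the estimate is reproducible
    needed = games // 2 + 1         # wins required to take the series
    series_wins = 0
    for _ in range(trials):
        # Play out every game of one imaginary series and count the wins.
        wins = sum(rng.random() < p_game for _ in range(games))
        if wins >= needed:
            series_wins += 1
    return series_wins / trials

estimate = simulate_series_win_prob(0.55)
print(f"Estimated series win probability: {estimate:.3f}")
```

The more trials you run, the closer the estimate settles toward the true probability, which is the whole idea behind repeated random sampling.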
That’s when I hit the wall. There’s a problem that technology hasn’t solved yet, one I’ve had to watch from the sidelines: cancer.
Watching someone you love fight cancer is a brutal experience. It’s their fight, and you can’t step into the ring for them. There’s no model or simulation that helps. You sit with them, listen, and show up. In those moments, you feel the full weight of your own helplessness as a human.
That helplessness raises a very uncomfortable question: If I can predict a ballgame from my laptop, why can’t we solve this? The data exists. Machine intelligence is more advanced than ever. Brilliant minds want to help. So, what’s missing?
The Bottleneck Isn’t Intelligence — It’s Access
If you’ve spent time on Kaggle, you’ve seen what happens when access is unlocked. Give tens of thousands of data scientists a dataset and a clear question, and they’ll try unconventional approaches traditional researchers might never consider. They blend models, document breakthroughs, and uncover patterns that the data’s original owners never noticed.
That is the true power of shared data. It’s not just about finding better answers; it’s about learning to ask better questions. Yet medicine looks nothing like this. Healthcare data is locked behind institutional walls, complex legal frameworks, and a fear of privacy violations.
Because of this, institutions work in total isolation. Collaboration is a slow and legally fragile process that can take years to negotiate. The global brainpower that wants to help never gets to see the problem. It’s not that people don’t care about finding a cure. It’s simply that the doors are closed to them.
This is where trust enters the picture. Patients have been told for years that their data is being used to improve care and advance research. Meanwhile, they hear about data breaches, ransomware attacks, and confusing privacy policies. When trust breaks, people stop sharing. They avoid care, skip clinical trials, and progress slows to a crawl. In cancer research, where every day matters, this is devastating.
For years, the standard approach was de-identification: strip away names and addresses to make the data anonymous. But mathematically, that approach has long been broken. A handful of seemingly harmless attributes, such as ZIP code, birth date, and sex, can be enough to single out one person. In a world driven by AI, patterns alone can identify you. Behavior can identify you. Re-identification doesn’t need a name. Once you realize that the old way of protecting privacy is an illusion, you can’t unsee it.
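To see why removing names is not enough, here is a small Python sketch that counts how many "anonymous" rows are still one of a kind. The records, field names, and the `unique_fraction` helper are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical "de-identified" records: no names, but everyday attributes remain.
records = [
    {"zip": "60601", "birth_year": 1984, "sex": "F", "diagnosis": "A"},
    {"zip": "60601", "birth_year": 1984, "sex": "F", "diagnosis": "B"},
    {"zip": "60601", "birth_year": 1990, "sex": "M", "diagnosis": "A"},
    {"zip": "60602", "birth_year": 1990, "sex": "M", "diagnosis": "C"},
    {"zip": "60602", "birth_year": 1975, "sex": "F", "diagnosis": "B"},
    {"zip": "60601", "birth_year": 1975, "sex": "F", "diagnosis": "A"},
]

def unique_fraction(rows, keys):
    """Return the share of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)

print("Unique on ZIP alone:", unique_fraction(records, ["zip"]))
print("Unique on ZIP + birth year + sex:",
      unique_fraction(records, ["zip", "birth_year", "sex"]))
```

No single column singles anyone out, but the combination does. Anyone who already knows those few facts about a person can find that person's row.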
How Do We Let Data Work Without Exposing People?
Synthetic data changes the conversation. It’s not scrubbed or altered data. It’s something fundamentally different. Instead of sharing real patient records, a model learns the statistical structure of an entire dataset: how variables relate, and how patterns emerge across thousands of people. Then it generates entirely new records that behave like real data without belonging to anyone.
There’s no one-to-one mapping. No identifiable patient hiding behind a row of numbers. This enables a subtle but powerful shift in how we work. Researchers can explore and iterate freely with far less risk of a privacy leak, and organizations can collaborate without transferring raw data.
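The idea of learning a dataset's statistical structure and then sampling new records from it can be sketched with a toy example. This is a deliberately simple assumption: it fits only means, spreads, and one correlation to made-up (age, biomarker) pairs. Real synthetic data generators are far more sophisticated, and a simple fit like this is not a privacy guarantee on its own.

```python
import math
import random
import statistics

# Made-up (age, biomarker) pairs standing in for a real patient dataset.
real = [(34, 1.2), (45, 1.9), (52, 2.4), (61, 2.8),
        (29, 1.0), (70, 3.5), (48, 2.1), (57, 2.6)]

def fit(rows):
    """Learn summary structure: each variable's mean and spread, plus their correlation."""
    xs, ys = zip(*rows)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "sd_x": sd_x, "sd_y": sd_y, "corr": cov / (sd_x * sd_y)}

def generate(model, n, seed=0):
    """Sample brand-new rows from a bivariate normal with the fitted structure."""
    rng = random.Random(seed)
    rho = model["corr"]
    residual = math.sqrt(max(0.0, 1 - rho ** 2))  # guard against rounding error
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        y = model["mean_y"] + model["sd_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

model = fit(real)
synthetic = generate(model, 5000)
print("Real correlation:     ", round(model["corr"], 3))
print("Synthetic correlation:", round(fit(synthetic)["corr"], 3))
```

The synthetic rows reproduce the relationship between the two variables, yet none of them is a copy of an original row.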
Medicine’s Kaggle Moment Is Starting
We’re already beginning to see what happens when medicine gets its own Kaggle moment. Some cancer centers have started collaborating through a process called federated learning, which lets them train models across institutions without pooling the raw data in one place.
While federated learning keeps the real data in place, synthetic data creates safe stand-ins that can travel anywhere.
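Federated learning can be illustrated with a toy version of federated averaging. In this sketch, three hypothetical sites each improve a shared one-parameter model on their own data, and only the updated parameter, never the raw rows, is sent back to be averaged. The site names, data points, learning rate, and function names are assumptions chosen for illustration.

```python
# Hypothetical per-site datasets of (x, y) pairs; raw rows never leave a site.
sites = {
    "site_a": [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)],
    "site_b": [(1.5, 3.0), (2.5, 5.1)],
    "site_c": [(0.5, 1.1), (4.0, 7.8), (5.0, 10.1), (6.0, 12.2)],
}

def local_update(w, rows, lr=0.01, steps=20):
    """Run a few gradient steps on one site's data for the model y = w * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in rows) / len(rows)
        w -= lr * grad
    return w, len(rows)

def federated_average(sites, rounds=30):
    """Repeatedly send the shared weight out, then average what comes back."""
    w = 0.0
    for _ in range(rounds):
        updates = [local_update(w, rows) for rows in sites.values()]
        total = sum(n for _, n in updates)
        # Weight each site's result by how many rows it trained on.
        w = sum(local_w * n for local_w, n in updates) / total
    return w

shared_w = federated_average(sites)
print(f"Shared model weight after training: {shared_w:.3f}")
```

Every site benefits from what the others learned, while the coordinating step only ever sees a number and a row count from each participant.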
Together, these tools give medicine something it’s never had before: many minds and many different methods to work on the same problem simultaneously.
This isn’t speculative. Synthetic cancer datasets have already been used to prototype models from start to finish before researchers touch real patient data. This has reduced data access timelines from several months down to just a few weeks. It allows teams to debug their code and validate their ideas safely. In some cases, researchers only need a single, tightly controlled interaction with real data at the very end of their project to confirm their findings.
That’s not just a digital sandbox; that is a massive acceleration of progress.
Building Systems That Help Without Harming
Patients expect their data to advance research, and they expect it to stay private. Synthetic data gives us a technical foundation to meet both expectations honestly. We can tell patients that their story matters and their privacy matters, and both of those things can be true at the same time.
For years, the dominant narrative around AI and big data has been one of extraction — value pulled away from people, often without their full understanding. Synthetic data flips that story. Someone fighting a disease can contribute to discoveries they may never see, without fear that their most vulnerable information will be exposed. It allows a researcher who’s never stepped inside a hospital to bring fresh, life-saving ideas to the table.
This is the true breakthrough. It’s not about having faster models or bigger computers. It’s about building a system where data helps people without harming them.
Maybe one day, the same tools that help me predict a curveball on a summer night will help someone spend many more summers with the people they love.
Angela Gibson
As a Senior Data Strategist at One North, Angela is known for her ability to collaborate with teams and leadership to analyze trends, develop intuitive dashboards, and identify key performance indicators that optimize outcomes. She transforms complex information into clear, actionable insights. Her dedication to continuous learning and creating value ensures she approaches every challenge with the goal of driving meaningful and measurable success.
