Data Scientist Intern

Internship, Research & Development

United States - NY, New York

Medidata: Power Smarter Treatments and Healthier People

Medidata is leading the digital transformation of life sciences, creating hope for millions of patients. Medidata helps generate the evidence and insights to help pharmaceutical, biotech, medical device and diagnostics companies, and academic researchers accelerate value, minimize risk, and optimize outcomes. More than one million registered users across 1,900+ customers and partners access the world's most trusted platform for clinical development, commercial, and real-world data. Medidata, a Dassault Systèmes company, is headquartered in New York City and has offices around the world to meet the needs of its customers. Discover more at and follow us @medidata.

At Medidata, interns have the opportunity to accelerate their careers by working closely with experienced professionals and gaining valuable, hands-on, full-time work experience. As part of our global organization, interns work alongside talented and committed professionals who help them build a strong foundation for achieving their career goals. For 12 weeks, beginning May 20, 2024, interns will gain a deep understanding of what it means to be a Medidatian. United around a single goal of empowering smarter treatments and healthier people, Medidatians work in a culture of curiosity, innovation and fun. You will contribute to the line of business with sustainable and meaningful work. Our Summer Internship program also includes instructor-led training, guided mentorship, exposure to senior leadership and community service.

In addition to their individual responsibilities, each intern will participate in our Intern Innovation Lab. Assigned to cross-functional teams, interns will work closely together to develop an innovative solution to a business problem currently facing Medidata. As they work diligently to present their final solutions to a panel of top Medidata leaders, we are confident that our interns will make a significant impact on our business.

The Position: We are seeking an intern to play a key role in our dynamic team, driving innovative research in synthetic data generation and deriving insights from clinical trial data using LLM-based solutions. Utilizing industry-leading data assets and analytical models, our team is dedicated to transforming the clinical development industry, ensuring both clinical and operational success for our clients and partners.

The key components of this internship are:

  • Performing research and development in the area of synthetic data generation and LLMs
  • Implementing and evaluating algorithms based on research literature
  • Creating, documenting, and maintaining code
  • Reporting findings to internal teams

This project centers on leveraging generative models to enhance the generation of synthetic clinical trial data and derive valuable clinical insights. The scope of the project includes processing and training generative models using large-scale, complex clinical datasets and external data sources. Key aspects of the project encompass data standardization, feature extraction, identification of external data for augmentation, model development, and evaluation. In collaboration with the Medidata AI Synthetic Data Science team, specialists in clinical trial datasets, this role focuses on building advanced models to extract clinical insights across disease indications. The project demands expertise in longitudinal datasets, machine learning, deep learning, LLMs, NLP/NLU models, and/or generative AI models to craft a cutting-edge solution for processing and extracting insights from clinical trial longitudinal datasets.

Your Requirements:

  • Strong performance in a Bachelor's program in Data Science, Mathematics, Statistics, or Computer Science.
  • Proficiency in Python (with pandas) that allows self-sufficiency in analyzing tabular and longitudinal data.
  • 2+ years of experience with Machine Learning and AI, with a focus on areas such as NLP, Deep Learning, Language Modeling, and/or Generative Modeling.
  • Ability to apply statistical analysis techniques.
  • Competence in utilizing ML techniques including Transformers, LSTM, GANs, CNNs, VAEs or other deep gradient-based methods.
  • Demonstrated ability to think creatively, independently access and analyze data, and effectively evaluate both the big picture and key details.
  • Excellent interpersonal, verbal, and written communication skills.
  • Strong time management and problem-solving abilities.
  • Capable of multitasking in a fast-paced environment, with the ability to prioritize deliverables for optimal results.

As with all roles, Medidata sets ranges based on a number of factors including function, level, candidate expertise and experience, and geographic location. The salary range for positions that will be physically based in New York, NY is $32.00 to $37.00 per hour with a $3,500 sign-on bonus.


As a game-changer in sustainable technology and innovation, Medidata, a Dassault Systèmes company, is striving to build more inclusive and diverse teams across the globe. We believe that our people are our number one asset and we want all employees to feel empowered to bring their whole selves to work every day. It is our goal that our people feel a sense of pride and a passion for belonging. As a company leading change, it’s our responsibility to foster opportunities for all people to participate in a harmonized Workforce of the Future.