Linguamatics 2017 Text Mining Summit

October 2, 2017 to October 4, 2017
Venue: New Castle, New Hampshire, United States

See the final agenda


The Linguamatics Text Mining Summit 2017 was held on Mon, 2 October – Wed, 4 October 2017 at the beautiful Wentworth by the Sea in New Castle, New Hampshire, USA.

What did the conference include?

  • Customer presentations featuring best practice, case studies and insights on practical approaches to text mining and knowledge discovery
  • Presentations covering what's new in I2E, looking ahead at developments in the pipeline and future directions for text mining and knowledge discovery
  • Roundtable Discussions covering important topics and challenges in the field of text mining and knowledge discovery
  • Opportunities to network with peers and with Linguamatics experts
  • Hands-on workshops, giving new and experienced users the opportunity to explore the full capabilities of I2E, and discuss best practice in consultation with Linguamatics experts
  • Healthcare track to discuss best practice and use of NLP in healthcare
  • Evening social events
  • Partner presentations and exhibits
  • Meals and refreshments provided during the conference are included with the registration fee

Speakers and Bios

Simon Beaulah


Simon Beaulah is Senior Director, Healthcare and is responsible for Linguamatics’ healthcare products and solutions, including applications in the areas of clinical risk models, population health, and medical research. Previously, Simon was Marketing Director, Translational Medicine at IDBS/InforSense, where he was responsible for the company’s market analysis, product marketing, and go-to-market strategy in healthcare analytics and translational medicine. Prior to IDBS, he was Director of Product Management at BioWisdom, where he was responsible for delivery of customer projects using the company’s ontology products. He also worked as a senior product manager at LION Bioscience and Synomics, and as a software developer at the UK’s Biotechnology and Biological Sciences Research Council. Simon has degrees from Aston University and Cranfield Institute of Technology, UK.

David Birtwell

Penn Medicine BioBank

John Brimacombe


John Brimacombe is a serial entrepreneur and experienced investor. After graduating in Law and Computer Science from Trinity College, Cambridge, he founded Jobstream Group plc, which provides specialist ERP software to the international financial services sector and was acquired by Microgen Plc (MCGN.L). Brimacombe subsequently co-founded pioneering mobile entertainment start-up nGame Ltd., which was acquired by Hands-On Mobile Inc. He served as President/COO of Hands-On Mobile for over 2 years, leading the company through 7 major M&A transactions and massive global expansion. Brimacombe currently chairs the enterprise natural language search tools provider Linguamatics, is a Partner at Sussex Place Ventures, the venture-capital arm of the London Business School, and is a seed investor in multiple US and UK start-ups. He has extensive experience as a non-executive director, from start-up to public markets.

Peter Hornbeck

Cell Signaling Technology

Dongyu Liu


Dongyu Liu is associate director in the science computing group of the Translational Sciences department at Sanofi. Dongyu’s research interests include bioinformatics, data mining and text mining. He plays a key role in bringing in and employing text mining technology to support ongoing research projects at Sanofi. He received a Ph.D. from the University of Rochester, and did postdoctoral research at the Whitehead Institute.

Ross Martin


Dr. Martin is Program Director for Research & Transformation at CRISP, Maryland’s state-designated health information exchange serving MD, DC, WV and the region. He is responsible for all aspects of ambulatory provider outreach and the CRISP Research Initiative. Before joining CRISP in 2015, Dr. Martin was VP of Policy and Development at AMIA, Specialist Leader at Deloitte Consulting, Director of Health Information Convergence at BearingPoint, and Director of Healthcare Informatics at Pfizer. He has worked as an obstetric house physician and as a professional writer. He claims to be the world’s leading medical informatimusicologist.

Paul Milligan


David Milward


David Milward is chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining, and a founder of Linguamatics. He has over 20 years’ experience of product development, consultancy and research in natural language processing (NLP). After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.

Bryan Morganti


Allen Murvine


▪  Over 30 years of experience in Information Technology specializing in strategic integration of software, data and technology.

▪  As the Director of NLP Products and Services at Intelligent Medical Objects, responsible for bringing NLP-related solutions to the marketplace.

▪  IMO NLP solutions uniquely integrate IMO clinical terminologies with NLP partner technologies.

Jane Reed


Jane Reed is the head of life science strategy. She is responsible for developing the strategic vision for Linguamatics’ growing product portfolio and business development in the life science domain. Jane has extensive experience in life sciences informatics. She worked for more than 15 years in vendor companies supplying data products, data integration and analysis and consultancy to pharma and biotech - with roles at Instem, BioWisdom, Incyte, and Hexagen. Before moving into the life science industry, Jane worked in academia with post-docs in genetics and genomics.

Tony Sheaffer


Tony Sheaffer is the master data management specialist for Informatica where he focuses on making data reliable so you can maximize your information assets.  With over 10 years of experience in life sciences and healthcare he has consulted with multiple institutions on how to migrate their structured and unstructured data so a complete picture can be formed of their research.  He has spoken at multiple conferences on a variety of topics around big data analytics and holds degrees in biomedical and database technologies. 

Lue-Yen Tucker

Kaiser Permanente

Lue-Yen Tucker is a Senior Data Consultant at the Kaiser Permanente Northern California Division of Research (KPDOR) in Oakland, CA. Lue-Yen has a Bachelor of Arts degree in Public Administration from National Chung-Hsing University in Taiwan and has done graduate work in computer science at the California State University in Chico, CA. She has been a member of the DOR Biostatistical Consulting Unit led by Mary Anne Armstrong since 2006, providing research proposal, programming, analytical, and publication support to KP clinicians. Prior to joining DOR, she was a programmer/analyst at the Institute for Health & Aging, School of Nursing, UCSF from 2005 to 2006. In 2014, Lue-Yen began using I2E from Linguamatics to search for uterine weight, morcellation, and estimated blood loss in pathology reports and operating notes for a research study on women’s health. Beginning in 2016, she started using I2E to find a series of carotid artery related values in ultrasound, CT, and MRI reports and to identify gallbladder polyps, including dimension and/or size. Lue-Yen has co-authored 12 publications, and numerous oral and poster presentations at various regional and national conferences.

Presentations and Abstracts

Opportunities for applying NLP in the health information exchange environment

Ross Martin, CRISP

CRISP is one of the largest and most successful health information exchanges (HIEs) in the country, serving as Maryland’s state-designated HIE, the de facto HIE for Washington, DC, and providing the technical infrastructure for West Virginia’s Health Information Network (WVHIN). CRISP connects to more than 100 hospitals and more than 1,000 other provider organizations from physician offices to skilled nursing facilities in the region and distributes more than 1.7 million clinical encounter notifications each month. More than 30,000 times each week, providers access the CRISP Clinical Query Portal to review data from over 270 hospital data feeds, including lab reports, radiology reports, discharge summaries, and other structured and unstructured clinical documents. Accessing these rich data sources can be lifesaving and outcome-altering, but it can also be overwhelming. Dr. Martin will discuss potential applications for NLP in a real-world, mission-critical setting for identifying relevant information in a sea of data, informing clinical decision support, improving quality measures, and monitoring public and population health.

Application of Text Mining in HLA Allele Disease Association Annotation

Dongyu Liu, Sanofi

The human leukocyte antigen (HLA) region is the most polymorphic region of the human genome. HLA alleles have been associated with more than 40 different autoimmune diseases, various types of cancer, infectious diseases and drug adverse events. However, there are no known resources that systematically annotate the association of HLA alleles and diseases. In an internal multiple sclerosis biomarker project, we have established a whole exome sequencing based HLA typing and analysis workflow and identified over 400 HLA alleles. We applied the Linguamatics I2E platform to search the literature and annotate the association of the HLA alleles with diseases and drug hypersensitivity. A web interface with a search function was created to review the text mining results and store the curated annotations in a backend knowledge base.

Tackling the clinical documentation burden with Linguamatics and Intelligent Medical Objects (IMO)

Allen Murvine & Dr. Roger Gildersleeve, IMO

While adoption of Electronic Health Records (EHR) systems across the US has risen significantly, so has the volume of physician-created, rich clinical narratives. The demands on healthcare providers to accurately identify and code the discrete clinical data elements from these narratives, used for decision support, interoperability and coding, have become a cause of burnout and dissatisfaction.

IMO is the leader in medical terminology solutions used by 450,000 physicians worldwide. IMO’s clinical terminology provides quick and easy capture of complete clinical intent, including mappings to the administrative code sets. The combination of IMO terminologies with Linguamatics I2E enables clinical concepts to be accurately extracted from unstructured free text, enabling the automation of physician documentation workflows. This presentation will highlight how the two technologies, combined, create a solution that provides the data needed for decision support, interoperability and coding while reducing the clinical documentation burden that physicians face today.

Beyond Regular Expression: Use of Natural Language Processing in Mining Pathology Reports and Operating Notes

Lue-Yen Tucker, Kaiser Permanente

We conducted two women’s health research studies to assess factors associated with hysterectomy surgical route and a surgical technique ‘power morcellation’, as well as patient outcomes for women with benign conditions and leiomyosarcoma.  Most of the variables for our analyses were readily available from our robust electronic databases.  However, uterine weight, estimated blood loss (EBL), and use of power morcellation are unstructured data available only in the pathology reports and the operative notes.  In mid-2014, we were offered the opportunity to participate in a natural language processing workshop hosted by Linguamatics.

The name ‘I2E’ itself was eye-catching, but what captured our attention was its capability to isolate terminologies of interest by incorporating linguistic relationships, sentence co-occurrence, and built-in ontologies. The graphical user interface enables the user to ‘build’ a query for specific tasks without writing lengthy programming code and allows for combining individual queries in various ways. We were extremely motivated and excited to employ I2E in our studies, because obtaining uterine weight, morcellation status and type, and EBL by keyword search followed by manual review of small segments of the source documents for ~32,000 women undergoing hysterectomy procedures for benign conditions in 2008-2015 would be time intensive. We believed I2E would be a far more efficient and accurate way to mine the unstructured data of interest. In this presentation, we will provide a methodological overview of our experiences using I2E in (1) structuring our data; (2) identifying search terms of interest; (3) designing a query scheme; and (4) validating results.

The language of cells: using NLP to build a resource to facilitate research into the nature of cellular communication.

Peter Hornbeck, Cell Signaling Technology

Automated Quality Review: detecting discrepancies in Blinded Data Reviews

Bryan Morganti, Pfizer & Paul Milligan, Linguamatics

Checking for errors in documents for FDA submission is a complex problem, which is currently undertaken as a manual and costly process.

Automation of this process -- ensuring that error checks are reliable, repeatable and accurate -- could speed it up and make it more efficient.

Pfizer have identified that analysis of Blinded Data Reviews is a key stage in the submission process: early enough in the process to prevent errors from propagating, but late enough that errors are difficult to detect manually. Pfizer and Linguamatics have worked together on a project to create a solution that allows reviewers to submit document packets and use I2E to generate reports summarizing the detected errors.

The solution has been implemented on premise at Pfizer, ensuring that the sensitivity of these documents is preserved.

Mastering the Data Universe: Turning your structured and unstructured data into critical elements for a patient, product or customer 360 degree view.

Anthony Sheaffer, Informatica

Mastering information from multiple data sources is challenging, and even harder when you try to incorporate unstructured content as well. We will discuss some of the market disruptors that are driving a new way of looking at data from a holistic perspective. Whether your data resides in an Oracle table or a PDF document, we will discuss how this information is being used to capture the 360 degree view to enable:

  1. Better treatment of patients
  2. Improved research and documentation
  3. Management of regulatory requirements
  4. Creation of a self-subscribing system

Hands-on Training

Hands-on training workshops, run by our text mining experts, are available for total beginners, intermediate users and advanced users.

Access the Hands-On Training Workshop selection guide, schedule and descriptions.

Reasons to attend

  1. Gain first-hand knowledge and experience on how structured and unstructured content can be mined to uncover valuable information
  2. Gain hands-on experience of NLP text mining through workshops and training
  3. Understand the challenges other pharmaceutical and healthcare professionals are facing and explore solutions to these challenges
  4. Gain a better understanding of NLP text mining and where it can fit into your organization
  5. Network and exchange ideas with peers and text mining experts
  6. Be the first to earn your Query User Certificate

Linguamatics I2E Query Hackathon

Linguamatics I2E Healthcare Hackathon 2017


This half-day event utilizing the Linguamatics I2E platform will allow teams to solve real-life challenges. Sign up for the 2017 Hackathon and benefit from a greater understanding of query strategies and learn new skills from colleagues and Linguamatics NLP text mining experts.   

The challenge

There is considerable interest in mining drug adverse events (AEs). Pharmaceutical companies and regulatory agencies need to record any new/unknown AEs found in real-world situations, and extract AEs from drug labels for comparison. Hospital systems need to monitor for reports of AEs, and are keen to ensure patients do not receive medications which may hinder their clinical care. They may therefore want to automatically check known AEs using the most up-to-date drug label information available, reducing AE-related prescription errors caused by lag time within their EHR system.

In this hackathon we will look at extracting AEs from drug labels and relating them to severity terms e.g. “mild” or “acute”. A particular challenge is to avoid extracting terms as AEs when they are actually the indication for a drug, or are negated e.g. “no evidence of headaches”.  
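As a rough illustration of the task described above, the following is a minimal plain-Python sketch, not I2E query syntax and not the hackathon's reference solution. The term lists, the cue lists, the function name and the 40-character look-back window are all assumptions made for this example; a real solution would draw on ontologies and linguistic structure rather than fixed word lists.

```python
import re

# Hypothetical vocabularies, for illustration only.
AE_TERMS = ["headache", "nausea", "rash"]
SEVERITY_TERMS = ["mild", "moderate", "severe", "acute"]
NEGATION_CUES = ["no evidence of", "no", "not", "without"]
INDICATION_CUES = ["indicated for", "for the treatment of"]


def _has_cue(text, cues):
    """Return True if any cue occurs as whole words in `text`."""
    return any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues)


def extract_adverse_events(sentence, window=40):
    """Return (ae_term, severity_or_None) pairs found in one sentence.

    A mention is skipped when a negation or indication cue appears in the
    `window` characters before it. Severity is taken only from the word
    immediately preceding the mention.
    """
    lowered = sentence.lower()
    results = []
    for term in AE_TERMS:
        # Allow a simple plural "s" after the term.
        for match in re.finditer(r"\b" + re.escape(term) + r"s?\b", lowered):
            before = lowered[max(0, match.start() - window):match.start()]
            if _has_cue(before, NEGATION_CUES) or _has_cue(before, INDICATION_CUES):
                continue
            previous_word = re.search(r"(\w+)\W*$", before)
            severity = None
            if previous_word and previous_word.group(1) in SEVERITY_TERMS:
                severity = previous_word.group(1)
            results.append((term, severity))
    return results
```

With these assumed lists, "Patients reported mild headaches and nausea." yields a "headache" mention with severity "mild" and a "nausea" mention with no severity, while "There was no evidence of headaches." yields nothing. A fixed character window is crude: it can suppress a genuine AE that merely follows an unrelated negation in the same sentence, which is one reason the challenge is not trivial.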

We will provide a corpus of drug label documents for training. We will also provide integrated automatic evaluation on both training and test documents so that participants can judge their progress.  
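The page does not say which metrics the automatic evaluation reports. Assuming the common choice of precision, recall and F1 over extracted items, a participant could score their own output against annotated data along these lines. The function name and the (document id, AE term) pair representation are hypothetical.

```python
def score_extractions(predicted, gold):
    """Score predicted (document_id, ae_term) pairs against gold pairs.

    Assumes exact-match scoring; the hackathon's actual evaluation may differ.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Low precision here would point to over-extraction, such as picking up indications or negated terms as AEs, while low recall would point to missed mentions.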

Participants will also have access to the full FDA Drug Labels index for data discovery from larger scale, unannotated data.


At the end of the session we will present some hints and tips for getting good results.
This year participants will be able to run automated evaluation against training data during the hackathon. Automated evaluation will also be run against unseen annotated test data.  

Who should attend?

The I2E Healthcare Hackathon is for existing I2E users rather than new users. Participation in the hackathon is free, but registration is required and spaces are limited, so please sign up early.

Query User Certificate

At Linguamatics, we're developing a Certificate Program for I2E users. We're starting with a Query User Certificate, and building up to Query Creator and Query Strategist Certificates.

To ensure the Certificates are meaningful, each will be associated with a set of learning objectives, learning materials, a training event (such as the TMS workshops), and an exam.

The TMS will be your first opportunity to try the Query User exam. It will be open to those who have just taken the Introduction to I2E courses, as well as more established users.

Travel and Accommodation

The Linguamatics Text Mining Summit will be held at the Wentworth by the Sea Hotel in New Castle, New Hampshire. Overlooking the Atlantic Ocean from the island of New Castle, Wentworth by the Sea, a Marriott Hotel & Spa, welcomes guests to one of the last grand Portsmouth hotels. Our AAA Four-Diamond retreat features the three original Victorian towers constructed in the 1870s, along with 161 stately guest rooms and suites that blend the hotel’s historic elegance with luxurious, modern amenities.


Linguamatics has negotiated a special rate of $259.00/night for the conference. To book a room, please visit the link below. The conference rate is valid through September 10, 2017.

Book your group rate for Linguamatics Text Mining Summit 2017

Getting to Wentworth by the Sea

Directions to Wentworth by the Sea can be found on the hotel's website.

Transportation from Boston-Logan Airport

C & J Bus Lines serves Logan Airport. The Portsmouth terminal is at Pease International Tradeport — 7 miles from the Wentworth by the Sea. Visitors will need to take a taxi to Wentworth. The bus schedule can be viewed on their website.

Tickets can be booked online.