Harnessing the power of AI, ML & NLP to drive patient-focused drug development

What is Natural Language Processing (NLP)?

Natural Language Processing is a branch of Artificial Intelligence (AI) that deals with the interaction between machines and human languages. It’s a way for machines to understand, interpret, and generate human language.

For example, NLP can be used to understand a customer’s question in a customer service chatbot or to summarize a news article automatically.

At Semalytix, NLP is used to collect and then analyse patient and physician experience data captured from online sources to extract valuable insights that can be used to improve patient outcomes, improve and de-risk clinical trial design, create superior patient engagement campaigns and support research in the pharmaceutical industry.

It’s a powerful tool to help make sense of the vast amount of unstructured data available in our archives, which can be difficult for humans to process on their own.

What is Record Anonymisation?

Record anonymisation is the process of removing or masking identifying information from a patient’s medical records so that the individual cannot be identified. It is important for privacy, compliance and research. This is typically done by removing personal information such as name, address, and date of birth, or by replacing it with a unique code or pseudonym. Semalytix applies its NLP algorithms to anonymise every data record it gathers, regardless of source, to protect the sensitive data it handles.
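The idea of replacing personal information with a code or pseudonym can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Semalytix implementation: the two regular expressions, the `known_names` list, the `salt` value and the `ID_` prefix are all hypothetical, and real records need much broader coverage than a name list and two patterns.

```python
from __future__ import annotations

import hashlib
import re

# Hypothetical patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")


def pseudonym(value: str, salt: str) -> str:
    """Map an identifier to a stable code: the same input always gives the same code."""
    digest = hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
    return "ID_" + digest[:8]


def anonymise(text: str, known_names: list[str], salt: str = "demo-salt") -> str:
    # Replace known names with a pseudonym so that mentions of the same
    # person stay linkable without revealing who the person is.
    for name in known_names:
        text = re.sub(re.escape(name), pseudonym(name, salt), text, flags=re.IGNORECASE)
    # Mask e-mail addresses and dates outright.
    text = EMAIL.sub("[EMAIL]", text)
    text = DATE.sub("[DATE]", text)
    return text


record = "Jane Doe (born 04/12/1980, jane.doe@example.com) reports less pain."
print(anonymise(record, ["Jane Doe"]))
```

The name is swapped for a stable code while the date and e-mail address are masked, so the medical content of the sentence ("reports less pain") survives for analysis.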

We make anonymisation a priority for three reasons:

  • Privacy: Removing or masking identifying information ensures that individuals cannot be identified from their records.
  • Compliance: Laws and regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the General Data Protection Regulation (GDPR) in the European Union set strict rules for handling patient data, and anonymisation is a key way of meeting them.
  • Research: Anonymised patient records can be used for research purposes, such as studying disease trends or evaluating the effectiveness of different treatments. This allows scientists to gain valuable insights while also protecting patient privacy.

While we strive to make every patient’s voice heard, we are committed to doing so in an ethically and legally compliant way to champion best practices while driving patient-centric drug development forward.

What is Machine Translation (MT)?

Machine Translation is the use of AI to translate text from one language to another automatically. It is a subfield of Natural Language Processing (NLP) that uses algorithms to analyze and understand the source language in a written text and then generate an equivalent text in the target language.

The remarkable benefits of Machine Translation include the following:

  • Speed: Machine Translation can translate large amounts of text quickly, which can save time and increase efficiency.
  • Cost-effective: Machine Translation can reduce the cost associated with manual translation, especially for large volumes of text.
  • Multilingual support: Machine Translation can support a wide range of languages, making it possible to communicate with people who speak different languages.
  • Continuous improvement: Machine Translation systems can become more accurate and efficient over time as they are trained on more data.

Semalytix is capable of reliable machine translation of patient and physician commentaries in 26 languages. This allows Semalytix to collect data globally, in stakeholders’ native languages, and to build representative subpopulations for globe-spanning analyses.

What are Ontologies?

An ontology is a way of organizing and categorizing information. It’s like a map or a set of instructions that tells a machine how different pieces of information are related to each other.

Think of it like a medical library catalogue system, where books are organized into different categories, such as disease areas, drug classes, and patient-reported benefits. Each category is further broken down into subcategories, such as different symptoms, specific drug products and brands, and quality-of-life improvement aspects. Each book also receives a unique identifier, its own ISBN.

At Semalytix, ontologies are used to organize information about different medical conditions, treatments, and drugs. Each category has its own unique identifier, such as an ICD-10 code for medical conditions, and the relationships between the categories are clearly defined, such as the relationship between a treatment and the medical condition it is used to treat.
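As a rough sketch of what "categories with unique identifiers and clearly defined relationships" can look like in code, the following Python example stores concepts by identifier and relations as subject-predicate-object triples. The class names, the `treats` predicate and the `TRT:` treatment identifiers are assumptions made for this illustration; only the ICD-10 codes (E11 for type 2 diabetes mellitus, J45 for asthma) are real.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    identifier: str  # e.g. an ICD-10 code for a medical condition
    label: str
    category: str    # "condition" or "treatment" in this sketch


@dataclass
class Ontology:
    concepts: dict[str, Concept] = field(default_factory=dict)
    # Each relation is a (subject_id, predicate, object_id) triple.
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def add(self, concept: Concept) -> None:
        self.concepts[concept.identifier] = concept

    def relate(self, subject_id: str, predicate: str, object_id: str) -> None:
        # Relations may only connect concepts that are already known.
        if subject_id not in self.concepts or object_id not in self.concepts:
            raise KeyError("both concepts must be added before relating them")
        self.relations.add((subject_id, predicate, object_id))

    def treatments_for(self, condition_id: str) -> list[str]:
        return sorted(
            self.concepts[subject].label
            for subject, predicate, obj in self.relations
            if predicate == "treats" and obj == condition_id
        )


onto = Ontology()
onto.add(Concept("E11", "Type 2 diabetes mellitus", "condition"))
onto.add(Concept("J45", "Asthma", "condition"))
onto.add(Concept("TRT:metformin", "Metformin", "treatment"))    # hypothetical ID
onto.add(Concept("TRT:salbutamol", "Salbutamol", "treatment"))  # hypothetical ID
onto.relate("TRT:metformin", "treats", "E11")
onto.relate("TRT:salbutamol", "treats", "J45")

print(onto.treatments_for("E11"))  # ['Metformin']
```

Because every concept is addressed by its identifier rather than by free text, a question such as "which treatments are linked to E11?" becomes a simple lookup over the stored relations.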

The use of ontologies can be beneficial in many fields, but it’s particularly useful in healthcare, where large amounts of data need to be organized and analyzed. By using ontologies, a machine can quickly understand how different pieces of information are related, which can help to improve fact-based AI predictions, large data analysis, and actionable insight generation.

What are Microservice Architectures and Enterprise-level NLP?

Semalytix uses microservice architectures for all of its AI and NLP components to design and build its Sphinx AI framework, which powers all analytical capabilities of our services. This approach allows us to provide a wide variety of text processing and text analytics capabilities, which can be combined into enterprise-level AI and NLP strategies for tackling large unstructured document archives in healthcare.


Data Streams

Our data streams collect patient and physician comments from more than 100 million sources, such as social media, online patient communities, drug review sites, and forums. The comments are tagged and analyzed using natural language processing techniques to extract relevant information, such as symptoms, treatments, quality-of-life aspects, and overall sentiment. The enriched content is then provided via an API or export, allowing for easy access and analysis of the data. In addition to our patient and physician data stream, we are working on streams for clinical trials, medical publications, and other relevant data sources.

Data normalisation

This text data normalization microservice is designed to standardize and unify the formatting and terminology of text data from various sources. This enables the creation of links across sources, making it easier to compare and analyze the data. The microservice applies natural language processing techniques to perform tasks such as case normalization, stemming, and lemmatization, which help to standardize the text data. It can also be configured to recognize and replace specific terms and phrases to ensure consistency across different data sources. The output of the service is a unified, normalized version of the text data that can be easily linked and compared to other sources.
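Two of the steps described above, case normalization and configurable term replacement, can be sketched with the Python standard library alone. This is an illustrative sketch rather than the service itself: the `TERM_MAP` entries are hypothetical, and stemming and lemmatization are left out because they would need a language-specific NLP library.

```python
import re
import unicodedata

# Hypothetical mapping from surface variants to one preferred term.
TERM_MAP = {
    "high blood pressure": "hypertension",
    "heart attack": "myocardial infarction",
    "t2d": "type 2 diabetes",
}

# Longest variants first so multi-word phrases win over shorter overlaps.
_TERM_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(term) for term in sorted(TERM_MAP, key=len, reverse=True))
    + r")\b"
)


def normalise(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = text.casefold()                      # case normalization
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace and line breaks
    # Replace configured variants with their preferred term.
    return _TERM_PATTERN.sub(lambda match: TERM_MAP[match.group(1)], text)


forum_post = "My  HIGH blood\npressure got worse after the Heart Attack."
review = "my hypertension got worse after the myocardial infarction."

print(normalise(forum_post))
# my hypertension got worse after the myocardial infarction.
print(normalise(forum_post) == normalise(review))  # True
```

After normalization the two differently written comments become identical strings, which is what makes linking and comparing records across sources straightforward.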

Healthcare Taggers

This microservice is designed to automatically recognize and tag medical entities within documents. It uses natural language processing and machine learning techniques to identify and extract relevant information, such as symptoms, treatments, drugs, and medical conditions, from the text. The microservice is specifically trained to understand the medical domain, which allows it to identify and tag relevant entities within the text accurately. The output of the service is a version of the document with the identified entities marked and tagged, making it easier to search and analyze the document and to extract insights from it.
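To make the output format concrete, here is a deliberately simple dictionary-based tagger in Python. It is not the trained machine learning model described above; a lookup table stands in for it so the example stays self-contained. The `LEXICON` entries, the label names and the inline `[text|LABEL]` markup are assumptions chosen for illustration.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class Entity(NamedTuple):
    start: int
    end: int
    text: str
    label: str


# Hypothetical mini-lexicon; a trained model would replace this lookup.
LEXICON = {
    "migraine": "CONDITION",
    "nausea": "SYMPTOM",
    "headache": "SYMPTOM",
    "sumatriptan": "DRUG",
    "physiotherapy": "TREATMENT",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, LEXICON)) + r")\b", re.IGNORECASE
)


def tag(text: str) -> list[Entity]:
    """Return every lexicon match together with its character offsets and label."""
    return [
        Entity(m.start(), m.end(), m.group(0), LEXICON[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


def annotate(text: str) -> str:
    """Return the document with each recognized entity marked inline."""
    return _PATTERN.sub(
        lambda m: f"[{m.group(0)}|{LEXICON[m.group(0).lower()]}]", text
    )


comment = "Sumatriptan eased my migraine but caused nausea."
print(annotate(comment))
# [Sumatriptan|DRUG] eased my [migraine|CONDITION] but caused [nausea|SYMPTOM].
```

`tag` returns offsets for downstream processing, while `annotate` produces the marked-up version of the document that a person can search or read directly.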

Machine translation of medical documents

This microservice is a machine translation tool that can translate medical documents in 26 languages, using neural machine translation techniques trained on a large medical corpus. The service is able to accurately translate medical terms, phrases and sentences while preserving meaning and context. It can also handle technical language and complex sentence structures. The output is a translated document that is easy to understand and can be integrated with other systems for tasks such as patient data tracking and clinical trial monitoring.

Custom pattern recognition

This microservice is a custom pattern recognition tool that is specifically designed for medical topics and can be equipped with different patterns, depending on the needs of the user. The patterns are created by data scientists at Semalytix and can include any topic relevant to pharmaceutical stakeholders, such as drug efficacy, side effects, patient sentiment, and more.
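One way to picture a recognizer that "can be equipped with different patterns" is a small class that accepts a list of named patterns and reports which topics fire on a given comment. The sketch below uses plain regular expressions; the class names, topic names and expressions are hypothetical examples, and the patterns actually built by Semalytix data scientists are not described in this text.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicPattern:
    topic: str
    regex: re.Pattern


def make_pattern(topic: str, expression: str) -> TopicPattern:
    return TopicPattern(topic, re.compile(expression, re.IGNORECASE))


class PatternRecogniser:
    """Holds whichever patterns the user chooses to equip it with."""

    def __init__(self, patterns: list[TopicPattern]) -> None:
        self.patterns = list(patterns)

    def match(self, text: str) -> dict[str, list[str]]:
        """Return {topic: [matched snippets]} for every pattern that fires."""
        hits: dict[str, list[str]] = {}
        for pattern in self.patterns:
            found = [m.group(0) for m in pattern.regex.finditer(text)]
            if found:
                hits[pattern.topic] = found
        return hits


# Hypothetical patterns for two of the topics mentioned above.
PATTERNS = [
    make_pattern(
        "side_effect",
        r"\b(?:made me|gave me|caused)\s+(?:\w+\s+){0,2}?"
        r"(?:dizzy|nauseous|nausea|rash|fatigue)\b",
    ),
    make_pattern(
        "efficacy",
        r"\b(?:really\s+)?(?:helped|worked|stopped working|did nothing)\b",
    ),
]

recogniser = PatternRecogniser(PATTERNS)
comment = "The new tablets really helped, but they gave me a rash."
print(recogniser.match(comment))
# {'side_effect': ['gave me a rash'], 'efficacy': ['really helped']}
```

Swapping in a different `PATTERNS` list changes what the recognizer looks for without touching the matching code, which mirrors the configurable design described above.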