Commit 95ab78d0 authored by jcorvi

First Commit

#
# Project specific excludes
#
tomcat
#
# Default excludes
#
# Binaries
*.7z
*.dmg
*.gz
*.iso
*.jar
*.rar
*.tar
*.zip
*.war
*.ear
*.sar
*.class
# Maven
target/
# IntelliJ project files
*.iml
*.iws
*.ipr
.idea/
# eclipse project file
.settings/
.classpath
.project
# NetBeans specific
nbproject/private/
build/
nbbuild/
dist/
nbdist/
nbactions.xml
nb-configuration.xml
# OS
.DS_Store
# Misc
*.swp
release.properties
pom.xml.releaseBackup
pom.xml.tag
#custom
pos/
FROM alpine:3.9
WORKDIR /usr/local/share/umlstagger
ARG UMLS_TAGGER_VERSION=1.0
COPY docker-build.sh /usr/local/bin/docker-build.sh
COPY src src
COPY jape_rules jape_rules
COPY config.properties config.properties
COPY pom.xml .
RUN mkdir logs
RUN chmod u=rwx,g=rwx,o=r /usr/local/share/umlstagger -R
RUN chmod u=rwx,g=rwx,o=rwx logs -R
RUN docker-build.sh ${UMLS_TAGGER_VERSION}
# umls-tagger
<b>Tagger of UMLS terminology</b>
# Description
UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services, including electronic health records.
The umls-tagger uses the Metathesaurus, which contains over one million biomedical concepts from over 100 source vocabularies.
The umls-tagger annotates documents with UMLS Metathesaurus terminology, given a configuration file that indicates the sources and the semantic types to use.
It reads the Metathesaurus data files in RRF (Rich Release Format), which are generated by running MetamorphoSys, the installation and customization tool of UMLS. MetamorphoSys is included in the UMLS download.
For an overview of UMLS, see:
https://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_001.html
To install UMLS with MetamorphoSys, see:
https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html
The output of the tagger is a set of GATE files annotated with the selected terminology.
## Requirements
UMLS installed through MetamorphoSys.
Running MetamorphoSys generates a subset of terminologies based on your own configuration. For more information, see https://www.nlm.nih.gov/research/umls/implementation_resources/metamorphosys/help.html
The installation directory must be provided when running the tagger. It has to be the META directory that contains the Metathesaurus RRF files.
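
As a quick sanity check before running the tagger, the sketch below tests whether a directory looks like a META folder. It is not part of umls-tagger: the class `MetaDirCheck` and its method are hypothetical, and the list of expected files (`MRCONSO.RRF`, `MRSTY.RRF`) is an assumption about which Metathesaurus files a run needs.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of umls-tagger.
class MetaDirCheck {
    // Assumption: these are the RRF files a tagging run needs; adjust as required.
    static final String[] EXPECTED = {"MRCONSO.RRF", "MRSTY.RRF"};

    // Returns the names of the expected RRF files that are absent from metaDir.
    static List<String> missingFiles(Path metaDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isRegularFile(metaDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The directory to check is taken from the first argument, defaulting to ".".
        Path metaDir = Paths.get(args.length > 0 ? args[0] : ".");
        if (!Files.isDirectory(metaDir)) {
            System.out.println(metaDir + " is not a directory");
            return;
        }
        List<String> missing = missingFiles(metaDir);
        System.out.println(missing.isEmpty()
                ? "META directory looks complete"
                : "Missing files: " + missing);
    }
}
```

Running it against the path that will later be mounted as `/in_umls` can catch a wrong mount point before the container starts.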
## Configuration
A configuration file specifies the sources and the semantic types used during the tagging process. If none is provided, a default one is used.
Here is an example:
[SOURCES]
#all sources present in the umls subset, no spaces between pipes
sources=ALL_SOURCES
#specific sources, these are the UMLS source codes
#sources=MDR|SNOMEDCT_VET|
[SEMANTIC_TYPES]
#This describes the mapping between the UMLS classification of terms and the labels that the user wants to obtain.
#Each line is a mapping; separated by |.
#The first element is the UMLS semantic type, the second is only a description of the semantic type from umls; and the
#third one is the LABEL that we are going to obtain if a term is reached.
#SPECIES
T011|Amphibian|SPECIES
T010|Vertebrate|SPECIES
#ANATOMY
T018|Embryonic Structure|ANATOMY
[SEMANTIC_TYPES_END]
#These are the excluded semantic types by source.
[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE]
SNOMEDCT_US=T033
SNOMEDCT_VET=T033
[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE_END]
In this example all the sources of the MetamorphoSys subset are used, and only the semantic types Amphibian, Vertebrate and Embryonic Structure are annotated. Each of these semantic types is mapped to its corresponding label and annotated with it.
If a specific source generates a lot of noise in your tagging task, you can exclude its semantic types in [EXCLUDED_SEMANTIC_TYPES_BY_SOURCE]. In the example, the semantic type T033 is excluded for the sources SNOMEDCT_US and SNOMEDCT_VET.
An important task is to analyze and define which sources and semantic types matter for your analysis.
[SOURCES], [SEMANTIC_TYPES] and [SEMANTIC_TYPES_END] are required and have to be present in the file.
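
To make the file layout concrete, here is a minimal sketch of how such a configuration could be read. It is an illustration under assumptions, not the loader shipped with umls-tagger: the class `ConfigSketch` and its fields are hypothetical, and it assumes one semantic type per source in the exclusion section, as in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser sketch; the real loader in umls-tagger may differ.
class ConfigSketch {
    final List<String> sources = new ArrayList<>();
    final Map<String, String> labelByType = new HashMap<>();       // e.g. T011 -> SPECIES
    final Map<String, String> excludedBySource = new HashMap<>();  // e.g. SNOMEDCT_US -> T033

    static ConfigSketch parse(List<String> lines) {
        ConfigSketch cfg = new ConfigSketch();
        String section = "";
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                // An *_END marker closes the current section; any other header opens one.
                section = line.endsWith("_END]") ? "" : line;
                continue;
            }
            switch (section) {
                case "[SOURCES]": {
                    if (line.startsWith("sources=")) {
                        for (String s : line.substring("sources=".length()).split("\\|")) {
                            if (!s.isEmpty()) {
                                cfg.sources.add(s);
                            }
                        }
                    }
                    break;
                }
                case "[SEMANTIC_TYPES]": {
                    // type | description | label; the description is informational only
                    String[] parts = line.split("\\|");
                    if (parts.length == 3) {
                        cfg.labelByType.put(parts[0], parts[2]);
                    }
                    break;
                }
                case "[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE]": {
                    int eq = line.indexOf('=');
                    if (eq > 0) {
                        cfg.excludedBySource.put(line.substring(0, eq), line.substring(eq + 1));
                    }
                    break;
                }
                default:
                    break; // lines outside a known section are ignored
            }
        }
        return cfg;
    }

    public static void main(String[] args) {
        ConfigSketch cfg = parse(Arrays.asList(
                "[SOURCES]", "sources=ALL_SOURCES",
                "[SEMANTIC_TYPES]", "#ANATOMY", "T018|Embryonic Structure|ANATOMY",
                "[SEMANTIC_TYPES_END]",
                "[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE]", "SNOMEDCT_US=T033",
                "[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE_END]"));
        System.out.println("sources: " + cfg.sources);
        System.out.println("label for T018: " + cfg.labelByType.get("T018"));
        System.out.println("excluded for SNOMEDCT_US: " + cfg.excludedBySource.get("SNOMEDCT_US"));
    }
}
```

The sketch tracks the current section by its header line, so [SOURCES] needs no closing marker: the next header simply replaces it.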
### Clone this component
git clone --depth 1 https://github.com/inab/docker-textmining-tools.git umls-tagger
cd umls-tagger
git filter-branch --prune-empty --subdirectory-filter umls-tagger HEAD
### Build and run the Docker image
# To build the Docker image, go into the umls-tagger folder and run
docker build -t umls-tagger .
# To run the container, set the input folder and the output folder
mkdir ${PWD}/umls_output; docker run --rm -u $UID -v /home/user/2018AB/EXAMPLE/META:/in_umls:ro -v ${PWD}/input_output:/in:ro -v ${PWD}/umls_output:/out:rw umls-tagger umls-tagger -u /in_umls -c /in/config.properties -i /in -o /out -d /out
Parameters:
<p>
-u input directory of the UMLS subset where the RRF files are located, usually the ... META folder.
</p>
<p>
-c configuration file that contains the semantic type mappings and the sources to be used during the mapping. If no configuration file is provided, a default one will be used.
</p>
<p>
-i input folder with the documents to annotate. The documents can be plain text or GATE-formatted XML documents.
</p>
<p>
-o output folder for the annotated documents, written in GATE format.
</p>
<p>
-a annotation set where the annotations will be included. This option is required.
</p>
<p>
-d optional destination folder for the internal dictionary generated from the UMLS terminology. If not set, an internal path is used. This option is recommended if you want access to the gazetteer generated with your configuration.
</p>
## Built With
* [Docker](https://www.docker.com/) - Docker Containers
* [Maven](https://maven.apache.org/) - Dependency Management
* [StanfordCoreNLP](https://stanfordnlp.github.io/CoreNLP/) - Stanford CoreNLP – Natural language software
* [GATE](https://gate.ac.uk/overview.html) - GATE: a full-lifecycle open source solution for text processing
## Versioning
We use [SemVer](http://semver.org/) for versioning. For the versions available, see the [tags on this repository](https://github.com/inab/docker-textmining-tools/edit/master/nlp-standard-preprocessing/tags).
## Authors
* **Javier Corvi**
## License
This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the [LICENSE.md](LICENSE.md) file for details
[SOURCES]
#all sources present in the umls subset, no spaces between pipes
#sources=ALL_SOURCES
#specific sources, these are the UMLS source codes
sources=MSH|MDR|OMIM|SNOMEDCT_US|SNOMEDCT_VET|ICD10CM|NCBI|WHO|HPO|NCI_CTCAE
[SEMANTIC_TYPES]
#This describes the mapping between the UMLS classification of terms and the labels that the user wants to obtain.
#Each line is a mapping; separated by |.
#The first element is the UMLS semantic type, the second is only a description of the semantic type from umls; and the
#third one is the LABEL that we are going to obtain if a term is reached.
#SPECIES
T011|Amphibian|SPECIMEN
T010|Vertebrate|SPECIMEN
T014|Reptile|SPECIMEN
T001|Organism|SPECIMEN
T015|Mammal|SPECIMEN
T013|Fish|SPECIMEN
T005|Virus|SPECIMEN
T012|Bird|SPECIMEN
#ANATOMY
T023|Body Part, Organ, or Organ Component|SPECIMEN
T018|Embryonic Structure|SPECIMEN
T021|Fully Formed Anatomical Structure|SPECIMEN
#T017|Anatomical Structure|ANATOMY
T024|Tissue|SPECIMEN
#FINDINGS
T033|Finding|FINDING
T034|Laboratory or Test Result|FINDING
#T037|Injury or Poisoning|FINDING
T046|Pathologic Function|FINDING
T184|Sign or Symptom|FINDING
T047|Disease or Syndrome|FINDING
#T048|Mental or Behavioral Dysfunction|FINDING
#T191|Neoplastic Process|FINDING
T019|Congenital Abnormality|FINDING
T020|Acquired Abnormality|FINDING
T190|Anatomical Abnormality|FINDING
#TEST
#T060|Diagnostic Procedure|STUDY_TESTCD
T063|Molecular Biology Research Technique|STUDY_TESTCD
T059|Laboratory Procedure|STUDY_TESTCD
#T096|Group|GROUP
[SEMANTIC_TYPES_END]
#These are the excluded semantic types by source
[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE]
SNOMEDCT_US=T033
SNOMEDCT_VET=T033
[EXCLUDED_SEMANTIC_TYPES_BY_SOURCE_END]
terms.lst:UMLS:UMLS
\ No newline at end of file
#!/bin/sh
BASEDIR=/usr/local
UMLS_TAGGER_HOME="${BASEDIR}/share/umlstagger/"
UMLS_TAGGER_VERSION=1.0
# Exit on error
set -e
if [ $# -ge 1 ] ; then
UMLS_TAGGER_VERSION="$1"
fi
if [ -f /etc/alpine-release ] ; then
# Installing OpenJDK 8
apk add --update openjdk8-jre
# umls-tagger development dependencies
apk add openjdk8 git maven
else
# Runtime dependencies (-y keeps the build non-interactive)
apt-get update
apt-get install -y openjdk-8-jre
# The development dependencies
apt-get install -y openjdk-8-jdk git maven
fi
mvn clean install -DskipTests
#rename jar
mv target/umls-tagger-0.0.1-SNAPSHOT-jar-with-dependencies.jar umls-tagger-${UMLS_TAGGER_VERSION}.jar
cat > /usr/local/bin/umls-tagger <<EOF
#!/bin/sh
exec java \$JAVA_OPTS -jar "${UMLS_TAGGER_HOME}/umls-tagger-${UMLS_TAGGER_VERSION}.jar" -workdir "${UMLS_TAGGER_HOME}" "\$@"
EOF
chmod +x /usr/local/bin/umls-tagger
#delete target
rm -R target src pom.xml
if [ -f /etc/alpine-release ] ; then
# Add bash for Nextflow (apk only exists on Alpine)
apk add bash
# Remove tools that are only needed at build time
apk del openjdk8 git maven
rm -rf /var/cache/apk/*
else
apt-get remove -y openjdk-8-jdk git maven
rm -rf /var/cache/dpkg
fi
Imports: {
import static gate.Utils.*;
}
Phase:firstphase
Input: Lookup
Options: control = appelt
Rule: basic_mapp
(
{Lookup.majorType=="UMLS"}
)
:lookup
-->
{
gate.AnnotationSet lookup = (gate.AnnotationSet) bindings.get("lookup");
gate.Annotation ann = (gate.Annotation) lookup.iterator().next();
String content = stringFor(doc, ann);
FeatureMap lookupFeatures = ann.getFeatures();
gate.FeatureMap features = Factory.newFeatureMap();
lookupFeatures.remove("majorType");
lookupFeatures.remove("minorType");
features.put("SOURCE","UMLS");
features.put("text",content);
String minorType = lookupFeatures.get("LABEL").toString();
lookupFeatures.remove("LABEL");
features.putAll(lookupFeatures);
try{
outputAS.add(lookup.firstNode().getOffset(),lookup.lastNode().getOffset(),minorType, features);
}catch(InvalidOffsetException e){
throw new LuckyException(e);
}
//remove old lookup
//inputAS.remove(ann);
}
\ No newline at end of file
Phase:secondphase
Input: Lookup
Options: control = all
Rule: delete_rule
(
{Lookup}
)
:lookup
-->
{
gate.AnnotationSet lookup = (gate.AnnotationSet) bindings.get("lookup");
gate.Annotation ann = (gate.Annotation) lookup.iterator().next();
inputAS.remove(ann);
}
\ No newline at end of file
MultiPhase: Main
Phases:
basic_mapping
delete_lookups
\ No newline at end of file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>es.bsc.inb.nlp</groupId>
<artifactId>umls-tagger</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>umls-tagger</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>uk.ac.gate</groupId>
<artifactId>gate-core</artifactId>
<version>8.5.1</version>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>1.4</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<archive>
<manifest>
<mainClass>
es.bsc.inb.umlstagger.main.App
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
\ No newline at end of file
package es.bsc.inb.umlstagger.main;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.log4j.Logger;
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.LanguageAnalyser;
import gate.ProcessingResource;
import gate.creole.Plugin;
import gate.creole.SerialAnalyserController;
import gate.util.ExtensionFileFilter;
import gate.util.GateException;
/**
* UMLS Tagger.
* Given the UMLS terminology, this tool annotates documents using a given configuration of sources and semantic types.
*
* @author jcorvi
*
*/
public class App {
static final Logger log = Logger.getLogger("log");
static Map<String,String> semanticTypesMap = new HashMap<String,String>();
static Map<String,String> semanticTypesMapExcluded = new HashMap<String,String>();
static List<String> sourceList = new ArrayList<String>();
public static void main( String[] args ){
Options options = new Options();
Option input = new Option("i", "input", true, "input directory path");
input.setRequired(true);
options.addOption(input);
Option output = new Option("o", "output", true, "output directory path");
output.setRequired(true);
options.addOption(output);
Option set = new Option("a", "annotation_set", true, "Annotation set where the annotation will be included");
set.setRequired(true);
options.addOption(set);
Option inputUMLSDirectory = new Option("u", "input_umls_directory", true, "input directory where the RRF files are located, usually are in ... META folder");
inputUMLSDirectory.setRequired(true);
options.addOption(inputUMLSDirectory);
Option configuration_file = new Option("c", "configuration_file", true, "it contains the semantic type mappings and the sources to be used during the mapping. "
+ " If no configuration file is provided, a default one will be used. ");
configuration_file.setRequired(false);
options.addOption(configuration_file);
Option workdir = new Option("workdir", "workdir", true, "workDir directory path");
workdir.setRequired(false);
options.addOption(workdir);
Option dictOutput = new Option("d", "dictOutput", true, "Optional destination folder of the internal dictionary generated from the UMLS terminology. If not set, an internal path is used. This option is recommended if you want to have access to the gazetteer generated with your configuration.");
dictOutput.setRequired(false);
options.addOption(dictOutput);
CommandLineParser parser = new DefaultParser();
HelpFormatter formatter = new HelpFormatter();
CommandLine cmd = null;
try {
cmd = parser.parse(options, args);
} catch (ParseException e) {
System.out.println(e.getMessage());
formatter.printHelp("umls-tagger", options);
System.exit(1);
}
String inputFilePath = cmd.getOptionValue("input");
String outputFilePath = cmd.getOptionValue("output");
String workdirPath = cmd.getOptionValue("workdir");
String umlsDirectoryPath = cmd.getOptionValue("input_umls_directory");
String configurationFilePath = cmd.getOptionValue("configuration_file");
String annotationSet = cmd.getOptionValue("annotation_set");
String dictOutputPath = cmd.getOptionValue("dictOutput");
if (!java.nio.file.Files.isDirectory(Paths.get(umlsDirectoryPath))) {
System.out.println("Please set the input_umls_directory");
System.exit(1);
}
if (annotationSet==null) {
System.out.println("Please set the annotation set where the annotation will be included");
System.exit(1);
}
if (!java.nio.file.Files.isDirectory(Paths.get(inputFilePath))) {
System.out.println("Please set the inputDirectoryPath ");
System.exit(1);
}
File outputDirectory = new File(outputFilePath);
if(!outputDirectory.exists())
outputDirectory.mkdirs();
if(workdirPath==null) {
workdirPath="";
}
try {
loadConfigurationFile(workdirPath, configurationFilePath);
}catch(Exception e) {
System.out.println("Exception occurred, see the log for more information");
e.printStackTrace();
System.exit(1);
}
String listsDefinitionsPath = null;
try {
String dictFolderPath = workdirPath + "dictionary";
if(dictOutputPath!=null) {
if (!java.nio.file.Files.isDirectory(Paths.get(dictOutputPath))) {
System.out.println("The outputDictionaryFolder : " + dictOutputPath + " does not exist, no output of dictionary will be done ");
}else {
dictFolderPath = dictOutputPath;
}
}
File dictFolder = new File(dictFolderPath);
if(!dictFolder.exists())
dictFolder.mkdirs();
String gazzeteer = dictFolder + File.separator + "terms.lst";
listsDefinitionsPath = dictFolder + File.separator + "lists.def";
generateDictionary(umlsDirectoryPath, gazzeteer, listsDefinitionsPath);
}catch(Exception e) {
System.out.println("Exception occurred, see the log for more information");
e.printStackTrace();
System.exit(1);
}
try {
Gate.init();
} catch (GateException e) {
System.out.println("App::main :: Gate Exception ");
e.printStackTrace();
System.exit(1);
}
try {
String japeRules = workdirPath+"jape_rules/main.jape";
process(inputFilePath, outputFilePath, listsDefinitionsPath, japeRules, annotationSet);
} catch (IOException e) {
e.printStackTrace();
System.exit(1);
} catch (GateException e) {
e.printStackTrace();
System.exit(1);
}
}
/**
* Load semantic type information from the MRSTY file and the given configuration
* @param mrstyPath
* @return
* @throws IOException
*/
private static Map<String, String[]> loadSemanticTypeData(String mrstyPath) throws IOException {
HashMap<String, String[]> map = new HashMap<String, String[]>();