Data Processing
With the LifeOmic Platform, in addition to being able to store omics data, you can index it. Once indexed, the data is available for deeper analysis using the Omics Explorer and other parts of the platform.
Definitions
- Project - A logical grouping of omics and clinical data within the LifeOmic Platform. A project is a dataset of the subjects assigned to the project.
- Reference Genome - A project specifies a reference genome for the data within it. Reference genomes GRCh37 and GRCh38 are currently supported in the LifeOmic Platform. It is not possible to mix data from different genomes within the same project.
- Omics Test - A logical grouping of omics data records that represent a single test or event.
- Subject - The biological source of omics data, such as patients involved in a study.
Overview
The basic process for adding omics data to the LifeOmic Platform involves the following steps:
- Upload the omics data file to the LifeOmic Platform. See How do I add files to the LifeOmic Platform?.
- Initiate the proper indexing action based on the omics data type, which involves associating the omics data with a Subject record in the LifeOmic Platform. Alternatively, upload a manifest file that defines the genomic test and associates the files with a subject. See the Omics Test Manifest File section for more information.
- The LifeOmic Platform processes the file. During this time, the data is transformed and indexed to make it highly available and queryable. For certain data types, external knowledge bases are brought in and the omics data is annotated with additional information. For example, short variant records are annotated with information from sources such as ClinVar and dbSNP.
- After the data processing completes, you can view the omics data in Omics Explorer or act on the data with a feature, such as Tasks.
Processing Time Frames
The time frame for omics data to be available in the platform is dependent upon several factors:
- Number of omics files and overall size of the data being ingested at the same time
- Current overall size of the omics data in a project. Indexing times can increase as the size of the project increases.
Omics Test Manifest File
Using a manifest file allows you to initiate the indexing process by uploading a file to a project instead of requiring a command. The manifest file describes the omics test by listing the files included in the test and referencing a patient.
The name of the manifest file must end with a .ga4gh.yml extension. The file must be located within the project in the .lifeomic folder or a subfolder within the .lifeomic folder. Examples of valid file names are: .lifeomic/manifests/test1.ga4gh.yml and .lifeomic/test2.ga4gh.yml.
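The two naming rules above can be checked before uploading. The following is a minimal sketch, not platform code: the function name is hypothetical, and it assumes file names are given relative to the project root without a leading slash.

```python
from pathlib import PurePosixPath

MANIFEST_SUFFIX = ".ga4gh.yml"
MANIFEST_ROOT = ".lifeomic"


def is_valid_manifest_name(file_name: str) -> bool:
    """Return True if file_name follows the manifest naming rules."""
    path = PurePosixPath(file_name)
    # Rule 1: the file name must end with the .ga4gh.yml extension.
    if not path.name.endswith(MANIFEST_SUFFIX):
        return False
    # Rule 2: the file must sit in .lifeomic or in one of its subfolders.
    # Assumption: names are relative to the project root (no leading slash).
    return len(path.parts) >= 2 and path.parts[0] == MANIFEST_ROOT
```

For example, is_valid_manifest_name(".lifeomic/manifests/test1.ga4gh.yml") returns True, while a file placed outside the .lifeomic folder is rejected.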
Patient Matching
Matching a manifest file to a patient in the LifeOmic Platform can be done in one of two ways:
- Provide the Patient FHIR resource ID in the patientId property.
- Provide at least two of the following three properties:
  - patientDOB
  - patientLastName
  - patientIdentifier
If all three properties are provided, a match is attempted using all three. If a match with all three cannot be found, then a match using a combination of two of the properties is attempted. If only two properties are provided in the manifest file, then a match is attempted using just the two properties.
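The fallback order described above can be illustrated with a short sketch. This is not the platform's implementation: the match_patient function and the in-memory patients list are hypothetical, each candidate patient is assumed to be a plain dictionary carrying the same property names as the manifest, and the order in which the two-property combinations are tried is an assumption.

```python
from itertools import combinations
from typing import Optional

MATCH_KEYS = ("patientDOB", "patientLastName", "patientIdentifier")


def match_patient(test: dict, patients: list[dict]) -> Optional[dict]:
    """Return the first patient that matches the manifest test, or None."""
    # Option 1: an explicit FHIR resource ID (assumed to take precedence).
    patient_id = test.get("patientId")
    if patient_id is not None:
        return next((p for p in patients if p.get("id") == patient_id), None)

    # Option 2: at least two of the three matching properties.
    provided = {key: test[key] for key in MATCH_KEYS if test.get(key)}
    if len(provided) < 2:
        raise ValueError("provide patientId or at least two matching properties")

    # Try all provided properties first, then fall back to pairs.
    for size in range(len(provided), 1, -1):
        for keys in combinations(provided, size):
            for patient in patients:
                if all(patient.get(key) == provided[key] for key in keys):
                    return patient
    return None
```

With three properties supplied and one of them wrong, the three-way attempt fails and a pair of the remaining two properties can still produce a match.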
The schema of the manifest file must match the following:
- tests - List of genomic tests. It is possible to have multiple tests in order to index a batch at one time.
  - name - A friendly name for the genomic test (required)
  - testType - The type of test that was performed (required)
  - reference - The reference build. Must be GRCh37 or GRCh38 and must match the reference build configured on the project. You cannot add tests from different reference builds in the same project. (required)
  - patientIdentifier - The patient identifier for the test. This can be an MRN or some other ID associated with the patient record (required with patientDOB or patientLastName)
  - patientLastName - The patient's last name (required with patientDOB or patientIdentifier)
  - patientDOB - The patient's date of birth (required with patientLastName or patientIdentifier)
  - patientId - The patient's FHIR resource ID (required if patientLastName, patientIdentifier, and patientDOB are not present)
  - patientInfo - Additional patient information that can be used to create the Patient FHIR resource if it does not already exist
    - firstName - Patient first name
    - lastName - Patient last name
    - dob - Patient date of birth
    - gender - Patient gender
    - identifiers - List of patient identifiers
      - value - The identifier value
      - system - The namespace for the identifier value
      - codingSystem - Identity of the terminology system
      - codingCode - Code symbol defined by the system
  - indexedDate - The date the test was performed (defaults to current date if not provided)
  - performerId - The ID of the FHIR Organization resource that performed the test or created the data.
  - bodySite - Coded value for the body site where the test specimen was taken from
  - bodySiteSystem - Identity of the terminology system for the body site value
  - bodySiteDisplay - Friendly display name for body site value
  - sourceFile - The name of the file that was used to generate the genomic test files
  - reportFile - The name of a report file to associate with the genomic test
  - msi - For a somatic test, the Microsatellite Instability result. Must be low, stable, high, or indeterminate
  - tmb - For a somatic test, the Tumor Mutation Burden result. Must be low, high, indeterminate, or unknown
  - tmbScore - For a somatic test, the Tumor Mutation Burden numeric score
  - files - List of genomic files to include with the test (required)
    - type - Valid values are shortVariant, read, expression, copyNumberVariant, or structuralVariant (required)
    - sequenceType - Valid values are germline, somatic, metastatic, ctDNA, or rna (required)
    - fileName - The name of the file in the LifeOmic Platform Project (required)
    - name - A display name to give the test file. Defaults to the name of the file
    - normalize - For a shortVariant VCF file, set to true if the data should be normalized by the LifeOmic Platform
    - passOnly - For a shortVariant VCF file, set to true to exclude variants that do not have a filter value of PASS
    - updateSample - For a shortVariant VCF file, set to true to generate a unique sample name. Defaults to the value in the VCF
  - resources - An optional list of files that will be added to the subject's Document References
    - fileName - The name of the file in the LifeOmic Platform Project (required)
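The required properties and allowed values listed above lend themselves to a pre-upload check. The sketch below is illustrative only: validate_test and its project_reference parameter are hypothetical names, the manifest is assumed to be already parsed into a dictionary, and patient matching properties are deliberately not checked here.

```python
VALID_REFERENCES = {"GRCh37", "GRCh38"}
VALID_MSI = {"low", "stable", "high", "indeterminate"}
VALID_TMB = {"low", "high", "indeterminate", "unknown"}
VALID_FILE_TYPES = {
    "shortVariant", "read", "expression", "copyNumberVariant", "structuralVariant",
}
VALID_SEQUENCE_TYPES = {"germline", "somatic", "metastatic", "ctDNA", "rna"}


def validate_test(test: dict, project_reference: str) -> list[str]:
    """Return a list of problems found in one entry of the tests list."""
    errors = []
    for key in ("name", "testType", "reference", "files"):
        if not test.get(key):
            errors.append(f"{key} is required")

    reference = test.get("reference")
    if reference and reference not in VALID_REFERENCES:
        errors.append("reference must be GRCh37 or GRCh38")
    elif reference and reference != project_reference:
        errors.append("reference must match the project's reference build")

    for key, allowed in (("msi", VALID_MSI), ("tmb", VALID_TMB)):
        if key in test and test[key] not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}")

    for index, entry in enumerate(test.get("files") or []):
        if entry.get("type") not in VALID_FILE_TYPES:
            errors.append(f"files[{index}].type is missing or invalid")
        if entry.get("sequenceType") not in VALID_SEQUENCE_TYPES:
            errors.append(f"files[{index}].sequenceType is missing or invalid")
        if not entry.get("fileName"):
            errors.append(f"files[{index}].fileName is required")
    return errors
```

An empty list means no problem was detected by these checks; it does not guarantee that the platform will accept the manifest.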
Here is an example manifest file for a test that includes files for all omics types.
---
tests:
  - name: Big Genomic Test
    testType: Germline/Somatic Combo
    patientIdentifier: '10005'
    indexedDate: 2020/01/01
    reference: GRCh37
    bodySite: brain
    bodySiteDisplay: Brain
    bodySiteSystem: http://lifeomic.com/codes
    msi: stable
    tmb: low
    tmbScore: 10
    reportFile: reports/test1.pdf
    files:
      - type: copyNumberVariant
        sequenceType: somatic
        fileName: omics/test1.copynumber.csv
      - type: shortVariant
        sequenceType: somatic
        fileName: omics/test1.somatic.vcf.gz
        normalize: true
        passFilter: true
        updateSample: true
      - type: expression
        sequenceType: somatic
        fileName: omics/test1.expression.rgel
      - type: structuralVariant
        sequenceType: somatic
        fileName: omics/test1.structural.csv
      - type: read
        sequenceType: somatic
        fileName: omics/test1.somatic.bam
      - type: read
        sequenceType: somatic
        fileName: omics/test1.rna.bam
      - type: shortVariant
        sequenceType: germline
        fileName: omics/test1.vcf.gz
      - type: read
        sequenceType: germline
        fileName: omics/test1.germline.bam
Best Practices
- When adding files to a project, virtual folders can be created by using the / delimiter in the name of the file. Example: Adding a file with a name of /path/file.txt will make it appear that the file file.txt exists under the path folder when viewed under the Files tab of the LifeOmic Platform web app.
- When submitting the requests to process an omics data file, many of the API requests and CLI commands take a common set of fields that you should try to provide values for:
  - Name (name) - Use a descriptive name here, as this is the value that will show up in many of the user interfaces, such as the Omics Explorer.
  - Test Type (testType|test-type) - Specify the type of test that was performed.
    Note: The Name and Test Type fields are used to help identify an omics test to prevent duplicates should the same omics file be ingested again for a Subject. Be sure to use unique values for these to identify each test for a given Subject. Example: For Foundation Medicine, this could be Heme.
  - Indexed Date (indexedDate|indexed-date) - Specify the actual date that the test was performed or the data was created. The LifeOmic Platform separately captures the date when the data was added.
  - Performer ID (performerId|performer-id) - Specify the ID of a FHIR Organization resource to represent the entity that performed the test or created the data. You can filter subjects by this value later to see which ones had tests performed by a certain provider.
  - Body Site (bodySite|body-site) - Specify a code from a terminology system to identify the body site of the sample that was used to produce the test results.
  - Body Site System (bodySiteSystem|body-site-system) - Specify the terminology system of the body site code.
  - Body Site Display (bodySiteDisplay|body-site-display) - Specify a friendly display value for the body site code.
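Each common field above is written as a pair such as testType|test-type, a camelCase name followed by its hyphenated form. The sketch below derives the hyphenated form from the camelCase name and reports which common fields a request is still missing. Both helper names are hypothetical, and the request is assumed to be a plain dictionary; nothing here calls the API or the CLI.

```python
import re

COMMON_FIELDS = (
    "name", "testType", "indexedDate", "performerId",
    "bodySite", "bodySiteSystem", "bodySiteDisplay",
)


def to_hyphenated(field: str) -> str:
    """Convert a camelCase field name to its hyphenated form."""
    # Insert a hyphen before every capital letter that is not at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "-", field).lower()


def missing_common_fields(request: dict) -> list[str]:
    """List the common fields that have no value in the request."""
    return [field for field in COMMON_FIELDS if not request.get(field)]
```

For example, to_hyphenated("bodySiteSystem") returns "body-site-system", which agrees with the pair listed above.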
Data Sources
Omics supports the following data sources.
Foundation Medicine
Foundation Medicine XML test files can be processed using the following methods:
- Foundation Tasks API
- lo tasks create-foundation-xml-import CLI subcommand
- Omics Dashboard
A single omics test will be created for all variant types found in the XML file. Also, note that the reportFileId|report-file-id option allows a PDF file to be linked to the LifeOmic Platform Subject.
NantOmics
NantOmics test files can be processed using the following methods:
- NantOmics Tasks API
- lo tasks create-nantomics-vcf-import CLI subcommand
- Omics Dashboard
NantOmics tests normally provide separate files for short and structural variants. A separate request has to be made to process each file. The type of data being added is denoted by using the uploadType|upload-type field of the request to specify variant or fnv. A single omics test will be created for the subject from both the short and structural variant files.
Ashion
Ashion GEM ExTra test TAR files can be processed using the following methods:
- Ashion Tasks API
- lo tasks create-ashion-import CLI subcommand
- Omics Dashboard
A single omics test will be created for all the data types found in the GEM ExTra TAR file.
Short Variants
The LifeOmic Platform processes VCF files to add genomic short variants to a project. The LifeOmic Platform runs a normalization process on the VCF to filter out any unsupported regions. You can also specify a list of VCFs to combine; any duplicates are removed. VCFs can be processed using the following methods:
- lo genomics create-genomic-set CLI subcommand
- Python Genomics Module
- Omics Dashboard
Reads
BAM files can be processed by the LifeOmic Platform to add genomic read data to a project. The platform will create an index file for the BAM file. This allows the read data to be fetched and viewed in the web IGV. BAMs can be processed using the following methods:
- lo genomics create-genomic-set CLI subcommand
- Python Genomics Module
- Omics Dashboard
RNA Expression
Upload a CSV file to add RNA expression data to a project. Use the following column schema:
sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm
Expression files can be processed using the following methods:
- lo genomics create-rna-quantification-set CLI subcommand
- LifeOmic Platform Python Genomics Module
- Omics Dashboard
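Before uploading an RNA expression file, it can help to confirm that it parses with the column schema shown above. The following sketch reads the two sample rows using only the Python standard library. The function name is hypothetical, and reading the attributes column with ast.literal_eval is an assumption based on the single-quoted values in the sample rows.

```python
import ast
import csv
import io

SAMPLE_CSV = """\
sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm
"""


def read_expression_rows(text: str) -> list[dict]:
    """Parse expression CSV text into typed dictionaries."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sample_id": record["sample_id"],
            "gene_id": record["gene_id"],
            "gene_name": record["gene_name"],
            "expression": float(record["expression"]),
            "raw_count": int(record["raw_count"]),
            # Assumption: attributes is a single-quoted dictionary literal.
            "attributes": ast.literal_eval(record["attributes"]) if record["attributes"] else {},
            "is_normalized": record["is_normalized"] == "True",
            "expression_unit": record["expression_unit"],
        })
    return rows


rows = read_expression_rows(SAMPLE_CSV)
```

A row that fails the float or int conversion raises ValueError, which points at a malformed value before the file reaches the platform.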
Copy Number Variants
Copy number variants can be added to a project by uploading a CSV file that uses the following column schema:
sample_id,gene,copy_number,status,attributes,chromosome,start_position,end_position,interpretation
sample1,HSD3B2,10.21,amplification,"{'SVTYPE':'<DUP>'}",chr1,119957553,119965658,N/A
sample1,HSD3B1,12.14,amplification,"{'SVTYPE':'<DUP>'}",chr1,120049825,120057677,N/A
File Schema
- sample_id - a required string value
- gene - a required string value
- copy_number - a required double value
- status - an optional string value
- attributes - an optional string value representing a JSON object to store meta data
- chromosome - an optional string value
- start_position - an optional long value, representing start position of the chromosome
- end_position - an optional long value, representing end position of the chromosome
- interpretation - an optional string value
For any optional string value, N/A or . may be used to indicate a missing value, but this is not required.
Copy number variant files can be processed using the following methods:
- lo genomics create-copy-number-set CLI subcommand
- LifeOmic Platform Python Genomics Module
- Omics Dashboard
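The copy number file schema above can be turned into a row-level check. This is a minimal sketch with a hypothetical function name. It assumes each row is a dictionary of strings, such as one produced by csv.DictReader, and it assumes that the N/A and . markers (and an empty value) are also tolerated in the optional position columns, which the schema only states for optional string values.

```python
MISSING_MARKERS = {"", "N/A", "."}


def validate_cnv_row(row: dict) -> list[str]:
    """Return a list of schema problems for one copy number variant row."""
    errors = []
    # sample_id and gene are required strings.
    for column in ("sample_id", "gene"):
        if not row.get(column):
            errors.append(f"{column} is required")

    # copy_number is a required double.
    try:
        float(row.get("copy_number", ""))
    except (TypeError, ValueError):
        errors.append("copy_number must be a number")

    # Positions are optional longs; missing markers are skipped (assumption).
    for column in ("start_position", "end_position"):
        value = row.get(column) or ""
        if value not in MISSING_MARKERS:
            try:
                int(value)
            except ValueError:
                errors.append(f"{column} must be an integer")
    return errors
```

Running the check over every row of a file and collecting the non-empty results gives a quick report of rows to fix before upload.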
Structural Variants
Structural variants can be added to a project by uploading a CSV file that uses the following column schema:
sample_id,gene1,gene2,effect,chromosome1,start_position1,end_position1,chromosome2,start_position2,end_position2,interpretation,sequence_type,in-frame,attributes
sample1,MRGPRF,N/A,translocation,chr11,68773114,68773114,chr17,70134939,70134939,N/A,somatic,N/A,"{'sv_type':'TRANSLOCATION'}"
sample1,DHX34,FUT1,duplication,chr19,47861484,47861484,chr19,49255988,49255988,N/A,somatic,N/A,"{'sv_type':'DUPLICATION'}"
File Schema
- sample_id - a required string value
- gene1 - a required string value; N/A is a logical substitute when one does not exist
- gene2 - a required string value; N/A is a logical substitute when one does not exist
- effect - an optional string value
- chromosome1 - an optional string value
- start_position1 - an optional long value, representing start position of chromosome1
- end_position1 - an optional long value, representing end position of chromosome1
- chromosome2 - an optional string value
- start_position2 - an optional long value, representing start position of chromosome2
- end_position2 - an optional long value, representing end position of chromosome2
- interpretation - an optional string value
- sequence_type - an optional string value
- in-frame - an optional string value
- attributes - an optional string value representing a JSON object to store meta data
NOTE: For any optional string value, N/A or . may be used to indicate a missing value, but this is not required.
Structural variant files can be processed using the following methods:
- lo genomics create-structural-variant-set CLI subcommand
- LifeOmic Platform Python Genomics Module
- Omics Dashboard
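A structural variant file in this column order can be produced with the Python standard library. The sketch below is one possible approach with a hypothetical function name. It writes N/A for every column a record omits, which the note above confirms only for optional string values; treating omitted position columns the same way is an assumption.

```python
import csv
import io

SV_COLUMNS = [
    "sample_id", "gene1", "gene2", "effect",
    "chromosome1", "start_position1", "end_position1",
    "chromosome2", "start_position2", "end_position2",
    "interpretation", "sequence_type", "in-frame", "attributes",
]


def structural_variants_to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text using the structural variant columns."""
    buffer = io.StringIO()
    # restval fills omitted columns with N/A (assumption for non-string columns).
    writer = csv.DictWriter(
        buffer, fieldnames=SV_COLUMNS, restval="N/A", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        if not record.get("sample_id"):
            raise ValueError("sample_id is required")
        # DictWriter raises ValueError for keys that are not in SV_COLUMNS.
        writer.writerow(record)
    return buffer.getvalue()
```

The csv module quotes a value only when it contains a comma, a quote character, or a line break, so an attributes value with several keys is quoted automatically.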
Re-Ingesting
If any of the omics types have already been ingested for a test, they will not be re-ingested; only the types that have not yet been ingested will be processed. Remember that a unique test is identified by several fields: the file itself and the test Name and Test Type fields. To ingest data that has already been ingested, add the optional reIngestFile|re-ingest-file field to the request. The existing omics test will be used, but the file will be fully re-processed.
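The re-ingest rule can be summarized as a small decision function. This is only an illustration of the behavior described above, not platform code: the function and parameter names are hypothetical, and already_ingested stands for the omics types previously ingested for the matching test.

```python
def types_to_process(
    requested: list[str],
    already_ingested: set[str],
    re_ingest_file: bool = False,
) -> list[str]:
    """Return the omics types that would be processed for a request."""
    if re_ingest_file:
        # With the re-ingest option set, the file is fully re-processed.
        return list(requested)
    # Otherwise only the types not yet ingested for this test are processed.
    return [omics_type for omics_type in requested if omics_type not in already_ingested]
```

For a test that already has short variants ingested, a request covering short variants and copy number variants would process only the copy number variants unless the re-ingest option is set.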
FHIR Resources
The following FHIR resources are generated as part of the variant processing workflow: