Data Processing

In addition to storing omics data, the LifeOmic Platform can index it. Once indexed, the data is available for deeper analysis using the Omics Explorer and other parts of the platform.

Definitions

Project - A logical grouping of omics and clinical data within the LifeOmic Platform. A project is a dataset of the subjects assigned to the project.
Reference Genome - A project specifies a reference genome for the data within it. Reference genomes GRCh37 and GRCh38 are currently supported in the LifeOmic Platform. It is not possible to mix data from different genomes within the same project.
Omics Test - A logical grouping of omics data records that represent a single test or event.
Subject - The biological source of omics data, such as patients involved in a study.

Overview

The basic process for adding omics data to the LifeOmic Platform involves the following steps:

Upload the omics data file to the LifeOmic Platform. See "How do I add files to the LifeOmic Platform?".
Initiate the proper indexing action for the omics data type, which involves associating the omics data with a Subject record in the LifeOmic Platform, or upload a manifest file that defines the genomic test and associates the files with a subject. See the Omics Test Manifest File section for more information.
The LifeOmic Platform processes the file. During this time, the data is transformed and indexed to make it highly available and queryable. For certain data types, external knowledge bases are brought in and the omics data is annotated with additional information. For example, short variant records are annotated with information from sources such as ClinVar and dbSNP.
After the data processing completes, you can view the omics data in Omics Explorer or act on the data with a feature, such as Tasks.

Processing Time Frames

The time frame for omics data to be available in the platform is dependent upon several factors:

Number of omics files and overall size of the data being ingested at the same time
Current overall size of the omics data in a project. Indexing times can increase as the size of the project increases.

Omics Test Manifest File

Using a manifest file allows you to initiate the indexing process by uploading a file to a project instead of requiring a command. The manifest file describes the omics test by listing the files included in the test and referencing a patient.

The name of the manifest file must end with a .ga4gh.yml extension. The file must be located within the project in the .lifeomic folder or a subfolder within the .lifeomic folder. Examples of valid file names are: .lifeomic/manifests/test1.ga4gh.yml and .lifeomic/test2.ga4gh.yml.
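
The naming rules above can be checked locally before a manifest is uploaded. The following is a minimal sketch, not part of any LifeOmic tool: the function name is_valid_manifest_path is hypothetical, and it assumes the path is given relative to the project root, as in the examples above, with an optional leading / tolerated.

```python
from pathlib import PurePosixPath

MANIFEST_SUFFIX = ".ga4gh.yml"
MANIFEST_ROOT = ".lifeomic"


def is_valid_manifest_path(file_name: str) -> bool:
    """Return True if file_name follows the manifest naming rules described above."""
    # Assumption: a leading "/" in a project file name is ignored.
    parts = PurePosixPath(file_name.lstrip("/")).parts
    # The file must sit in the .lifeomic folder or one of its subfolders.
    if len(parts) < 2 or parts[0] != MANIFEST_ROOT:
        return False
    name = parts[-1]
    # The name must end with the extension and have something before it.
    return name.endswith(MANIFEST_SUFFIX) and len(name) > len(MANIFEST_SUFFIX)


for candidate in (".lifeomic/manifests/test1.ga4gh.yml", "manifests/test1.ga4gh.yml"):
    print(candidate, is_valid_manifest_path(candidate))
```

The first path is accepted and the second is rejected because it is outside the .lifeomic folder.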

Patient Matching

Matching a manifest file to a patient in the LifeOmic Platform can be done in one of two ways:

Provide the Patient FHIR resource ID in the patientId property.
Provide at least two of the following three properties:
- patientDOB
- patientLastName
- patientIdentifier

If all three properties are provided, a match is attempted using all three. If a match with all three cannot be found, then a match using a combination of two of the properties is attempted. If only two properties are provided in the manifest file, then a match is attempted using just the two properties.
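
The matching rules above can be sketched as a function that lists which property combinations would be tried for a given test entry. This is only an illustration of the described behavior, not the platform's implementation. The function name match_attempts is hypothetical, and two details are assumptions the text does not specify: that patientId takes precedence when present, and the order in which the two-property combinations are tried.

```python
from itertools import combinations
from typing import List, Tuple

MATCH_PROPERTIES = ("patientDOB", "patientLastName", "patientIdentifier")


def match_attempts(test: dict) -> List[Tuple[str, ...]]:
    """List the property combinations a match would be attempted with, in order."""
    # Assumption: a FHIR resource ID, when given, is used on its own.
    if test.get("patientId"):
        return [("patientId",)]
    present = tuple(p for p in MATCH_PROPERTIES if test.get(p))
    if len(present) < 2:
        raise ValueError(
            "Provide patientId or at least two of: " + ", ".join(MATCH_PROPERTIES)
        )
    if len(present) == 2:
        return [present]
    # All three provided: try the full set first, then each pair.
    # Assumption: the order among the pairs is arbitrary here.
    return [present] + list(combinations(present, 2))


print(match_attempts({"patientDOB": "1970-01-01", "patientIdentifier": "10005"}))
```

A manifest entry with fewer than two of the three properties and no patientId raises a ValueError in this sketch, mirroring the requirement stated above.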

The schema of the manifest file must match the following:

tests - List of genomic tests. It is possible to have multiple tests in order to index a batch at one time.
- name - A friendly name for the genomic test (required)
- testType - The type of test that was performed (required)
- reference - The reference build. Must be GRCh37 or GRCh38 and must match the reference build configured on the project. You cannot add tests from different reference builds in the same project. (required)
- patientIdentifier - The patient identifier for the test. This can be an MRN or some other ID associated with the patient record (required with patientDOB or patientLastName)
- patientLastName - The patient's last name (required with patientDOB or patientIdentifier)
- patientDOB - The patient's date of birth (required with patientLastName or patientIdentifier)
- patientId - The patient's FHIR resource ID (required if patientLastName, patientIdentifier, and patientDOB are not present)
- patientInfo - Additional patient information that can be used to create the Patient FHIR resource if it does not already exist
  - firstName - Patient first name
  - lastName - Patient last name
  - dob - Patient date of birth
  - gender - Patient gender
  - identifiers - List of patient identifiers
    - value - The identifier value
    - system - The namespace for the identifier value
    - codingSystem - Identity of the terminology system
    - codingCode - Code symbol defined by the system
- indexedDate - The date the test was performed (defaults to current date if not provided)
- performerId - The ID of the FHIR Organization resource that performed the test or created the data.
- bodySite - Coded value for the body site where the test specimen was taken from
- bodySiteSystem - Identity of the terminology system for the body site value
- bodySiteDisplay - Friendly display name for body site value
- sourceFile - The name of the file that was used to generate the genomic test files
- reportFile - The name of a report file to associate with the genomic test
- msi - For a somatic test, the Microsatellite Instability result. Must be low, stable, high, or indeterminate
- tmb - For a somatic test, the Tumor Mutation Burden result. Must be low, high, indeterminate, or unknown
- tmbScore - For a somatic test, the Tumor Mutation Burden numeric score
- files - List of genomic files to include with the test (required)
  - type - Valid values are shortVariant, read, expression, copyNumberVariant, or structuralVariant (required)
  - sequenceType - Valid values are germline, somatic, metastatic, ctDNA, or rna (required)
  - fileName - The name of the file in the LifeOmic Platform Project (required)
  - name - A display name to give the test file. Defaults to the name of the file
  - normalize - For a shortVariant VCF file, set to true if the data should be normalized by the LifeOmic Platform
  - passOnly - For a shortVariant VCF file, set to true to exclude variants that do not have a filter value of PASS
  - updateSample - For a shortVariant VCF file, set to true to generate a unique sample name. Defaults to the value in the VCF
- resources - An optional list of files that will be added to the subject's Document References
  - fileName - The name of the file in the LifeOmic Platform Project (required)
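
As a rough illustration of the schema above, the following sketch checks one entry of the tests list after it has been parsed into a Python dictionary. It is a local sanity check under stated assumptions, not the platform's validator: the function name validate_test is hypothetical, only the required fields and enumerated values listed above are checked, and the patient matching properties are deliberately left out.

```python
from typing import List

REQUIRED_TEST_FIELDS = ("name", "testType", "reference", "files")
REQUIRED_FILE_FIELDS = ("type", "sequenceType", "fileName")
REFERENCES = {"GRCh37", "GRCh38"}
FILE_TYPES = {"shortVariant", "read", "expression", "copyNumberVariant", "structuralVariant"}
SEQUENCE_TYPES = {"germline", "somatic", "metastatic", "ctDNA", "rna"}
MSI_VALUES = {"low", "stable", "high", "indeterminate"}
TMB_VALUES = {"low", "high", "indeterminate", "unknown"}


def _check_choice(errors, where, field, value, allowed):
    # Only report a value that is present but not in the allowed set.
    if value and value not in allowed:
        errors.append(f"{where}{field} must be one of {sorted(allowed)}, got {value!r}")


def validate_test(test: dict) -> List[str]:
    """Return a list of problems found in one manifest test entry."""
    errors: List[str] = []
    for field in REQUIRED_TEST_FIELDS:
        if not test.get(field):
            errors.append(f"missing required field: {field}")
    _check_choice(errors, "", "reference", test.get("reference"), REFERENCES)
    _check_choice(errors, "", "msi", test.get("msi"), MSI_VALUES)
    _check_choice(errors, "", "tmb", test.get("tmb"), TMB_VALUES)
    for index, entry in enumerate(test.get("files") or []):
        where = f"files[{index}]."
        for field in REQUIRED_FILE_FIELDS:
            if not entry.get(field):
                errors.append(f"missing required field: {where}{field}")
        _check_choice(errors, where, "type", entry.get("type"), FILE_TYPES)
        _check_choice(errors, where, "sequenceType", entry.get("sequenceType"), SEQUENCE_TYPES)
    return errors


print(validate_test({"name": "Test", "reference": "hg19", "files": []}))
```

An empty list means no problems were found by these checks; it does not guarantee that the platform will accept the manifest.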

Here is an example manifest file for a test that includes files for all omics types.

---
tests:
    - name: Big Genomic Test
      testType: Germline/Somatic Combo
      patientIdentifier: '10005'
      indexedDate: 2020/01/01
      reference: GRCh37
      bodySite: brain
      bodySiteDisplay: Brain
      bodySiteSystem: http://lifeomic.com/codes
      msi: stable
      tmb: low
      tmbScore: 10
      reportFile: reports/test1.pdf
      files:
          - type: copyNumberVariant
            sequenceType: somatic
            fileName: omics/test1.copynumber.csv
          - type: shortVariant
            sequenceType: somatic
            fileName: omics/test1.somatic.vcf.gz
            normalize: true
            passFilter: true
            updateSample: true
          - type: expression
            sequenceType: somatic
            fileName: omics/test1.expression.rgel
          - type: structuralVariant
            sequenceType: somatic
            fileName: omics/test1.structural.csv
          - type: read
            sequenceType: somatic
            fileName: omics/test1.somatic.bam
          - type: read
            sequenceType: somatic
            fileName: omics/test1.rna.bam
          - type: shortVariant
            sequenceType: germline
            fileName: omics/test1.vcf.gz
          - type: read
            sequenceType: germline
            fileName: omics/test1.germline.bam

Best Practices

When adding files to a project, virtual folders can be created by using the / delimiter in the name of the file. Example: Adding a file with a name of /path/file.txt will make it appear that the file file.txt exists under the path folder when viewed under the Files tab of the LifeOmic Platform web app.
When submitting the requests to process an omics data file, many of the API requests and CLI commands take a common set of fields that you should try to provide values for:
- Name (name) - Use a descriptive name here as this is the value that will show up in many of the user interfaces like the Omics Explorer.
- Test Type (testType|test-type) - Specify the type of test that was performed.
  
  note
  The Name and Test Type fields help identify an omics test and prevent duplicates if the same omics file is ingested again for a Subject. Use unique values for these fields to identify each test for a given Subject. Example: For Foundation Medicine, this could be Heme.
- Indexed Date (indexedDate|indexed-date) - Specify the actual date that the test was performed or the data was created. The LifeOmic Platform will later capture the dates when data was added.
- Performer ID (performerId|performer-id) - Specify the ID of a FHIR Organization resource to represent the entity that performed the test or created the data. You can filter subjects by this value later to see which ones had tests performed by a certain provider.
- Body Site (bodySite|body-site) - Specify a code from a terminology system to identify the body site of the sample that was used to produce the test results.
- Body Site System (bodySiteSystem|body-site-system) - Specify the terminology system of the body site code.
- Body Site Display (bodySiteDisplay|body-site-display) - Specify a friendly display value for body site code.

Data Sources

Omics supports the following data sources.

Foundation Medicine

Foundation Medicine XML test files can be processed using the following methods:

Foundation Tasks API
lo tasks create-foundation-xml-import CLI subcommand
Omics Dashboard

A single omics test will be created for all variant types found in the XML file. Also, note that the reportFileId|report-file-id option allows a PDF file to be linked to the LifeOmic Platform Subject.

NantOmics

NantOmics test files can be processed using the following methods:

NantOmics Tasks API
lo tasks create-nantomics-vcf-import CLI subcommand
Omics Dashboard

NantOmics tests normally provide separate files for short and structural variants. A separate request has to be made to process each file. The type of data being added is denoted by using the uploadType|upload-type field of the request to specify variant or fnv. A single omics test will be created for the subject from both the short and structural variant files.

Ashion

Ashion GEM ExTra test TAR files can be processed using the following methods:

Ashion Tasks API
lo tasks create-ashion-import CLI subcommand
Omics Dashboard

A single omics test will be created for all the data types found in the GEM ExTra TAR file.

Short Variants

The LifeOmic Platform processes VCF files to add genomic short variants to a project. The platform runs a normalization process on the VCF to filter out any unsupported regions. You can also specify a list of VCFs to combine; any duplicates are removed. VCFs can be processed using the following methods:

lo genomics create-genomic-set CLI subcommand
Python Genomics Module
Omics Dashboard

Reads

BAM files can be processed by the LifeOmic Platform to add genomic read data to a project. The platform will create an index file for the BAM file. This allows the read data to be fetched and viewed in the web IGV. BAMs can be processed using the following methods:

lo genomics create-genomic-set CLI subcommand
Python Genomics Module
Omics Dashboard

RNA Expression

Upload a CSV file to add RNA expression data to a project. Use the following column schema:

sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm
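
A file in this layout can be produced with Python's standard csv module. The sketch below is one possible approach, not an official tool: the helper names format_attributes and write_expression_csv are hypothetical, and the single-quoted attributes style simply mirrors the sample rows above.

```python
import csv
import io

EXPRESSION_COLUMNS = [
    "sample_id", "gene_id", "gene_name", "expression",
    "raw_count", "attributes", "is_normalized", "expression_unit",
]


def format_attributes(attributes: dict) -> str:
    """Render attributes in the single-quoted style used by the sample rows."""
    return "{" + ",".join(f"'{k}':'{v}'" for k, v in attributes.items()) + "}"


def write_expression_csv(rows, out) -> None:
    """Write expression rows (dictionaries keyed by column name) to a file object."""
    writer = csv.DictWriter(out, fieldnames=EXPRESSION_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        record = dict(row)
        record["attributes"] = format_attributes(row.get("attributes", {}))
        writer.writerow(record)


buffer = io.StringIO()
write_expression_csv(
    [{
        "sample_id": "sample1", "gene_id": "MT-TP", "gene_name": "MT-TP",
        "expression": 37.4555, "raw_count": 41,
        "attributes": {"effectiveLength": "12", "length": "68"},
        "is_normalized": True, "expression_unit": "tpm",
    }],
    buffer,
)
print(buffer.getvalue())
```

The csv module quotes the attributes field automatically when it contains a comma, which matches the quoting in the sample rows.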

Expression files can be processed using the following methods:

lo genomics create-rna-quantification-set CLI subcommand
LifeOmic Platform Python Genomics Module
Omics Dashboard

Copy Number Variants

Copy number variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene,copy_number,status,attributes,chromosome,start_position,end_position,interpretation
sample1,HSD3B2,10.21,amplification,"{'SVTYPE':'<DUP>'}",chr1,119957553,119965658,N/A
sample1,HSD3B1,12.14,amplification,"{'SVTYPE':'<DUP>'}",chr1,120049825,120057677,N/A

File Schema

sample_id - a required string value
gene - a required string value
copy_number - a required double value
status - an optional string value
attributes - an optional string value representing a JSON object to store metadata
chromosome - an optional string value
start_position - an optional long value representing the start position on the chromosome
end_position - an optional long value representing the end position on the chromosome
interpretation - an optional string value

note

For any optional string value, N/A or . may be used to indicate a missing value, but this is not required.
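
The file schema above can be checked locally before upload. The following is a minimal sketch rather than the platform's own validation: the function names are hypothetical, and it assumes that an empty cell is how a missing optional long value is written, which the schema does not state.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "gene", "copy_number")


def check_cnv_row(row: dict) -> list:
    """Return a list of problems for one copy number variant row."""
    errors = []
    for column in ("sample_id", "gene"):
        if not (row.get(column) or "").strip():
            errors.append(f"{column} is required")
    try:
        float(row.get("copy_number") or "")
    except ValueError:
        errors.append("copy_number must be a number")
    for column in ("start_position", "end_position"):
        value = (row.get(column) or "").strip()
        if value:  # Assumption: an empty cell means the value is missing.
            try:
                int(value)
            except ValueError:
                errors.append(f"{column} must be an integer")
    return errors


def check_cnv_csv(text: str) -> list:
    """Return (line_number, errors) pairs for every row with problems."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, [f"missing column: {c}" for c in missing])]
    problems = []
    for line_number, row in enumerate(reader, start=2):
        errors = check_cnv_row(row)
        if errors:
            problems.append((line_number, errors))
    return problems


sample = (
    "sample_id,gene,copy_number,status,attributes,chromosome,"
    "start_position,end_position,interpretation\n"
    "sample1,HSD3B2,10.21,amplification,\"{'SVTYPE':'<DUP>'}\",chr1,119957553,119965658,N/A\n"
)
print(check_cnv_csv(sample))
```

An empty result means the rows passed these type checks only; the attributes column is not parsed in this sketch.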

Copy number variant files can be processed using the following methods:

lo genomics create-copy-number-set CLI subcommand
LifeOmic Platform Python Genomics Module
Omics Dashboard

Structural Variants

Structural variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene1,gene2,effect,chromosome1,start_position1,end_position1,chromosome2,start_position2,end_position2,interpretation,sequence_type,in-frame,attributes
sample1,MRGPRF,N/A,translocation,chr11,68773114,68773114,chr17,70134939,70134939,N/A,somatic,N/A,"{'sv_type':'TRANSLOCATION'}"
sample1,DHX34,FUT1,duplication,chr19,47861484,47861484,chr19,49255988,49255988,N/A,somatic,N/A,"{'sv_type':'DUPLICATION'}"

File Schema

sample_id - a required string value
gene1 - a required string value; use N/A when one does not exist
gene2 - a required string value; use N/A when one does not exist
effect - an optional string value
chromosome1 - an optional string value
start_position1 - an optional long value, representing start position of chromosome1
end_position1 - an optional long value, representing end position of chromosome1
chromosome2 - an optional string value
start_position2 - an optional long value, representing start position of chromosome2
end_position2 - an optional long value, representing end position of chromosome2
interpretation - an optional string value
sequence_type - an optional string value
in-frame - an optional string value
attributes - an optional string value representing a JSON object to store metadata

note

For any optional string value, N/A or . may be used to indicate a missing value, but this is not required.

Structural variant files can be processed using the following methods:

lo genomics create-structural-variant-set CLI subcommand
LifeOmic Platform Python Genomics Module
Omics Dashboard

Re-Ingesting

If any of the omics types have already been ingested for a test, they are not re-ingested; only the types that have not yet been ingested are processed. (Remember that a unique test is identified by several fields: the file itself and the test Name and Test Type fields.) To re-ingest data that has already been ingested, add the optional reIngestFile|re-ingest-file field to the request. The existing omics test is used, but the file is fully re-processed.
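
The rule above can be summarized in a few lines of code. This is only a sketch of the described behavior, assuming the caller already knows which omics types were ingested for the test; the function name types_to_process and its parameters are hypothetical and do not correspond to a LifeOmic API.

```python
def types_to_process(file_types, ingested_types, re_ingest_file=False):
    """Return the omics types in a file that would be processed for a test."""
    if re_ingest_file:
        # With re-ingestion requested, the file is fully re-processed.
        return list(file_types)
    # Otherwise only types not yet ingested for the test are processed.
    return [t for t in file_types if t not in ingested_types]


print(types_to_process(["shortVariant", "copyNumberVariant"], {"shortVariant"}))
```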

FHIR Resources

The following FHIR resources are generated as part of the variant processing workflow:
