Data Processing

With the LifeOmic Platform, in addition to being able to store omics data, you can index it. Once indexed, the data is available for deeper analysis using the Omics Explorer and other parts of the platform.

Definitions

  • Project - A logical grouping of omics and clinical data within the LifeOmic Platform. A project is a dataset of the subjects assigned to the project.
  • Reference Genome - A project specifies a reference genome for the data within it. Reference genomes GRCh37 and GRCh38 are currently supported in the LifeOmic Platform. It is not possible to mix data from different genomes within the same project.
  • Omics Test - A logical grouping of omics data records that represent a single test or event.
  • Subject - The biological source of omics data, such as patients involved in a study.

Overview

The basic process for adding omics data to the LifeOmic Platform involves the following steps:

  1. Upload the omics data file to the LifeOmic Platform. See How do I add files to the LifeOmic Platform?.
  2. Either initiate the indexing action appropriate for the omics data type, which associates the omics data with a Subject record in the LifeOmic Platform, or upload a manifest file that defines the genomic test and associates the files with a subject. See the Omics Test Manifest File section for more information.
  3. The LifeOmic Platform processes the file. During this time, the data is transformed and indexed to make it highly available and queryable. For certain data types, external knowledge bases are brought in and the omics data is annotated with additional information. For example, short variant records are annotated with information from sources such as ClinVar and dbSNP.
  4. After the data processing completes, you can view the omics data in Omics Explorer or act on the data with a feature, such as Tasks.

Processing Time Frames

The time frame for omics data to be available in the platform is dependent upon several factors:

  • Number of omics files and overall size of the data being ingested at the same time
  • Current overall size of the omics data in a project. Indexing times can increase as the size of the project increases.

Omics Test Manifest File

Using a manifest file allows you to initiate the indexing process by uploading a file to a project instead of requiring a command. The manifest file describes the omics test by listing the files included in the test and referencing a patient.

The name of the manifest file must end with the .ga4gh.yml extension. The file must be located within the project in the .lifeomic folder or a subfolder within the .lifeomic folder. Examples of valid file names are: .lifeomic/manifests/test1.ga4gh.yml and .lifeomic/test2.ga4gh.yml.
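
The naming rules above can be checked before upload. The following is a minimal sketch, not platform code: the function name is_valid_manifest_path is hypothetical, and ignoring a leading / in the path is an assumption.

```python
from pathlib import PurePosixPath

MANIFEST_SUFFIX = ".ga4gh.yml"
MANIFEST_ROOT = ".lifeomic"


def is_valid_manifest_path(path: str) -> bool:
    """Return True if a project file path follows the manifest naming rules."""
    # Assumption: a leading "/" in the project file name is ignored.
    parts = PurePosixPath(path.lstrip("/")).parts
    # The file must sit in .lifeomic itself or in one of its subfolders.
    in_manifest_folder = len(parts) >= 2 and parts[0] == MANIFEST_ROOT
    return in_manifest_folder and parts[-1].endswith(MANIFEST_SUFFIX)
```

For instance, is_valid_manifest_path(".lifeomic/manifests/test1.ga4gh.yml") is accepted, while a file outside the .lifeomic folder or one ending in plain .yml is rejected.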

Patient Matching

Matching a manifest file to a patient in the LifeOmic Platform can be done in one of two ways:

  • Provide the Patient FHIR resource ID in the patientId property.
  • Provide at least two of the following three properties:
    • patientDOB
    • patientLastName
    • patientIdentifier

If all three properties are provided, a match is attempted using all three. If a match with all three cannot be found, then a match using a combination of two of the properties is attempted. If only two properties are provided in the manifest file, then a match is attempted using just the two properties.
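
The fallback order described above can be sketched as follows. This only illustrates which property combinations would be tried; it does not query the platform. The function name match_attempts is hypothetical, and two details are assumptions: that patientId takes precedence when it is present, and the order in which the two-property combinations are tried.

```python
from itertools import combinations
from typing import List, Tuple

MATCH_PROPERTIES = ("patientDOB", "patientLastName", "patientIdentifier")


def match_attempts(test: dict) -> List[Tuple[str, ...]]:
    """List the property combinations a match would be attempted with, in order."""
    # Assumption: an explicit FHIR resource ID wins over the other properties.
    if test.get("patientId"):
        return [("patientId",)]
    present = tuple(p for p in MATCH_PROPERTIES if test.get(p))
    if len(present) < 2:
        raise ValueError(
            "Provide patientId or at least two of: " + ", ".join(MATCH_PROPERTIES)
        )
    attempts = [present]
    if len(present) == 3:
        # Fall back to each pair; the pair order here is an assumption.
        attempts.extend(combinations(present, 2))
    return attempts
```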

The schema of the manifest file must match the following:

  • tests - List of genomic tests. It is possible to have multiple tests in order to index a batch at one time.
    • name - A friendly name for the genomic test (required)
    • testType - The type of test that was performed (required)
    • reference - The reference build. Must be GRCh37 or GRCh38 and must match the reference build configured on the project. You cannot add tests from different reference builds in the same project. (required)
    • patientIdentifier - The patient identifier for the test. This can be an MRN or some other ID associated with the patient record (required with patientDOB or patientLastName)
    • patientLastName - The patient's last name (required with patientDOB or patientIdentifier)
    • patientDOB - The patient's date of birth (required with patientLastName or patientIdentifier)
    • patientId - The patient's FHIR resource ID (required if patientLastName, patientIdentifier, and patientDOB are not present)
    • patientInfo - Additional patient information that can be used to create the Patient FHIR resource if it does not already exist
      • firstName - Patient first name
      • lastName - Patient last name
      • dob - Patient date of birth
      • gender - Patient gender
      • identifiers - List of patient identifiers
        • value - The identifier value
        • system - The namespace for the identifier value
        • codingSystem - Identity of the terminology system
        • codingCode - Code symbol defined by the system
    • indexedDate - The date the test was performed (defaults to current date if not provided)
    • performerId - The ID of the FHIR Organization resource that performed the test or created the data.
    • bodySite - Coded value for the body site where the test specimen was taken from
    • bodySiteSystem - Identity of the terminology system for the body site value
    • bodySiteDisplay - Friendly display name for body site value
    • sourceFile - The name of the file that was used to generate the genomic test files
    • reportFile - The name of a report file to associate with the genomic test
    • msi - For a somatic test, the Microsatellite Instability result. Must be low, stable, high, or indeterminate
    • tmb - For a somatic test, the Tumor Mutation Burden result. Must be low, high, indeterminate, or unknown
    • tmbScore - For a somatic test, the Tumor Mutation Burden numeric score
    • files - List of genomic files to include with the test (required)
      • type - Valid values are shortVariant, read, expression, copyNumberVariant, or structuralVariant (required)
      • sequenceType - Valid values are germline, somatic, metastatic, ctDNA, or rna (required)
      • fileName - The name of the file in the LifeOmic Platform Project (required)
      • name - A display name to give the test file. Defaults to the name of the file
      • normalize - For a shortVariant VCF file, set to true if the data should be normalized by the LifeOmic Platform
      • passOnly - For a shortVariant VCF file, set to true to exclude variants that do not have a filter value of PASS
      • updateSample - For a shortVariant VCF file, set to true to generate a unique sample name. Defaults to the value in the VCF
    • resources - An optional list of files that will be added to the subject's Document References
      • fileName - The name of the file in the LifeOmic Platform Project (required)
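
A manifest can be checked against the required fields and allowed values listed above before it is uploaded. The sketch below works on a manifest that has already been loaded into a Python dict (for example with a YAML parser). The function names are hypothetical, and the check covers only required fields and enumerated values, not patient matching.

```python
REQUIRED_TEST_FIELDS = ("name", "testType", "reference", "files")
REQUIRED_FILE_FIELDS = ("type", "sequenceType", "fileName")
ALLOWED_TEST_VALUES = {
    "reference": {"GRCh37", "GRCh38"},
    "msi": {"low", "stable", "high", "indeterminate"},
    "tmb": {"low", "high", "indeterminate", "unknown"},
}
ALLOWED_FILE_VALUES = {
    "type": {"shortVariant", "read", "expression",
             "copyNumberVariant", "structuralVariant"},
    "sequenceType": {"germline", "somatic", "metastatic", "ctDNA", "rna"},
}


def _check(record: dict, required, allowed: dict, where: str) -> list:
    """Collect missing-field and invalid-value messages for one record."""
    errors = [f"{where}: missing required field '{f}'"
              for f in required if f not in record]
    for field, values in allowed.items():
        if field in record and record[field] not in values:
            errors.append(f"{where}: '{field}' must be one of {sorted(values)}")
    return errors


def validate_manifest(manifest: dict) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    if "tests" not in manifest:
        return ["manifest: missing 'tests' list"]
    errors = []
    for i, test in enumerate(manifest["tests"]):
        where = f"tests[{i}]"
        errors += _check(test, REQUIRED_TEST_FIELDS, ALLOWED_TEST_VALUES, where)
        for j, entry in enumerate(test.get("files", [])):
            errors += _check(entry, REQUIRED_FILE_FIELDS, ALLOWED_FILE_VALUES,
                             f"{where}.files[{j}]")
    return errors
```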

Here is an example manifest file for a test that includes files for all omics types.

---
tests:
  - name: Big Genomic Test
    testType: Germline/Somatic Combo
    patientIdentifier: '10005'
    indexedDate: 2020/01/01
    reference: GRCh37
    bodySite: brain
    bodySiteDisplay: Brain
    bodySiteSystem: http://lifeomic.com/codes
    msi: stable
    tmb: low
    tmbScore: 10
    reportFile: reports/test1.pdf
    files:
      - type: copyNumberVariant
        sequenceType: somatic
        fileName: omics/test1.copynumber.csv
      - type: shortVariant
        sequenceType: somatic
        fileName: omics/test1.somatic.vcf.gz
        normalize: true
        passFilter: true
        updateSample: true
      - type: expression
        sequenceType: somatic
        fileName: omics/test1.expression.rgel
      - type: structuralVariant
        sequenceType: somatic
        fileName: omics/test1.structural.csv
      - type: read
        sequenceType: somatic
        fileName: omics/test1.somatic.bam
      - type: read
        sequenceType: somatic
        fileName: omics/test1.rna.bam
      - type: shortVariant
        sequenceType: germline
        fileName: omics/test1.vcf.gz
      - type: read
        sequenceType: germline
        fileName: omics/test1.germline.bam

Best Practices

  • When adding files to a project, virtual folders can be created by using the / delimiter in the name of the file. Example: Adding a file with a name of /path/file.txt will make it appear that the file file.txt exists under the path folder when viewed under the Files tab of the LifeOmic Platform web app.
  • When submitting the requests to process an omics data file, many of the API requests and CLI commands take a common set of fields that you should try to provide values for:
    • Name (name) - Use a descriptive name here as this is the value that will show up in many of the user interfaces like the Omics Explorer.

    • Test Type (testType|test-type) - Specify the type of test that was performed.

      note

      The Name and Test Type fields are used to help identify an omics test and prevent duplicates if the same omics file is ingested again for a Subject. Use unique values for these fields to identify each test for a given Subject. Example: For Foundation Medicine, the Test Type could be Heme.

    • Indexed Date (indexedDate|indexed-date) - Specify the actual date that the test was performed or the data was created. The LifeOmic Platform separately captures the date when the data was added.

    • Performer ID (performerId|performer-id) - Specify the ID of a FHIR Organization resource to represent the entity that performed the test or created the data. You can filter subjects by this value later to see which ones had tests performed by a certain provider.

    • Body Site (bodySite|body-site) - Specify a code from a terminology system to identify the body site of the sample that was used to produce the test results.

    • Body Site System (bodySiteSystem|body-site-system) - Specify the terminology system of the body site code.

    • Body Site Display (bodySiteDisplay|body-site-display) - Specify a friendly display value for body site code.

Data Sources

Omics supports the following data sources.

Foundation Medicine

Foundation Medicine XML test files can be processed using the following methods:

A single omics test will be created for all variant types found in the XML file. Also, note that the reportFileId|report-file-id option allows a PDF file to be linked to the LifeOmic Platform Subject.

NantOmics

NantOmics test files can be processed using the following methods:

NantOmics tests normally provide separate files for short and structural variants, and a separate request must be made to process each file. Use the uploadType|upload-type field of the request to specify the type of data being added: variant or fnv. A single omics test will be created for the subject from both the short and structural variant files.

Ashion

Ashion GEM ExTra test TAR files can be processed using the following methods:

A single omics test will be created for all the data types found in the GEM ExTra TAR file.

Short Variants

The LifeOmic Platform processes VCF files to add genomic short variants to a project. The LifeOmic Platform runs a normalization process on the VCF to filter out any unsupported regions. You can also specify a list of VCFs to combine, in which case any duplicates are removed. VCFs can be processed using the following methods:

Reads

BAM files can be processed by the LifeOmic Platform to add genomic read data to a project. The platform will create an index file for the BAM file. This allows the read data to be fetched and viewed in the web IGV. BAMs can be processed using the following methods:

RNA Expression

Upload a CSV file to add RNA expression data to a project. Use the following column schema:

sample_id,gene_id,gene_name,expression,raw_count,attributes,is_normalized,expression_unit
sample1,MT-TP,MT-TP,37.4555,41,"{'effectiveLength':'12','length':'68'}",True,tpm
sample1,MT-CYB,MT-CYB,4862.07,455676,"{'effectiveLength':'1027.42','length':'1141'}",True,tpm

Expression files can be processed using the following methods:

Copy Number Variants

Copy number variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene,copy_number,status,attributes,chromosome,start_position,end_position,interpretation
sample1,HSD3B2,10.21,amplification,"{'SVTYPE':'<DUP>'}",chr1,119957553,119965658,N/A
sample1,HSD3B1,12.14,amplification,"{'SVTYPE':'<DUP>'}",chr1,120049825,120057677,N/A

File Schema

  • sample_id - a required string value
  • gene - a required string value
  • copy_number - a required double value
  • status - an optional string value
  • attributes - an optional string value representing a JSON object that stores metadata
  • chromosome - an optional string value
  • start_position - an optional long value, the start position on the chromosome
  • end_position - an optional long value, the end position on the chromosome
  • interpretation - an optional string value

note

For any optional string value, N/A or . may be used to indicate a missing value, but neither is required.

Copy number variant files can be processed using the following methods:

Structural Variants

Structural variants can be added to a project by uploading a CSV file that uses the following column schema:

sample_id,gene1,gene2,effect,chromosome1,start_position1,end_position1,chromosome2,start_position2,end_position2,interpretation,sequence_type,in-frame,attributes
sample1,MRGPRF,N/A,translocation,chr11,68773114,68773114,chr17,70134939,70134939,N/A,somatic,N/A,"{'sv_type':'TRANSLOCATION'}"
sample1,DHX34,FUT1,duplication,chr19,47861484,47861484,chr19,49255988,49255988,N/A,somatic,N/A,"{'sv_type':'DUPLICATION'}"

File Schema

  • sample_id - a required string value
  • gene1 - a required string value; use N/A when no gene applies
  • gene2 - a required string value; use N/A when no gene applies
  • effect - an optional string value
  • chromosome1 - an optional string value
  • start_position1 - an optional long value, the start position on chromosome1
  • end_position1 - an optional long value, the end position on chromosome1
  • chromosome2 - an optional string value
  • start_position2 - an optional long value, the start position on chromosome2
  • end_position2 - an optional long value, the end position on chromosome2
  • interpretation - an optional string value
  • sequence_type - an optional string value
  • in-frame - an optional string value
  • attributes - an optional string value representing a JSON object that stores metadata

note

For any optional string value, N/A or . may be used to indicate a missing value, but neither is required.
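
A structural variant file with this column order can be produced with the Python standard library. The sketch below fills every column that a record omits with N/A. The function name write_structural_variants is hypothetical, and whether the platform accepts N/A in the optional position columns (rather than only in string columns) is an assumption.

```python
import csv
import io

SV_COLUMNS = [
    "sample_id", "gene1", "gene2", "effect",
    "chromosome1", "start_position1", "end_position1",
    "chromosome2", "start_position2", "end_position2",
    "interpretation", "sequence_type", "in-frame", "attributes",
]


def write_structural_variants(records: list) -> str:
    """Render records as CSV text in the structural variant column order."""
    buffer = io.StringIO()
    # restval supplies N/A for any column a record leaves out; unknown
    # keys raise ValueError because extrasaction defaults to "raise".
    writer = csv.DictWriter(buffer, fieldnames=SV_COLUMNS,
                            restval="N/A", lineterminator="\n")
    writer.writeheader()
    for record in records:
        if not record.get("sample_id"):
            raise ValueError("sample_id is required")
        writer.writerow(record)
    return buffer.getvalue()
```

The returned text can then be saved to a .csv file and added to the project like any other omics file.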

Structural variant files can be processed using the following methods:

Re-Ingesting

If an omics type has already been ingested for a test, it will not be re-ingested; only the types that have not yet been ingested will be processed. Remember that a unique test is identified by several fields: the file itself and the test Name and Test Type fields. To re-process data that has already been ingested, add the optional reIngestFile|re-ingest-file field to the request. The existing omics test will be used, but the file will be fully re-processed.
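
The re-ingest rule above amounts to a set difference keyed on the test identity. The following sketch illustrates that decision only; it is not the platform's implementation, and the names plan_ingest and TestKey, as well as the in-memory dict of already ingested types, are hypothetical.

```python
from typing import Dict, Set, Tuple

# A test is identified by the file, the test Name, and the Test Type.
TestKey = Tuple[str, str, str]


def plan_ingest(ingested: Dict[TestKey, Set[str]], key: TestKey,
                requested: Set[str], re_ingest_file: bool = False) -> Set[str]:
    """Return the omics types that would be processed for this request."""
    if re_ingest_file:
        # The existing test is reused, but everything is re-processed.
        return set(requested)
    # Otherwise skip the types already ingested for this test.
    return set(requested) - ingested.get(key, set())
```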

FHIR Resources

The following FHIR resources are generated as part of the variant processing workflow: