
HISTORY - A practical introduction to bioinformatics and RNA-seq using Galaxy - 04-09-2023

General information

Date: 04-09-2023
Time: 09:00 - 13:00
Location: Online (Zoom)
Code of Conduct: coc

Before

Introduction: who is who?

Introduction to BioNT

Schedule for the workshop

Day | Tutorial | Instructor
1 - Monday | From peaks to genes | Teresa
2 - Tuesday | QC | Bérénice
2 - Tuesday | Mapping | Bérénice
3 - Wednesday | Reference-based RNA-seq - Part I | Teresa
4 - Thursday | Reference-based RNA-seq - Part II | Bérénice
5 - Friday | Learning about one gene across biological resources and formats | Lisanna
5 - Friday | One protein along the UniProt page | Lisanna

How will the workshop be run?

Why a specific setup for this workshop?

How will we do?

How to participate?

Ask your questions, raise issues, interact with us in this Document

In addition, to help you navigate this document, we followed the structure of the tutorial and included:

Let’s try now!

✏️ Hands-on: Get set up
❓ Are you on this HedgeDoc? (Add a + when done)
❓ Have you ever used Markdown? (Add a +)

Using Galaxy for the workshop

Why do we use Galaxy this week?

❓ Have you ever used Galaxy? (Add a +)

We will use Galaxy Europe and its training infrastructure (called TIaaS)

✏️ Hands-on: Register and log in on Galaxy Europe
❓ Did you register on Galaxy Europe? (Add a + when done)

Day 1 - Monday

Table of Contents

  1. Icebreaker
  2. Galaxy Introduction
  3. Tutorial: From peaks to genes
  4. Summary
  5. Feedback

Icebreaker

❓ Tell us about a recent First in your life.

This could be big or small, perhaps you bought a house for the first time or you tried a new restaurant in your city. Recent can be any time in the past year.

Galaxy Introduction

❓ Any questions regarding this introduction?

Q1: What about data protection?
A: In this link you can find some information about that https://galaxyproject.org/learn/privacy-features/

Q2: Will we work with Galaxy EU or org?
A: Galaxy Europe for this workshop

Q3: Is a commercial resource allocation service offered to speed up analyses, not only for teaching purposes?
A: Galaxy Europe and usegalaxy.org are indeed not dedicated only to teaching; they are used by researchers all around the world to perform their analyses. Multiple alternative instances (private and public) are also available. For example, there is a public server hosted in Australia (usegalaxy.org.au).

Q4: How can we visualise a history as a workflow?
A: You will get an introduction to that in the tutorial later today. You can simply extract a workflow from the history.

Q5: Will the data sets for this workshop be made available for those that want to follow up with the hands-on training?
A: Yes all data sets are made available, not only for this workshop but also after

Q6: A technical question: if I run a local Galaxy server instance, is it connected to the other Galaxy instances?
A: No, each server operates independently.

Q7: Can I install Galaxy on my local machine?
A: Yes, you can. The Galaxy Training Network (GTN) provides several tutorials on configuring your own Galaxy server: https://training.galaxyproject.org/training-material/topics/admin/

Q8: Are we getting a certificate?
A: We will provide the certificate once you have completed the post-workshop survey on the last day and have let the organisers know that you would like a certificate.

❓ Are you back?

Tutorial: From peaks to genes

Pretreatments

✏️ Hands-on: Open Galaxy
❓ Are you finished with this section? Add a ‘+’ below
Your questions
✏️ Hands-on: Open the tutorial
  1. Go to Galaxy (done in previous Hands-on)
  2. Click on the hat in the top bar
  3. Navigate to the topic “Introduction to Galaxy Analyses”
  4. Click on the tutorial “From peaks to genes”
❓ Have you found the tutorial? (Add a + when done)
Your questions

Q1: https://usegalaxy.eu/libraries/folders/F596c752a08d6a88c/page/1 does not contain the tutorial? Where can I find the tutorial itself? OK, thanks. Found it.
A: It contains the data for the tutorial, not the tutorial itself. You can find the tutorial as explained above:

  1. Go to Galaxy (done in previous Hands-on)
  2. Click on the hat in the top bar
  3. Navigate to the topic “Introduction to Galaxy Analyses”
  4. Click on the tutorial “From peaks to genes”
✏️ Hands-on: Create history
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Are there naming restrictions for histories?
A: Not that I know of :) I know that you can even use Unicode emojis. But I am asking around to confirm.

Q2: About datatypes - no question. Link to share: https://training.galaxyproject.org/archive/2022-06-01/faqs/galaxy/datatypes_understanding_datatypes.html
A: Thanks!

✏️ Hands-on: Data upload
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: When we use the interval datatype, are the first 3 columns always known, and are they always chrom, start and end?
A: Yes, they should be. You can find additional information about how the different datatypes are defined in Galaxy here: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/datatypes_conf.xml.sample
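
To make the column layout concrete, here is a minimal Python sketch that reads the first three columns of an interval-like line. The names `Interval` and `parse_interval` and the sample line are made up for illustration; the sketch assumes tab-separated text with chrom, start and end in columns 1 to 3.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_interval(line: str) -> Interval:
    """Read chrom, start and end from the first three tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    # Any further columns (name, score, strand, ...) are ignored here.
    return Interval(fields[0], int(fields[1]), int(fields[2]))


# Hypothetical example line, not taken from the workshop data.
region = parse_interval("chr1\t3660676\t3661050\tpeak_1\n")
print(region.chrom, region.end - region.start)
```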

Q2: What is the difference between “assign datatype” and “target datatype”?
A: The “assign datatype” option lets you declare the format the dataset currently has, whereas “target datatype” converts the dataset into a different format.

✏️ Hands-on: Inspect and edit attributes of a file
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: If I want to construct my own analysis, is it possible to use a more recent reference genome, even though the mapping was done with a previous genome version?

A: No, this is not possible. The genome used in the mapping step should match the one used in the downstream analysis (note: one reason is that gene coordinates can change quite a lot between genome versions). But in Galaxy it is very easy to re-run the mapping with another reference genome version.

Q2:
A:thanks!

✏️ Hands-on: Data upload from UCSC
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Can we upload data from databases other than UCSC? Thanks!
A: Yes. You just need to download the datasets and upload them, or copy the URL from the database (e.g. GEO) and paste the link, using the data uploader tool in Galaxy.

Q2: What do the names of the columns from 5 to the end mean in the ‘genes’ file?
A: Does this explanation of the BED file columns help? http://www.ensembl.org/info/website/upload/bed.html

Q3: Can you repeat the explanation of how to compare two files?
A: To compare files in Galaxy, click “Enable/Disable Window Manager” in the top bar. Once it turns yellow, choose a dataset from your history by clicking on its “eye” icon. The chosen dataset then opens in a separate window rather than in the main Galaxy view.

Part 1 Naive approach

✏️ Hands-on: View file content
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q: I cannot view the files side by side.
A: What issue did you get? You probably need to click on the square icon next to the bell (at the top).

Q: What does “tail” mean in the tool names?
A: Tail refers to the Unix tool “tail”, which retrieves the end of a file.
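
For those who prefer to see the idea in code, this is a small Python sketch of what a tail-like operation does. The function name `last_lines` and the example data are made up; the real Unix tool has many more options.

```python
from collections import deque
from typing import Iterable, List


def last_lines(lines: Iterable[str], n: int = 10) -> List[str]:
    """Return the last n lines, holding at most n lines in memory."""
    # A deque with maxlen drops the oldest entry whenever a new one arrives.
    return list(deque(lines, maxlen=n))


# Hypothetical in-memory "file" with 100 lines.
example = [f"line {i}" for i in range(1, 101)]
print(last_lines(example, 3))
```

Nothing is removed from the input; the function only reports what the end looks like.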

❓ While the file from UCSC has labels for the columns, the peak file does not. Can you guess what the columns stand for?
✏️ Hands-on: View end of file
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Why do we use the Select last tool? I mean, why do we want to cut the ends? OK, thanks!
A: The Select last tool is only used to see what the end of the file looks like. It does not cut anything from the original file.

❓ How are the chromosomes named?
❓ How are the chromosomes X and Y named?
✏️ Hands-on: Adjust chromosome names
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Is there any other way to get to Replace Text? How do I get to the Replace Text interface?
A: You can find it in the tool search bar under the name “Replace text in a specific column”. Alternatively, if you opened the tutorial inside the Galaxy interface, you can click on the tool name.

Q3: What is the purpose of &?
A: & is a placeholder for the text matched by the search pattern.
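
The same idea can be sketched in Python, where the whole match is written `\g<0>` instead of `&`. The pattern and the function name `add_chr_prefix` below are assumptions for illustration, not the exact settings used in the tutorial.

```python
import re


def add_chr_prefix(chrom: str) -> str:
    """Prefix purely numeric or X/Y chromosome names with 'chr'."""
    # \g<0> stands for "everything the pattern matched", like & in the tool.
    return re.sub(r"^[0-9XY]+$", r"chr\g<0>", chrom)


for name in ["1", "19", "X", "chr2"]:
    print(name, "->", add_chr_prefix(name))
```

Names that do not match the pattern, such as ones that already start with `chr`, are returned unchanged.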

❓ How many regions are in our output file? You can click the name of the output to expand it and see the number.

Let’s come back at 11:50 (CEST)

❓ Are you back?

Analysis

✏️ Hands-on: Add promoter region to gene records
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: How do we familiarize ourselves with the tools?
A: To gain experience and learn to identify the most important tools for each type of analysis, probably the best approach is to follow the tutorials hosted in the Galaxy Training Network that relate to your topic of interest. If there is no training available for your specific field of research, you can always request it from the Galaxy community :) You can post your request in the Gitter channel (https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im).

Additionally, you can type your query in the tool search box in Galaxy to check whether the tools you are looking for are available, and you can use the tool recommendation feature hosted on Galaxy Europe to find out which further tools are available for extending your analysis.

An overview of all available tools is here: https://usegalaxy-eu.github.io/tools.html

✏️ Hands-on: Change format and database
❓ Are you finished with this section? Add a ‘+’ below
Do you need help?

Please describe your issue

✏️ Hands-on: Find Overlaps
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: The number of regions from the intersect is quite high. Is that a usual result for a ChIP-seq analysis?
A: It depends a lot on the protein of interest and the experimental conditions of your analysis.

Do you need help?

Please describe your issue

✏️ Hands-on: Count genes on different chromosomes
❓ Are you finished with this section? Add a ‘+’ below
❓ Which chromosome contained the highest number of target genes?

Visualization

✏️ Hands-on: Fix sort order of gene counts table
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: My work is completed and fine, but I am wondering: what exactly did we sort?
A: The gene counts, in descending order.
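
As a rough illustration of "count per chromosome, then sort by count", here is a short Python sketch. The function name `count_by_chromosome` and the example column are hypothetical and not taken from the workshop data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_by_chromosome(chroms: Iterable[str]) -> List[Tuple[str, int]]:
    """Count rows per chromosome and sort by count, highest first."""
    counts = Counter(chroms)
    # Sort by count (descending), then by name so that ties are ordered stably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chromosome column from an overlap result.
example = ["chr1", "chr2", "chr1", "chrX", "chr2", "chr1"]
print(count_by_chromosome(example))
```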

Do you need help?

Please describe your issue

✏️ Hands-on: Draw barchart
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: What is the chromosome named zero?
A: The chr0 in the mm10 genome assembly refers to a placeholder chromosome, which corresponds to unfinished or unplaced sequences.

Q2: How can I have all the chromosome names on the x axis?
A: You may need to extract a subset of the data to show gene names, because printing the names for a dataset with over 10,000 rows would make the plot look cluttered. You can extract a subset by filtering, for example on gene counts, and then work on the smaller subset. To show gene names, you can use Jupyter notebooks, which may require a bit of programming to customize your plot.

As an alternative, you could use https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ggplot2_histogram/ggplot2_histogram/3.4.0+galaxy0

Q3: Where can I find my saved visualisations?
A: Check in the top panel under “User” -> “Visualizations”.

Extracting workflow

✏️ Hands-on: Extract workflow
❓ Are you finished with this section? Add a ‘+’ below

Share your work

✏️ Hands-on: Share history and workflow
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: If we share the history with a user, can that person run any part of the workflow?
A: Could you provide some more details about the question?

General questions

❓ General questions regarding today

Q1: Is it possible to get the questions and answers from this document? It may also be useful for future reference!
A: Yes, it will be made available.

Q2: Will we be able to access this HedgeDoc file after the course ends? Or maybe save it as a PDF file?
A: We will share a document with everything at the end.

Q3: After the workshop, will we have the same access to the Galaxy platform/tools even if we are not affiliated with any research institution/SME right now?
A: Yes

Summary

Feedback

❓ One thing that was good about today
❓ One thing to improve
❓ Any other comments?

Day 2 - Tuesday

Table of Contents

  1. Welcome
    1. Repetition of the day before
  2. Slides: Quality Control
  3. Tutorial: Quality Control
  4. Slides: Mapping
  5. Tutorial: Mapping
  6. Summary
  7. Feedback

Welcome

Today is about quality control and mapping (the foundation of HTS analysis)

Location: Online (Zoom)

Repetition of the day before

❓ What do you remember from yesterday?
❓ Do you have a question from the day before?

Slides: Quality Control

Disclaimer: We will not go through the full slidedeck

❓ Any questions regarding this introduction to Quality Control?

Q1: Should the number of characters in line 4 (quality) be the same as in line 2?
A: Yes, the number of characters in the DNA sequence (line 2) should equal the number of quality-encoding characters (line 4): one quality character for each nucleotide.
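
This rule is easy to check programmatically. The sketch below is only an illustration with a made-up record and a hypothetical function name `check_fastq_record`; it assumes the common 4-line FASTQ layout.

```python
from typing import List


def check_fastq_record(record: List[str]) -> bool:
    """Return True if a 4-line FASTQ record is internally consistent."""
    if len(record) != 4:
        return False
    header, sequence, separator, quality = record
    return (
        header.startswith("@")
        and separator.startswith("+")
        # One quality character per nucleotide.
        and len(sequence) == len(quality)
    )


# Made-up record: 7 bases and 7 quality characters.
good = ["@read1", "GATTACA", "+", "IIIIIII"]
print(check_fastq_record(good))
```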

Q2: On this slide the quality is very bad. Do we need to discard this data?
A: No. In the case of Oxford Nanopore reads, the average quality is usually quite low due to technical limitations. An alternative that you can use for Nanopore data is NanoPlot.

Q3: Where can I find more examples of what good and bad average quality look like?
A: It depends a lot on the technology used to generate the data. I would recommend reading about the different sequencing technologies to learn about the pros and cons of each of them. Apart from that, most GTN tutorials include a QC step which provides meaningful examples. In addition to the GTN tutorials, you can also find examples in the FastQC documentation.

Tutorial: Quality Control

✏️ Hands-on: Open the tutorial
  1. Go to Galaxy (done in previous Hands-on)
  2. Click on the hat in the top bar
  3. Navigate to the topic “Sequence analysis”
  4. Click on the tutorial “Quality Control”
❓ Have you found the tutorial? (Add a + when done)
Your questions

Q1: I only find the slides and the data from yesterday. What am I doing wrong?
A: Please use this link to the QC tutorial: https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html

If you navigated to the ‘Sequence analysis’ section, you should see the line displayed above. There you find the slide deck as well as the hands-on section. Clicking on the ‘laptop’ icon opens the tutorial.

Inspect a raw sequence file

✏️ Hands-on: Data upload
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Is the third line (the one beginning with +) always empty?
A: Yes, this line carries no useful information, but it is required in FASTQ files (a requirement of the standard format).

✏️ Hands-on: Inspect the FASTQ file
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1:Could u plz tell again how did u find 38 phred score for the G? Thanks
A: Here you can find the equivalences: Quality scores
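
The lookup can also be done by hand or in a few lines of Python. This sketch assumes the Phred+33 encoding (Sanger / Illumina 1.8+), where the score is the ASCII code of the character minus 33 and the error probability is 10^(-Q/10). The function names are made up for illustration.

```python
def phred_score(char: str, offset: int = 33) -> int:
    """Convert one FASTQ quality character to its Phred score."""
    return ord(char) - offset


def base_accuracy(score: int) -> float:
    """Probability that the base call is correct: 1 - 10^(-Q/10)."""
    return 1 - 10 ** (-score / 10)


score = phred_score("G")  # ASCII code of 'G' is 71, and 71 - 33 = 38
print(score, round(base_accuracy(score), 5))
```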

❓ Which ASCII character corresponds to the worst Phred score for Illumina 1.8+?
❓ What is the Phred quality score of the 3rd nucleotide of the 1st sequence?
❓ What is the accuracy of this 3rd nucleotide?

Assess quality with FASTQE 🧬😎 - short reads only

✏️ Hands-on: Quality check (FASTQE)
❓ Are you finished with this section? Add a ‘+’ below
❓ What is the lowest mean score in this dataset?

Assess quality with FastQC - short & long reads

✏️ Hands-on: Quality check
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: What does the triangle with “!” stand for?
A: Quality score 5. You can find the full documentation here

Q2: Please can I have the data once more?
A: Sorry, what do you mean? Do you want to get the data into Galaxy again? You need to paste this link in the Uploader tool -> Paste/Fetch data, and then click on Start.

Q3: What will happen if your Galaxy storage is 100 percent full?
A: You would need to free up some space or, if you are running an analysis that exceeds your quota, ask for more space. However, this cannot always be provided.

❓ Are you back?
❓ Any questions regarding what we did until now?

Q1: Regarding the per tile sequence quality: is it our fault if there was a mistake in the process?
A: If you get a different sequencing quality than the results shown in the session, please check your input and see whether you changed one of the parameters by mistake. In general, if your sequencing data has bad quality, please check with the sequencing facility and with the lab that prepared the samples to find the reasons for the bad quality.

❓ Is the speed fine?
❓ Which Phred encoding is used in the FASTQ file for these sequences?
❓ How does the mean quality score change along the sequence?
❓ Is this tendency seen in all sequences?
❓ Why is there a warning for the per-base sequence content graphs?
❓ Your questions

Q1: Not super important, but out of curiosity: is there a biological/technical reason why 16S DNA has this bias compared to RNA-seq?
A: It could be caused by 5’ truncated 16S rRNAs with 3’ poly(A) tails (which could explain why they are enriched in adenine). A poly(A) tail is normally attached to the 3’ end of an mRNA molecule and is generally believed to stabilize the RNA and protect it from degradation. I do not have enough information about how the samples were obtained, but it is possible that the 16S rRNA was purified using the poly(A) tail method, which could also contribute to adenine enrichment.

16S rRNA sequencing targets a specific region of the ribosomal RNA gene, the 16S subunit in the case of bacteria and archaea. This gene is highly conserved across these organisms. The bias observed in 16S rRNA sequencing is mainly due to its specificity for the 16S region. It focuses on a small portion of the genome, limiting the amount of information obtained about the whole transcriptome.

❓ Why is there a fail for the per sequence GC content graphs?
❓ How could we find out what the overrepresented sequences are?
❓ Your questions

Q1: How would we know whether the GC content is normal, or due to contamination or bias? And what could we do about it? Is it acceptable up to a certain level?
A: The GC content depends on your organism, so you need to know your experiment. Many factors contribute to the GC distribution. For example, Archaea show much higher GC content, since it is involved in stabilizing DNA under high-temperature conditions (C forms three hydrogen bonds with G, whereas A forms only two with T). Another contributing factor is the distribution of GC microsatellites, which is taxon-specific. Usually, a non-normal GC distribution indicates the presence of contaminants or some degree of degradation in the samples.
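
For reference, the GC content of a single sequence is simply the fraction of G and C bases. A minimal Python sketch follows; the function name `gc_content` and the example sequence are made up, and ambiguous bases such as N are ignored here, which is only one of several possible conventions.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G and T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # Avoid dividing by zero for empty or all-N sequences.
    return gc / total if total else 0.0


print(gc_content("ATGCGC"))
```

A tool like FastQC computes this per read and then plots the distribution over all reads.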

Q2: Are all the sequences with a percentage above 0.1% the ones that are overrepresented? Thanks!
A: Yes, but they should be checked against the list of contaminants to find out what they really are.

Q3: What is blast?
A: BLAST (Basic Local Alignment Search Tool) is a widely used software tool in the field of bioinformatics. It is used for comparing sequences of biological molecules such as DNA, RNA, or protein to identify similarities and potential homologies. For more details, have a look at: https://blast.ncbi.nlm.nih.gov/Blast.cgi

Q4: The per tile sequence quality graph that appears to me is different from the one that appears in the tutorial. Is that because the new data (which actually has no red at all) is “better”? Thank you
A: Yes, it means that there is no batch effect between your samples.

Q5: Could debris in the sample affect the quality of the sequencing?
A: Yes, contamination is one of the artifacts that can affect the quality. It is possible to evaluate potential contamination with additional tools, such as Diamond (a faster alternative to BLAST).

Q6: Is 20 the common value used for trimming?
A: Yes, it is the usual threshold, but there is no “objective” reason why it should be 20 and not 15 or 25. It also depends on the kind of sequences you have and on your research goals. In this experiment it has been set to 20, but in other experiments this value may differ, depending on many factors such as library preparation.
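
To see what such a threshold does to a read, here is a rough Python sketch of 3' quality trimming based on the partial-sum idea described in the Cutadapt documentation: subtract the cutoff from every quality value, sum from the 3' end, and cut where that sum is lowest. It is an illustration only, not Cutadapt's actual code; the function name `trim_3prime` and the quality values are illustrative.

```python
from typing import List


def trim_3prime(qualities: List[int], cutoff: int = 20) -> int:
    """Return how many bases to keep after 3' quality trimming."""
    best_sum = 0  # trimming nothing corresponds to an empty suffix with sum 0
    keep = len(qualities)
    running = 0
    # Walk from the 3' end, summing (quality - cutoff) over the suffix.
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            keep = i
    return keep


# Illustrative Phred scores for a 10-base read, trimmed with a cutoff of 10.
quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(trim_3prime(quals, cutoff=10))
```

Note that a single base above the cutoff near the end (the 11 here) does not stop the trimming, because the surrounding low values outweigh it.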

Trim and filter - short reads

✏️ Hands-on: Improvement of sequence quality
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Would the workflow we are following be the same for scRNA-seq?
A: Yes, usually you follow similar steps. You can find more information about scRNA-seq analysis in the collection of Galaxy single cell trainings.

Do you need help?

Please describe your issue

❓ What % reads contain adapter?
❓ What % reads have been trimmed because of bad quality?
❓ What % reads have been removed because they were too short?

Processing multiple datasets

✏️ Hands-on: Assessing the quality of paired-end reads
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: How do you know which one is forward and which one is reverse?
A: Usually it is stated by the sequencing facility. It is a convention to mark the forward file with _1 and the reverse file with _2 at the end of the file name.
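
If you have many files, this naming convention can be used to pair them automatically. The sketch below is an assumption-laden illustration: the function name `pair_reads`, the accepted extensions and the example file names are all made up, and real file names from a facility may follow a different scheme.

```python
import re
from typing import Dict, Iterable, Tuple


def pair_reads(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group files into (forward, reverse) pairs by their _1/_2 suffix."""
    found: Dict[str, Dict[str, str]] = {}
    for name in filenames:
        match = re.match(r"^(.+)_([12])\.(?:fastq|fq)(?:\.gz)?$", name)
        if match:
            sample, mate = match.groups()
            found.setdefault(sample, {})[mate] = name
    # Keep only the samples for which both mates are present.
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in found.items()
        if "1" in mates and "2" in mates
    }


# Hypothetical file names.
files = ["sampleA_2.fastq.gz", "sampleA_1.fastq.gz", "lonely_1.fq"]
print(pair_reads(files))
```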

❓ What do you think about the quality of the sequences?
❓ What should we do?
✏️ Hands-on: Improving the quality of paired-end data
❓ Are you finished with this section? Add a ‘+’ below
❓ How many basepairs have been removed from the reads because of bad quality?
❓ How many sequence pairs have been removed because they were too short?

Let’s come back at 12:10 (CEST)

❓ Are you back?
❓ Any questions regarding what we did until now?

Q1: If you have different single-end sequencing libraries that belong together (e.g. WT and a knock-out mutant), would you process them together in cutadapt but leave the single-end option selected?
A: Yes. A recommended approach would be to create a collection of all the reads, pre-process them all together and, after the trimming/QC evaluation, split the collection according to the different experimental conditions.

Q3: In case we have a subset of our own data and a public dataset (from different resources), should we process them together in the same run to get a better normalization of batch effects? Or do we need to do the normalization/batch-effect correction downstream? For example, patient data is not easy to get, so public data seems to be the best option.
Any idea how to reuse this data so that it fits my research study?
Do you mean trying to avoid differences in all factors as much as possible? What about the acceptability of data that comes from different sources, for example in terms of mapping rate or expression ratio/TPM?
This is indeed a very complex problem. You could check the expression of constitutive genes as a kind of control. You would expect constitutive genes to have similar expression patterns between samples, even if they come from different data sources.

A: It is usually a bad idea to use data from different resources, at least if you intend to compare different experimental conditions. The reason is that artifacts caused by technical differences could introduce too much noise. To analyze data from different resources, you need to check carefully that the same instruments/sequencers have been used, and also evaluate the metadata supplied by the data providers (they usually give additional information about the kits used to extract the samples, etc.).

❓ Is the speed fine?

Slides: Mapping

Disclaimer: We will not go through the full slidedeck

❓ Any questions regarding this introduction to Mapping?

Q1: As I understand it, the alignment tells me which positions of the reference (the pattern) our read, probably a piece of the DNA, corresponds to.
A: Yes, mapping gives you the positions of your reads on the reference genome sequence.

Q2: But which tool should be used? What are the variables to be considered? Choosing an aligner
A: Do you mean a tool for mapping? There are several mapping tools, such as Bowtie2, RNA STAR, etc. Each of these tools has different parameters/variables to consider, depending on the kind of algorithm it uses internally. Please have a look at these tools in Galaxy to find out more about them. One of them will be used for mapping in today’s workshop.

One factor to take into account is whether the mapper can handle spliced alignments (needed for RNA-seq data, e.g. RNA STAR). On the other hand, if you are working with DNA-seq data, BWA-MEM2 works well, as there is no need to account for spliced alignments.

You can check the paper suggested in the presentation.

Q3: Are the SAM and BAM file formats used for the results of mapping?
A: Yes, those files store information about the mapping. You can find more information here.
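
To give a feel for what a SAM file stores, here is a small Python sketch that splits one alignment line into the 11 mandatory SAM fields and checks two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The function names and the example line are made up; for real work a dedicated library or samtools is the better choice.

```python
from typing import Any, Dict

SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one SAM alignment line into its 11 mandatory fields."""
    values = line.rstrip("\n").split("\t")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


def is_unmapped(record: Dict[str, Any]) -> bool:
    return bool(record["FLAG"] & 0x4)


def is_reverse(record: Dict[str, Any]) -> bool:
    return bool(record["FLAG"] & 0x10)


# Made-up alignment: a 7-base read mapped to the reverse strand of chr1.
rec = parse_sam_line("read1\t16\tchr1\t100\t42\t7M\t*\t0\t0\tGATTACA\tIIIIIII")
print(rec["RNAME"], rec["POS"], is_reverse(rec), is_unmapped(rec))
```

BAM holds the same information in a compressed binary form, so it cannot be read line by line like this.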

Q4: I guess for prokaryotic data you would use different mapping tools due to the different genome structure?
A: Yes in the case of RNA-seq data, and no in the case of DNA-seq data, where you use the same tools.

Tutorial: Mapping

✏️ Hands-on: Open the tutorial
  1. Go to Galaxy (done in previous Hands-on)
  2. Click on the hat in the top bar
  3. Navigate to the topic “Sequence analysis”
  4. Click on the tutorial “Mapping”
❓ Have you found the tutorial? (Add a + when done)

Prepare the data

✏️ Hands-on: Data upload
❓ Are you finished with this section? Add a ‘+’ below
❓ What is a reference genome?
❓ For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
❓ Which reference genome should we use?
✏️ Hands-on: Mapping with Bowtie2
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: In this example, could we use a different tool, e.g. RNA STAR?
A: An alternative to Bowtie2 is BWA. We have DNA data, so you should prefer a DNA mapper over RNA ones like STAR or HISAT. You could use RNA STAR, but due to the differences in the mapping algorithms, the results could differ to some degree. RNA STAR also requires much more computational resources (so it will take longer).

Q2: Is it beneficial to use more than one mapping tool in order to get more relevant results at the end? Thanks
A: No, it is usually unnecessary. Some tools (such as RNA STAR) can perform the mapping in a two-step mode, which can be necessary in specific situations, such as the identification of new splice sites.
What could be more beneficial is to read up on the parameters and adjust them to your data.

❓ What information is provided here?
❓ How many reads have been mapped exactly 1 time?
❓ How many reads have been mapped more than 1 time? How is it possible? What should we do with them?
❓ How many pairs of reads have not been mapped? What are the causes?

Inspection of a BAM file

✏️ Hands-on: Inspect a BAM/SAM file
❓ Are you finished with this section? Add a ‘+’ below

General questions

❓ Questions

Q1: I would like to help with translations of the course tutorials into Spanish, but I have only found information on how to create new tutorials.
https://training.galaxyproject.org/training-material/topics/contributing/tutorials/create-new-tutorial/tutorial.html
Do you know how I can do that?
A: You could try contacting Wendi Bacon (wendi.bacon@open.ac.uk); she was involved in the translations. Excellent. Thanks
A: We will also translate the materials as a BioNT activity, but at a later time. Keep an eye on our social channels for updates about translations of BioNT materials in general (beyond this specific tutorial).

Q2: Excuse me, this is probably a basic question. Is the mapping tool the tool that we can use to test filiation (father - son)?
A: Do you want to use a tool like DESeq2 for differential evaluation, or a tool specifically for mapping?
I do not know exactly which tool. I am thinking of two samples (no reference) and mapping one to the other to find the filiation between them.

Summary

Feedback

❓ One thing that was good about today
❓ One thing to improve
❓ Any other comments?

Day 3 - Wednesday

Table of Contents

  1. Welcome
  2. Repetition of the day before
  3. Slides
  4. Tutorial
  5. Summary
  6. Feedback

Welcome

Repetition of the day before

❓ What do you remember from yesterday?
❓ Do you have a question from the day before?

Slides: Transcriptomics

❓ Any questions regarding this introduction to Transcriptomics?

Q1: Why are we not using gene-level values such as TPM for downstream analysis? What is the purpose of the TPM calculation step?
A: Because TPM does not cover every normalization that is needed, e.g. for differences in library composition. We will cover that tomorrow.
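
As a reminder of what TPM is, here is a minimal Python sketch of the usual definition: divide each gene's read count by its length in kilobases, then rescale so that the values in a sample sum to one million. The function name `tpm` and the toy numbers are made up for illustration.

```python
from typing import Dict


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Transcripts per million from raw counts and gene lengths in bp."""
    # Reads per kilobase corrects for gene length.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    # Rescale so that all values in the sample sum to one million.
    return {gene: (rate / total * 1e6 if total else 0.0) for gene, rate in rates.items()}


# Toy example: gene B has 3x the reads but is also 3x as long as gene A.
result = tpm({"A": 100, "B": 300}, {"A": 1000, "B": 3000})
print(result)
```

Because the values always sum to one million within a sample, a few very highly expressed genes push the values of all other genes down, which is one way to see why differences in library composition are not handled.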

Q2: edgeR and DESeq2 have their own normalisation methods. Does that mean that you don’t need to do any normalisation beforehand? I mean, you need raw data! Is that correct?
A: Yes. We will see that tomorrow.

Q3: I read some suggestions that Cufflinks counting is not correct when compared with other tools like featureCounts or STARcount. Is that right?
A: Cufflinks is not the recommended way to go nowadays. We recommend featureCounts.

Q4: If I perform variant calling using RNA-seq, do I need to exclude SNVs present in the 3’ and 5’ UTR regions? Will they have a higher chance of being false positives?
A: No, it is usually not required. 5’ UTR mutations can, for example, affect promoter activity, so they are still “useful” from a biological point of view (e.g. https://pubmed.ncbi.nlm.nih.gov/23027126/). Mutations in the 3’ UTR can also affect gene expression at different levels (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6267165/). Why do you expect more false positives?
Because they are not necessarily in the exonic region, right?
But they are still under selective pressure (e.g. 5’ mutations that block expression of essential genes are expected to be removed from the population). So I think no more false positives would be found in those regions.
Thank you for your help :)
This is interesting regarding your question, I think: Germline de novo mutation rates on exons versus introns in humans

Tutorial: Reference-based RNA-Seq data analysis

✏️ Hands-on: Open the tutorial
  1. Go to Galaxy (done in previous Hands-on)
  2. Click on the hat in the top bar
  3. Navigate to the topic “Transcriptomics”
  4. Click on the tutorial “Reference-based RNA-Seq data analysis”
❓ Have you found the tutorial? (Add a + when done)

Data upload

✏️ Hands-on: Data upload
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: If I search through the shared library, I cannot find the data. I get the message: This folder is either empty or you do not have proper access permissions to see the contents. If you expected something to show up please consult the library security wikipage

A: What did you search and where? (please guide me through the steps you did).
I searched in Libraries
GTN - Material > Transcriptomics > then entered GSM461177
Did you enter the folder DOI...?
Where?
In GTN - Material > Transcriptomics > Reference-based RNA-Seq
Does that work now? No, I copy the files from the tutorial, but I cannot find if I use shared libraries
So it works with the URL? Yes
Good :)

❓ How are the DNA sequences stored?
❓ What are the other entries of the file?

Quality control

✏️ Hands-on: Quality control
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Could u plz repeat why do we flatten our files? Thanks!
A: Unfortunately, the current version of MultiQC (the tool we use to combine reports) does not support list-of-pairs collections. So we need to transform our list of pairs into a simple list before running FastQC.
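
Conceptually, flattening turns a nested structure into a flat one. The Python sketch below mimics that idea with plain dictionaries; it is not how Galaxy implements collections, and the function name `flatten_pairs` and the file names are made up.

```python
from typing import Dict, List, Tuple


def flatten_pairs(collection: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Turn {sample: {forward, reverse}} into a flat list of (name, file)."""
    flat: List[Tuple[str, str]] = []
    for sample, pair in collection.items():
        for direction in ("forward", "reverse"):
            # The direction becomes part of the element name.
            flat.append((f"{sample}_{direction}", pair[direction]))
    return flat


# Hypothetical list of pairs with one sample.
pairs = {
    "GSM461177_untreat_paired": {"forward": "f.fastq", "reverse": "r.fastq"},
}
print(flatten_pairs(pairs))
```

After flattening, the information about which two files belong together is kept only in the element names.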

Q2: Is it normal that the data files have not finished uploading since we started? It is taking longer than 15 minutes. Yes. I tried URL. Ohh I get it, thanks.
A: Are you using TIaaS? Did you get your data using the URLs or the shared data library? Using the URLs, it can sometimes take a while if many people are doing the same. Please try using the data from the shared data library.

Do you need help?

Please describe your issue

❓ What is the read length?
❓ What do you think of the quality of the sequences?
❓ What should we do?
❓ What is the relation between GSM461177_untreat_paired_forward and GSM461177_untreat_paired_reverse ?
✏️ Hands-on: Trimming FASTQs
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Why do we not use the flattened dataset?
A: Because we want to have forward and reverse together, as in the paired collection

Q2: When you decide on the parameters, should the minimum length and the quality cutoff be the same? So if we choose another minimum length, should we also change the cutoff?
A: There is no direct link between the minimum length and the quality cutoff. The minimum length is in bp and will depend on your input data. Here we have quite short reads, so we use a small value. The quality cutoff corresponds to a Phred score, so a value between 0 and 60.

Is there any value that is commonly used when you decide on that cutoff?
Usually 20 is a good value for the quality cutoff (and 50 for the minimum length, if you have sequences of at least around 100 bp), but it also depends on your sequences of interest. For example, when analyzing miRNA-seq samples, the minimum length should be set to around 20 nt.
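To make the quality cutoff more concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below only illustrates that relationship. The function name is ours and is not part of Cutadapt, whose trimming algorithm does more than compare single bases to a threshold.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Print the error probability and accuracy for a few typical Phred scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

So a cutoff of 20 roughly means "bases with more than a 1 in 100 chance of being wrong are considered low quality".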

Q3: Why did we choose single-end in Cutadapt? Shouldn’t it be the paired-end collection?
A: Yes, in cutadapt we need to use paired-end collection as you mentioned.

Do you need help?
❓ Why do we run the trimming tool only once on a paired-end dataset and not twice, once for each dataset?
❓ How many sequence pairs have been removed because at least one read was shorter than the length cutoff?
❓ How many basepairs have been removed from the forward reads because of bad quality? And from the reverse reads?

Mapping

❓ What is a reference genome?
❓ For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
❓ Which reference genome should we use?
✏️ Hands-on: Spliced mapping
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: What exactly does it mean to compute coverage?
A: It computes the coverage, i.e. the number of reads mapping at each bp.
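As an illustration of "number of reads mapping at each bp", here is a minimal sketch. It is not how STAR computes coverage internally. It assumes each read is a simple (start, end) interval with a 0-based start and an exclusive end, and it ignores spliced reads.

```python
def per_base_coverage(read_intervals, region_length):
    """Count how many reads cover each position of a region.

    read_intervals: list of (start, end) tuples, 0-based start, end exclusive.
    """
    coverage = [0] * region_length
    for start, end in read_intervals:
        # Clip the read to the region so we never index outside the list.
        for pos in range(max(start, 0), min(end, region_length)):
            coverage[pos] += 1
    return coverage


# Three hypothetical reads on a 10 bp region.
reads = [(0, 4), (2, 6), (3, 8)]
print(per_base_coverage(reads, 10))  # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
```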

Q2: If we want to get the GTF for the genome that we used from UCSC, would we follow the same steps that we did yesterday? I am not sure about some of the parameters when I use UCSC.
A: Yes, you can use UCSC or other databases. You will probably need to read a bit more about the parameters to figure out which values to select.
And it doesn’t matter what source we are using, as long as we use the same version? E.g. Ensembl as a source of the genome

Q3: Will we get the same results if we use HISAT2 instead of RNA STAR? Which is the better choice?
A: The results would not be exactly the same, but quite similar. You can find a technical comparison in “Evaluation of Seven Different RNA-Seq Alignment Tools Based on Experimental Data from the Model Plant Arabidopsis thaliana”. For example, according to the paper, “STAR has a higher tolerance for more soft-clipped and mismatched bases compared to HISAT2, which leads to a higher mapping rate for STAR and more unmapped reads for HISAT2”. Thanks a lot!

Q4: Why are we using both the built in and the gtf file?
A: The built-in genome is only the FASTA sequence of the reference genome. The GTF contains information about the locations of the genes (information not found in the FASTA file). Using the GTF, we can count the number of reads mapped to each gene.

Q5: For the “Length of the genomic sequence around annotated junctions” do you always take the read-length before trimming?
A: Yes, because trimming produces reads of different lengths, so it is hard to know the new length.

Do you need help?

Please describe your issue

❓ Which information do you find in a SAM/BAM file?
❓ What is the additional information compared to a FASTQ file?
❓ Are you back?
❓ Any questions regarding what we did until now?

Q1: I may need to be shown again the visualization with IGV part if that’s possible and how to upload our bam files there if we have IGV downloaded in our PCs
- figured
- Great ;)

Q2: I do not find Sashimi Plot from the menu
A: It is not in the menu when you right click on the BAM file section (in IGV)?
Yes, in IGV, but where is the BAM file section? I did right-click
The middle one. Thanks, I found it

❓ Is the speed fine

Counting the number of reads per annotated gene

❓ Look at Fig.19. How many reads are found for the different exons?
❓ Look at Fig.19. How many reads are found for the different genes?

Estimation of the strandness

Counting reads per gene

Follow STAR version of the protocol

✏️ Hands-on: Inspect STAR output
❓ Are you finished with this section? Add a ‘+’ below
❓ How many reads are unmapped/multi-mapped?
❓ At which line do the gene counts start?
❓ What are the different columns?
❓ Which columns are the most interesting for our dataset?
✏️ Hands-on: Reformatting STAR output
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: How should we proceed with STAR if Infer Experiment gives mixed results, some unstranded and others stranded?
A: A mixed result means you have an unstranded library. Within the tutorial you can expand a section which explains how to interpret your Infer Experiment results.

✏️ Hands-on: Getting gene length
❓ Are you finished with this section? Add a ‘+’ below
❓ Which feature has the most counts for both samples? (Hint: Use the Sort tool)

General questions

❓ General questions

Q1: Is there a possibility that we are sent the TIaaS link again?
A: If you registered once, you are assigned for the whole week (we can share it tomorrow morning)
- After the workshop, what can we do to have a faster process without TIaaS?
- You can read our nice PhD comics. Usually runs do not take that long. In a workshop many jobs run at once, which is much less the case in day-to-day work.

Q3: Can we use STAR for lncRNA expression? Is it suitable?
A: Yes, you can. If you have a specific experiment, you could also ask for feedback in the Galaxy community.

Summary

Feedback

❓ One thing that was good about today
❓ One thing to improve

Day 4 - Thursday

Table of Contents

  1. Welcome
  2. Repetition of the day before
  3. Tutorial
  4. Summary
  5. Feedback

Welcome

Repetition of the day before

❓ What do you remember from yesterday?
❓ Do you have a question from the day before?

Tutorial: Reference-based RNA-Seq data analysis Part 2: Analysis of the differential gene expression

✏️ Hands-on: Open the tutorial
  1. Go to Galaxy (done in previous Hands-on)
  2. Click on the hat in the top bar
  3. Navigate to the topic “Transcriptomics”
  4. Click on the tutorial “Reference-based RNA-Seq data analysis”
❓ Have you found the tutorial? (Add a + when done)

Identification of the differentially expressed features

✏️ Hands-on: Import all count files
❓ Are you finished with this section? Add a ‘+’ below

Shared history: https://usegalaxy.eu/u/berenice/h/ref-based-rna-seq---part-2---070923

Your questions

Q1:Should I add them as datasets or as data collection?
A: Add them as datasets.

Q2: How can we automate it?
A: https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html
In the Galaxy file uploader, please choose the “collection” tab instead of “regular” to automatically create a collection of the datasets to be uploaded.
For more details about data uploading in Galaxy, have a look at: https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/upload-rules/tutorial.html

Q3: If I look into my GSM files in the collection, they look different from the ones in your files. The first line is something with GeneID and the name of the GSM file.
A: That is the header of the file. The actual data starts from the second line. --> That is clear, but why is this header not in your file? Will this create problems during the following steps? If you have a header, then use this option from the DESeq2 section of the tutorial: “Files have header?”: Yes --> okay

Q4: And should we also have technical replicates?
A: Biological replicates give you the real (biological) variance, so if you have the money, go for biological replicates. With technical replicates you can test your experimental setup (technically).

❓ Any questions regarding normalization?

Q1: So paired-end sequencing is preferred. Are there any drawbacks associated with it? (sample preparation, costs,…)
A: The cost is the main drawback, but regarding experimental preparation, requirements are not higher than when compared with single-end samples.

✏️ Hands-on: Add tags to your collection for each of these factors
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: So the most important thing is to have the correct names in the files initially? Or is there any way to change names automatically?
A: In general, it is not important what name you give to a file. It is more important that you have the correct tags. Since we are now extracting the tags from the names, in this case it is important that the names follow the correct pattern for the pattern recognition.

Q2: Is there any other way to add tags (apart from manually)?
A: For any datasets, you can manually add tags by clicking just below their names in a history and using tag names such as #tag-name. But, for adding tags for items in a collection, you will have to follow the approach mentioned in the tutorial. Otherwise, it would be difficult to add tags for 100s of items in a collection manually.

Q3: Does it matter if the tags are in a different order? Sometimes it shows single and paired first and sometimes treated and untreated.
A: No, it should not matter. Entities single-treated and treated-single mean the same thing.

Q4: These tags are the ones Galaxy will use to show me the results for the differential expression, right?
A: Yes, basically DESeq2 uses the tags internally to organize the samples into groups.

✏️ Hands-on: Determine differentially expressed features
❓ Are you finished with this section? Add a ‘+’ below
❓ Are you back?
❓ Any questions regarding what we did until now?

Q1: I have a general question: if we have the counts in one table, is there any way to split that table into a format recognisable by DESeq2?
A: Could you provide more details? How did you generate the table?
In my case, I have all the counts directly in one table provided by our RNA facility. They calculated them using STAR.
A: Probably they used software similar to featureCounts, but it is important to know which specific method was used to generate the table. Otherwise it is not possible to guarantee that the statistical analysis performed by DESeq2 provides meaningful information (due, for example, to different quantification measures: RPKM, FPKM, TPM, etc.). DESeq2 requires normalized counts.

Does DESeq2 require normalized counts? We used the raw counts from STAR (also here in our tutorial), or did I not understand correctly?
A: You are right, sorry. DESeq2 takes raw counts and performs the normalization internally.
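To give an idea of what "normalizes internally" means, here is a simplified sketch of the median-of-ratios idea that DESeq2 uses to compute one size factor per sample. It is not the DESeq2 code itself, which does much more (dispersion estimation, statistical testing). The sample names and counts are made up for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped (None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor of a sample = median ratio of its counts to the geometric means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical raw counts: sample_B was sequenced about twice as deep as sample_A.
raw = {"sample_A": [10, 20, 30, 0], "sample_B": [20, 40, 60, 5]}
sf = size_factors(raw)
normalized = {s: [c / sf[s] for c in raw[s]] for s in raw}
print(sf)
print(normalized)
```

After dividing by the size factors, the first three genes have the same normalized count in both samples, so the difference in sequencing depth no longer looks like a difference in expression.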

What format did the sequencing facility provide you with?
They performed quality control, mapping and counting using STAR, so they return an Excel table with the raw counts per sample. For example, in edgeR you can use this table as an input, so I was just wondering if that is possible in the DESeq2 tool. I have never performed the conversion, but I know that technically it is feasible.
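One possible way to do this conversion outside Galaxy is sketched below. It assumes the Excel table was first exported as tab-separated text with the gene identifier in the first column and one column of raw counts per sample. The sample names and counts are invented for the example, and the function name is ours.

```python
import csv
import io


def split_count_table(table_text, delimiter="\t"):
    """Split one gene x sample count table into one (gene, count) list per sample."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    header = next(reader)
    sample_names = header[1:]  # first column holds the gene identifiers
    per_sample = {name: [] for name in sample_names}
    for row in reader:
        if not row:
            continue  # skip empty lines
        gene_id = row[0]
        for name, value in zip(sample_names, row[1:]):
            per_sample[name].append((gene_id, int(value)))
    return per_sample


# Hypothetical table with two samples and two genes.
example = (
    "GeneID\tuntreated_1\ttreated_1\n"
    "FBgn0000003\t0\t3\n"
    "FBgn0000008\t92\t77\n"
)
tables = split_count_table(example)
for sample, rows in tables.items():
    print(sample)
    for gene_id, count in rows:
        print(f"{gene_id}\t{count}")
```

Each per-sample list can then be written to its own two-column file and uploaded to Galaxy like the count files used in the tutorial.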

Q2: Could you provide more details on why to use technical replicates?

A: Technical replicates help to estimate the technical variation of an experiment. By measuring the same sample multiple times, researchers can see how much of the variation in the data comes from the experimental procedure itself. This helps in distinguishing true biological changes from technical noise in the data.

Q3: If we have multiple conditions, how are the results affected if we calculate them separately, compared to putting all the parameters into DESeq2 and calculating all the conditions in parallel?

A: By analyzing them together, you can analyze how the factors interact with each other. This way, you find the genes that are differentially expressed between multiple factors.

Q4: It seems that DESeq2 with TPM can resolve the problem of different sequencing facilities and sequencing depths, so can we use it to compare the same condition across different data sources (like my own data and public data)? Yes, both use the same method and the same tissue source (cancer type, for example). I would also like to know how TCGA/METABRIC or other public datasets handled this in their collaborations.
Do we need any tools to correct the batch effect, or is DESeq2 alone enough? If batch effect removal (such as ComBat-seq or limma::removeBatchEffect) is needed, is it best applied before or after the DESeq2 analysis?
A: You would like to compare your own results with public data, both generated with the same method?
If you have exactly the same experiment, just done in a different lab, you could analyze the data together but enter the ‘location’ of preparation as a factor. DESeq2 will then factor out the variability that comes from it, e.g. different people preparing the samples (batch effects).

Q5: Is it possible to change the appearance of the plots?
A: You could change the alpha value for the MA-plot. The plots are generated by DESeq2.

If you feel comfortable with R, you could also contribute to the Galaxy tool wrapper https://github.com/galaxyproject/tools-iuc/blob/main/tools/deseq2/deseq2.R :)

❓ Is the speed fine
❓ What is the first dimension (PC1) separating?
❓ And the second dimension (PC2)?
❓ What can we conclude about the DESeq design (factors, levels) we choose?
❓ How are the samples grouped?
❓ Is the FBgn0003360 gene differentially expressed because of the treatment? If yes, how much?
❓ Is the Pasilla gene (ps, FBgn0261552) downregulated by the RNAi treatment?
❓ We could also hypothetically be interested in the effect of the sequencing (or other secondary factors in other cases). How would we know the differentially expressed genes because of sequencing type?
❓ We would like to analyze the interaction between the treatment and the sequencing. How could we do that?
✏️ Hands-on: Annotation of the DESeq2 results
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Is there a way to not lose the column names from the DESeq2 result table after annotating?
A: The column names will be re-added in the next step

Q2: My order after the annotation is not exactly the same. At least the first one is a different gene
A: Do the overall number of genes match in your dataset? Annotating is just an extension of the previous datasets generated by DESeq2.

Q3: So, over-expressed means up-regulated, i.e. those with a positive fold change? And down-regulated are those with a negative fold change? What about those with the value zero (0)?
A: Yes, over-expressed genes are up-regulated, which means the genes are more active (positive log2 fold change). A negative log2 fold change shows down-regulation (less active). A value of 0 denotes no change, neither up- nor down-regulation.
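The sign convention can be illustrated with a small sketch. It only shows the plain log2 ratio of two mean expression values. DESeq2 estimates the fold change from its statistical model, and whether a change is significant is decided by the adjusted p-value, not by the sign alone. The function names and numbers are ours.

```python
import math


def log2_fold_change(treated_mean, untreated_mean):
    """Plain log2 ratio of two (non-zero) mean expression values."""
    return math.log2(treated_mean / untreated_mean)


def direction(log2fc):
    """Translate the sign of a log2 fold change into words."""
    if log2fc > 0:
        return "up-regulated"
    if log2fc < 0:
        return "down-regulated"
    return "no change"


# Hypothetical mean normalized counts (treated, untreated).
for treated, untreated in [(200, 100), (50, 100), (100, 100)]:
    lfc = log2_fold_change(treated, untreated)
    print(f"treated={treated} untreated={untreated} log2FC={lfc:+.1f} -> {direction(lfc)}")
```

A log2 fold change of +1 means the expression doubled, -1 means it halved.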

❓ Where is the most over-expressed gene located?
❓ What is the name of the gene?
❓ Where is the Pasilla gene located (FBgn0261552)?
✏️ Hands-on: Add column names
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: The output of the annotate tool is always the same, right? So if we run DESeq2 on our own data, we can still add the same header?

A: Yes, you can copy the header from this tutorial and repeat the “Add column names” step for your own data.

✏️ Hands-on: Extract the most differentially expressed genes
❓ Are you finished with this section? Add a ‘+’ below
❓ How many genes have a significant change in gene expression between these conditions?
❓ How many genes have been conserved?
❓ Can the Pasilla gene (ps, FBgn0261552) be found in this table?
❓ Are you back?
❓ Any questions regarding what we did until now?

Q1: Why do we get different results above? The same data, the same tools and, I hope, the same options, but the results are different…
A: Could you check your settings again? If you cannot find your error, you can also share the history with teresa-m@t-online.de so we can have a look. Thanks!
AA: Do you use the p-value or the adjusted p-value? Maybe a potential source of confusion?
AAA: The differences between the Zenodo and the shared data files could also be your issue.

Q2: Based on the question in the DESeq2 section, I ran it again combining the tags treated and PE, untreated and PE, treated and SE, and untreated and SE, but ended up with an error. Was it just because we didn’t have enough replicates? Is it feasible to combine 2 tags in one factor level?
A: It is required that you have at least 1 replicate (i.e. >=2 inputs per condition). If you don’t have enough observations (i.e. separate FASTQs), you can reduce the number of factor levels in your model so that the intra-group variation is calculable. E.g. Germans, Danish, Swedish, Chinese, Japanese, Korean -> Europeans, East Asians. This simplified model will inevitably have a lower resolving power.

Q3: I have another general question: if we want to do an analysis and we are not sure how to perform it in Galaxy, or which tool or series of tools to use, what would you suggest?
A: You can ask how to implement a specific analysis in the Galaxy Help forum

✏️ Hands-on: Extract the normalized counts of the most differentially expressed genes
❓ Are you finished with this section? Add a ‘+’ below
✏️ Hands-on: Plot the heatmap of the normalized counts of these genes for the samples
❓ Are you finished with this section? Add a ‘+’ below
✏️ Hands-on: Plot the Z-score of the most differentially expressed genes
❓ Are you finished with this section? Add a ‘+’ below

I am reviewing my last steps but I have an error when I try to generate my heat maps

loc <- S

Your questions

Q1: If we want to add the gene names, how can we add this information?
A: Yes, you can use “Labeling columns and rows” option in the “heatmap2” Galaxy tool.

Q2: Is there a possibility to modify and edit the heatmaps? If you have, for example, a lot of samples, the heatmap can be really crowded.
A: Yes, this is generated by an R script, and the possibilities are: a) If you know some programming, you can edit the script in RStudio and optimise it for your needs. You could try to modify the script from here and launch an interactive tool, for instance: https://github.com/galaxyproject/tools-iuc/blob/main/tools/heatmap2/heatmap2.xml b) Some simple things can be done before feeding the data into the tool, such as filtering the list manually. The output of heatmap2 is unfortunately a static image (png) and it is not possible to manipulate it interactively.

To elaborate on a): Galaxy offers several interactive tools such as Jupyter notebooks and RStudio (such as https://usegalaxy.eu/?tool_id=interactive_tool_jupyter_notebook&version=latest). Using such tools, you can directly import your tables from a Galaxy history and create your customized plots using packages such as Bokeh, Seaborn or Matplotlib.

Gene Ontology analysis

✏️ Hands-on: Prepare the first dataset for goseq
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q1: Could you repeat what the float means?
A: A float is a number with a decimal part, such as 4.05 (here with a precision of up to 2 digits after the decimal point), while an int is a whole number such as 4 (with no decimal part). Float and int are data types.
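The difference matters because a column that is treated as text is compared character by character, not by value. A small sketch in Python (the values are just examples, not taken from the tutorial data):

```python
value_as_text = "4.05"

as_float = float(value_as_text)  # number with a decimal part
as_int = int(as_float)           # whole number, the decimal part is dropped

print(type(as_float).__name__, as_float)  # float 4.05
print(type(as_int).__name__, as_int)      # int 4

# Comparing numbers stored as text gives the wrong answer for a p-value filter:
print(float("1e-5") < 0.05)  # True: numeric comparison
print("1e-5" < "0.05")       # False: text comparison, character by character
```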

Q2: We changed the IDs to uppercase because of the information given by the tool, right?
A: Yes, using the “Change Case” tool. It is required for the tool to work properly.

✏️ Hands-on: Prepare the gene length file
✏️ Hands-on: Perform GO analysis
Your questions

Q1: I did not fully understand why the GO term analysis requires the gene length if we directly specify whether something is significantly regulated or not?
A: Depending on the length of the gene, you could statistically have more reads mapping to it. Therefore you need to take the length of the gene into account for the GO term analysis as well.

Q2: Is it possible to generate a report that contains all the graphs from the analyses that we have done?
A: Unfortunately, there is no tool at the moment to generate all plots in a report.

Q3: Can we create heatmaps only for the DEGs? In that case, will we only need to filter the normalised data based on that information?
A: Not sure if I understood your question. You can plot a heatmap with whatever data you want, you just need to create the corresponding input correctly.

Q4:If we use goseq tool in Galaxy and the GO analysis from the website, I imagine that we expect some differences?

A: If you use the same versions of tools in Galaxy and its original software on the same dataset, I would not expect differences.

Summary

Feedback

Thanks!!

❓ One thing that was good about today
❓ One thing to improve
❓ Any other comments?

Day 5 - Friday

Table of Contents

  1. Welcome
  2. Questions on the first 4 days
  3. Icebreaker
  4. Tutorial
  5. Summary
  6. Feedback

Welcome

Questions on the first 4 days

❓ Do you have a question from the days before?

Q1: Can we repeat the last step from yesterday (the part on how to get the gene length file for goseq)? That was a bit fast at the end – yes, thank you :)
A: You can toggle the “Create gene-length file” option in featureCounts (or an equivalent tool) to get it reported as a separate file.
AA: There are two possibilities to get the gene lengths. The first one is as described above: select this output within the featureCounts tool if you used it for quantification. Since we used the gene counts from the STAR output, we had to use a tool called Gene length and GC content. Here the annotation (GTF file) is used to create a table containing the gene lengths, which is needed for the goseq analysis.
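To show the idea of deriving gene lengths from an annotation, here is a minimal sketch. It assumes one common definition (the total length of all exons of a gene after merging overlaps), which is not necessarily what the Galaxy tool computes. The GTF lines are invented for the example.

```python
import re
from collections import defaultdict

GENE_ID_PATTERN = re.compile(r'gene_id "([^"]+)"')


def gene_lengths_from_gtf(gtf_lines):
    """Sum the merged exon lengths per gene from GTF lines."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to the length
        match = GENE_ID_PATTERN.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        current_start, current_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= current_end + 1:
                # Overlapping or adjacent exon: extend the current block.
                current_end = max(current_end, end)
            else:
                # GTF coordinates are 1-based and inclusive, hence the +1.
                total += current_end - current_start + 1
                current_start, current_end = start, end
        total += current_end - current_start + 1
        lengths[gene_id] = total
    return lengths


# Hypothetical gene with two overlapping exons and one separate exon.
example_gtf = [
    'chr2L\texample\texon\t100\t199\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr2L\texample\texon\t150\t249\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.2";',
    'chr2L\texample\texon\t400\t449\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
]
print(gene_lengths_from_gtf(example_gtf))  # {'geneA': 200}
```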

Icebreaker

❓ What is your most used emoji?

Now that you became a Markdown expert, you can use emojis too! Just go here and copy the one you use most below. :wink:

Bioinformatics Data Types and Databases

❓ Any questions regarding this introduction?

Q1: FASTQ. The first line contains some metadata
A: Is this a question? Could you be more specific?+
AA: To clarify: For .fastq, the data is organised as 4-line blocks. The first line of each 4-line block contains some metadata describing the block (i.e. 1 read). Ok
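The 4-line structure can be read with a few lines of code. This is a minimal sketch that assumes an uncompressed FASTQ with exactly 4 lines per read and Phred+33 quality encoding. The read shown is invented.

```python
def parse_fastq(lines):
    """Group FASTQ lines into records of 4 lines each."""
    lines = [line.rstrip("\n") for line in lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ block starting at line {i + 1}")
        records.append({
            "id": header[1:],                               # line 1: metadata
            "sequence": sequence,                           # line 2: bases
            "quality": [ord(c) - 33 for c in quality],      # line 4: Phred scores
        })
    return records


example = ["@read_1", "GATTACA", "+", "IIIIII#"]
records = parse_fastq(example)
print(records[0]["id"], records[0]["sequence"], records[0]["quality"])
```

Here `I` decodes to a Phred score of 40 and `#` to a score of 2.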

Q2: Are the biological data formats unified across different databases or do different databases use their own specific formats?
A: They are portable to a certain degree, especially the commonly used ones. However, it does happen that, say, the chromosome naming conventions differ (1, chr1, chrI, Chr1, …), while the file format is the same per se. Hiccups do happen.
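As a small illustration of such a hiccup, the sketch below harmonises the `chr` prefix of chromosome names. It is only a partial fix: it does not translate Roman numerals (chrI) or mitochondrial names (MT vs. chrM), for which you would need a mapping table. The function name is ours.

```python
def to_chr_prefixed(name):
    """Return the chromosome name with a lowercase 'chr' prefix."""
    if name.lower().startswith("chr"):
        return "chr" + name[3:]  # normalise 'Chr1' / 'CHR1' to 'chr1'
    return "chr" + name          # add the prefix to bare names like '1' or 'X'


for name in ["1", "chr1", "Chr1", "X", "chrI"]:
    print(name, "->", to_chr_prefixed(name))
```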

Q3: Is there a difference between bed and wig/bigwig files?
A: BedGraph example: http://genome.ucsc.edu/goldenPath/help/bedgraph.html
A: Wig example: http://genome.ucsc.edu/goldenPath/help/wiggle.html
AA: https://genome.ucsc.edu/goldenPath/help/bigWig.html (seems binary, whereas bed is essentially a plain text tsv)
A: These types of data can be converted to each other, but they “occupy a different amount of space” due to how they are designed

Q4: Can you talk about graph databases? Not about graphics, but about graph nodes
References: https://en.wikipedia.org/wiki/Graph_database
I am working with Neo4j, but I am interested to know whether this or another graph database is used specifically in bioinformatics. If yes, which one?
A: What type of data are you working with? Indexing research papers. At the moment, I am indexing metadata in Solr and exporting it to Neo4j to navigate it
A: All right, most of the graph data formats that exist in bioinformatics (e.g. the data in GO) refer to biological entities and not to literature information. I am not aware of any file format developed in particular for biological literature information, but maybe it’s worth having a look at Europe PMC (https://europepmc.org/)
Is GO this? https://en.wikipedia.org/wiki/Gene_Ontology

Q5: Do you know whether ChatGPT/BioGPT can already access the biological data in databases, or do you think it could be useful in the near future? (How?)
A: I don’t know, but I think they don’t scan raw biological data for their training. It would take a lot of resources to include all known genomes, for instance. However, I am aware of research projects that re-purpose these underlying algorithms with a training set that consists exclusively of biological datasets.
A: i think this could be relevant: https://www.sib.swiss/news/bringing-meaning-to-biological-data-knowledge-graphs-meet-chatgpt

Q6: A graph database is not only useful for literature. It can also help to navigate through proteins and structures, but I do not know if something like this exists.

Box to add for every break

Let’s come back at 10:00 (CEST)

❓Are you back?

❓ Is the speed fine

Tutorial: One gene across biological resources and formats

✏️ Hands-on: Open the Genome data viewer

https://www.ncbi.nlm.nih.gov/genome/gdv

Searching Human Opsins

✏️ Hands-on: Searching Human Opsins
❓ Are you finished with this section? Add a ‘+’ below

In the Genome Data Viewer:

❓ How many hits did you find in Chromosome X?
❓ How many are protein coding genes?
✏️ Hands-on: Open Genome Browser for gene OPN1LW

In the Genome Browser

❓ Are you finished with this section? Add a ‘+’ below
❓ What is the location of the OPN1LW segment?
❓ What are introns and exons?
❓ How many exons and introns are in the OPN1LW gene?
❓ What is the length of the protein in number of amino acids?
✏️ Hands-on: Open Genome Browser for gene OPN1LW
❓ Are you finished with this section? Add a ‘+’ below
❓ What is the first AA of our protein product?

Finding more information about our gene

✏️ Hands-on: Go to a specific position in Sequence View

Start with the NCBI search.

❓ Are you finished with this section? Add a ‘+’ below
❓ Can you guess which type of conditions are associated to this gene?
✏️ Hands-on: Open OMIM and Read as much as your interest dictates
❓ Are you finished with this section? Add a ‘+’ below
❓ What is the clinical significance of the rs5986963 and rs5986964? Any difference with the functional consequence of rs104894912? And what is the functional consequence of rs104894913?
✏️ Hands-on: Open Protein

Back to the NCBI search

❓ Are you finished with this section? Add a ‘+’ below
✏️ Hands-on: Download the protein sequences
❓ Are you finished with this section? Add a ‘+’ below
Is my pace ok?
❓ What does the folder contain?
❓ Do you think they implemented good data practices?

Searching by sequence

✏️ Hands-on: Search the protein sequence against all protein sequences
❓ Are you finished with this section? Add a ‘+’ below
Your questions

Q: I copied the sequence and ran the search, but I get 0 results with this. I will try again, but there may be a problem when I copy this content.

NP_064445.2 OPN1LW [organism=Homo sapiens] [GeneID=5956]
MAQQWSLQRLAGRHPQDSYEDSTQSSIFTYTNSNSTRGPFEGPNYHIAPRWVYHLTSVWMIFVVTASVFT
NGLVLAATMKFKKLRHPLNWILVNLAVADLAETVIASTISIVNQVSGYFVLGHPMCVLEGYTVSLCGITG
LWSLAIISWERWMVVCKPFGNVRFDAKLAIVGIAFSWIWAAVWTAPPIFGWSRYWPHGLKTSCGPDVFSG
SSYPGVQSYMIVLMVTCCIIPLAIIMLCYLQVWLAIRAVAKQQKESESTQKAEKEVTRMVVVMIFAYCVC
WGPYTFFACFAAANPGYAFHPLMAALPAYFAKSATIYNPVIYVFMNRQFRNCILQLFGKKVDDGSELSSA
SKTEVSSVSSVSPA

A: I copy-pasted your sequence, and it returned many hits.

It finally ran: I repeated the same steps and it returned results.

In BLAST:

✏️ Hands-on: Graphic Summary of the protein sequences
❓ Are you finished with this section? Add a ‘+’ below
❓ What is the first hit? Is it expected?
❓ What are the other hits? For which organisms?
❓ Are you finished with this section? Add a ‘+’ below

More information about our protein

✏️ Hands-on: Searching and Open results on UniProt
❓ Are you finished with this section? Add a ‘+’ below

Summary

❓ Are you back?
❓ Any questions regarding what we did until now?

Q1: Why if I use filter Human, the results is different (5) instead exactly human OPN1LW 7+
But if you use filter Human (left) Ok, it is clear.
A: I checked for “human” in the bovine record, does not appear anywhere visible. Maybe a hidden field?

Of course, it depends on how the aggregate field used for the search is configured.

❓ Is the speed fine

Tutorial: One protein along the UniProt page

✏️ Hands-on: Search for Human opsin on UniProtKB

In the UniProt

❓ Are you finished with this section? Add a ‘+’ below
✏️ Hands-on: Open a result on UniProt

In the P04000 entry page:

❓ Are you finished with this section? Add a ‘+’ below

Entry

❓ What are the available formats in the Download drop-down menu?
❓ What type of information would we download through these file formats?

Names and Taxonomy

❓ What is the taxonomic identifier associated with this protein?
❓ What is the proteome identifier associated with this protein?

Subcellular location

❓ Where is our protein in the cell?
❓ Is it coherent with the GO annotation observed before?
❓ How many Transmembrane domains and Topological domains are there?

Disease & Variants

❓ What types of scientific studies allow us to assess the association of a genetic variant with a disease?

PTM/Processing

❓ What are Post-translational modifications for our protein?
✏️ Hands-on: Open a result on UniProt

Search for Human OPN1LW on STRING DB
In the STRING page:

❓ Are you finished with this section? Add a ‘+’ below
❓ How many different file formats can you download from there?
❓ What kind of information will be conveyed in each file?

Structure

Back to the P04000 entry page

❓ What is the variant associated with colorblindness?
❓ Can you find that specific amino acid in the structure?
❓ Can you formulate a guess as to why this mutation is disruptive?

❓ Questions

Q1: Will you be issuing certificates?
A: Please contact us if you want to have a certificate (contact@biont-training.eu)

Q2:Is it possible to receive a certificate?
A: Please contact us if you want to have a certificate (contact@biont-training.eu)

Summary

Feedback

Q1: Your contact details (email) please.
A: contact@biont-training.eu
Or check our homepage: http://biont-training.eu/

❓ One thing that was good about today
❓ One thing to improve
❓ Any other comments?

Survey: https://survey.bio-it.embl.de/678593?lang=en