# HISTORY - A practical introduction to bioinformatics and RNA-seq using Galaxy - 04-09-2023

## General information

**Date**: 04-09-2023
**Time**: 09:00 - 13:00
**Location**: Online ([Zoom](https://epfl.zoom.us/j/63351787229?pwd=RmM5c2RzSzVrTmswb2ludHpJOUptQT09))
**Code of Conduct**: [coc](https://galaxyproject.org/community/coc/)

### Before

Introduction: who is who?

Introduction to BioNT

### Schedule for the workshop

| Day | Tutorial | Instructor |
| -------- | -------- | -------- |
| 1 - Monday | [From peaks to genes](https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-peaks2genes/tutorial.html) | Teresa |
| 2 - Tuesday | [QC](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) | Bérénice |
| 2 - Tuesday | [Mapping](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html) | Bérénice |
| 3 - Wednesday | [Reference-based RNA-seq](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html) - Part I | Teresa |
| 4 - Thursday | [Reference-based RNA-seq](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html) - Part II | Bérénice |
| 5 - Friday | [Learning about one gene across biological resources and formats](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/online-resources-gene/tutorial.html) | Lisanna |
| 5 - Friday | [One protein along the UniProt page](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/online-resources-protein/tutorial.html) | Lisanna |

### How will the workshop be run?

Why a specific setup for this workshop?

- Bridging Academia and SME
- Importance of privacy

How will we do it?
- Zoom with panel view - Panelist: instructors & helpers - Only trainers will be visible - No personal data will be displayed - This [HedgeDoc](https://biont.biobyte.de/AZptJADHQBusn0t6tpBL5w?both#) document in Markdown for interactions - Markdown: lightweight markup language - [Documentation](https://biont.biobyte.de/features#Edit) ### How to participate? **Ask your questions, raise issues, interact with us in this Document** In addition, to help you navigate this document, we followed the structure of the tutorial and included: - Each Hands-on section (✏️ - where you will have to work) of the tutorial, including a part to ask questions or post issues you might face :::warning ✏️ Hands-on: Topic ##### ❓ Are you finished with this section? Add a '+' below - Yes: ++ - Waiting for the job to be done: - Need help: + https://github.com/fastqe/fastqe#scale ##### Your questions Q1: I don't markdown? A: ##### Do you need help? Please describe your issue - ::: A helper will help you - Question sections (❓ - where we ask you something ) for answering :::success ❓ We have a question: What colour has the sky today? - ::: Let's try now! :::warning ##### ✏️ Hands-on: Set you up - Access this HedgeDoc main document: rb.gy/1c9pc - Fill the following questions ##### ❓ Are you on this HedgeDoc? (Add a + when done) - Yes: +++++++++++++++++++++++++ - No (please sent a e-mail to one of the helpers) ::: :::success ##### ❓ Have you ever used Markdown? (Add a +) - Yes ++++++ - No ++++++++++=++++++ - What is Markdown?+++++ ::: ### Using Galaxy for the workshop Why do we use Galaxy this week? :::success ##### ❓ Have you ever used Galaxy? 
(Add a +) - Yes++++++++++++ - No ++++++++++++ - What is Galaxy?++ ::: We will use Galaxy **Europe** and its training infrastructure (called TIaaS) :::warning ##### ✏️ Hands-on: Register and log in on Galaxy Europe - Register on [Galaxy Europe](https://usegalaxy.eu/) - Connect to Galaxy Europe using the [TIaaS link](https://usegalaxy.eu/join-training/biont-bioinfo-23) ##### ❓ Did you register on [Galaxy Europe](https://usegalaxy.eu/)? (Add a + when done) - Yes+++++++++++++++ - No ##### ❓ Did you connect using the [TIaaS link](https://usegalaxy.eu/join-training/biont-bioinfo-23)? (Add a + when done) - Yes++++++++++++++++++++++ - No++++ - Do you need help? ::: ## Day 1 - Monday ### Table of Contents 1. [Icebreaker](#Icebreaker) 2. [Galaxy Introduction](#Galaxy-Introduction) 3. [Tutorial: From peaks to genes](#Tutorial-From-peaks-to-genes-httpstraininggalaxyprojectorgtraining-materialtopicsintroductiontutorialsgalaxyintropeaks2genestutorialhtml) - [Pretreatments](#Pretreatments) - [Part 1 Naive approach](#Part-1-Naive-approach) - [Part 2 More sophisticated approach](#Part-2-More-sophisticated-approach) - [Share your work](#Share-your-work) 4. [Summary](#Summary) 5. [Feedback](#Feedback) ### Icebreaker :::success ##### ❓ Tell us about a recent *First* in your life. This could be big or small, perhaps you bought a house for the first time or you tried a new restaurant in your city. Recent can be any time in the past year. 
- Got Master in Infection Biology Degree - Visited czech city Znojmo - writing thesis - First trip with a racing bike - I drank Kwass (russian drink) by first time - I tried to use inliner/skaters for the first time - Bought a Jacuzzi0 - First visit to Sicily - Went alone on a city trip - hedgeDoc first time - Bought my first house - Saw "Breaking Bad" - Visited the Danube delta - Travel to Galapagos - My daughter just got 3 - went on a boat trip - got married - Presented a poster on a conference - Graduated as a Microbiologist ::: ### Galaxy Introduction - [Slides](https://training.galaxyproject.org/training-material/topics/introduction/tutorials/introduction/slides.html#1) :::success ##### ❓ Any questions regarding this introduction? Q1: What about data protection? A: In this link you can find some information about that https://galaxyproject.org/learn/privacy-features/ Q2: Will we work with Galaxy EU or org? A: Galaxy Europe for this workshop Q3:Is a commercial resource assignment service offered to speed up analytics? Not only for teaching purposes A: Indeed Galaxy Europe/ORG are not only dedicated to teaching, but are used by researchers all around the world to perform their analysis. However, multiple alternatives (private/public) instances are also available. There's also a public server hosted in Australia for example (usegalaxy.org.au) Q4:how we visualise history as a workflow? A: You will get an introduction to that with the tutorial later today - you can simply export a workflow from the history - Q5: Will the data sets for this workshop be made available for those that want to follow up with the hands-on training? A: Yes all data sets are made available, not only for this workshop but also after Q6: Technical, question, about If I run a local galaxy server instance. Is it connected to the others galaxy instances? A: Not, each server operates independently. Q7:Can I install Galaxy on my local machine? A: Yes you can. 
The Galaxy Training Network (GTN) provides multiple trainings dedicated to the configuration of your own Galaxy server https://training.galaxyproject.org/training-material/topics/admin/ Q8:Are we getting a certificate A: we will provide the certificate once you have finished the post-workshop survey in the last day, and you let the organisers know that you would be interested in getting a certificate ::: :::success ##### ❓ Are you back? - Yes+++++++++++++++++++++++++++++ - No ::: ### Tutorial: [From peaks to genes](https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-peaks2genes/tutorial.html) #### Pretreatments :::warning ##### ✏️ Hands-on: Open Galaxy ##### ❓ Are you finished with this section? Add a '+' below - Yes: ++++++++++++++++++++++++ - Need help: ##### Your questions ::: :::warning ##### ✏️ Hands-on: Open the tutorial 1. Go to [Galaxy](https://usegalaxy.eu/) (done in previous Hands-on) 2. Click on the hat in the top bar 3. Navigate to the topic "Introduction to Galaxy Analyses" 4. Click on the tutorial "From peaks to genes" ##### ❓ Have you found the tutorial? (Add a + when done) - Yes: ++++++++++++++++++++++++++ - No: ##### Your questions Q1: https://usegalaxy.eu/libraries/folders/F596c752a08d6a88c/page/1 does not contain the tutorial ? Where could I find the tutorial itself ? ok thanks. Found it A: It countains the data for the tutorial, not the tutorial itself. You can find the tutorial as explained above: 1. Go to [Galaxy](https://usegalaxy.eu/) (done in previous Hands-on) 2. Click on the hat in the top bar 3. Navigate to the topic "Introduction to Galaxy Analyses" 4. Click on the tutorial "From peaks to genes" ::: :::warning ##### ✏️ Hands-on: Create history ##### ❓ Are you finished with this section? Add a '+' below - Yes: ++++++++++++++++++++++++++ - Need help: ##### Your questions Q1: is there naming restrictions for History? A: Not that I know :) I know that you can even use unicode emojis. 
But I am asking around to confirm.

Q2: About datatypes - no question. Link to share: https://training.galaxyproject.org/archive/2022-06-01/faqs/galaxy/datatypes_understanding_datatypes.html
A: Thanks!
:::

:::warning
##### ✏️ Hands-on: Data upload

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++++++
- Waiting for the job to be done: +++
- Need help:

##### Your questions
Q1: When we use the interval datatype, are the first 3 columns always known, and are they always chrom, start and end?
A: Yes, they should be. You can find additional information about how the different datatypes are defined in Galaxy here: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/config/sample/datatypes_conf.xml.sample

Q2: What is the difference between "assign datatype" and "target datatype"?
A: The "assign datatype" option declares the format the dataset is already in, while "target datatype" converts the dataset to a different format.
:::

:::warning
##### ✏️ Hands-on: Inspect and edit attributes of a file

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++++++++
- Need help:

##### Your questions
Q1: If I want to construct my own analysis, is it possible to use a more recent reference genome, even though the mapping was done with a previous genome version?
A: No, this is not possible. The genome used in the mapping step should match the one used in the downstream analysis (note: one reason is that gene coordinates can change quite a lot between genome versions). But in Galaxy it is very easy to re-run the mapping with another reference genome version.

Q2:
A: Thanks!
:::

:::warning
##### ✏️ Hands-on: Data upload from UCSC

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++++++
- Waiting for the job to be done:
- Need help:
  - I am at the table browser. Besides changing to mouse, what other changes should we make?
    - Please read the tutorial; all parameters are described there :)
  - My genome ref has 55419 regions, which seems to be more than in the tutorial? => I was wrong in choosing the RefSeq. Thanks
    - Did you specify the region correctly? Also, please check that the selected assembly version is correct.

##### Your questions
Q1: Can we upload data from databases other than UCSC? Thanks!
A: Yes, you just need to download the datasets (or copy the URL from the database, e.g. GEO) and upload them (or paste the link) in Galaxy using the data uploader tool.

Q2: What do the names of the columns from 5 to the end mean in the 'genes' file?
A: Does this explanation of the BED file columns help: http://www.ensembl.org/info/website/upload/bed.html

Q3: Can you repeat the explanation about comparing two files?
A: To compare files in Galaxy, click the enable/disable window manager icon in the top bar. Once it turns yellow, choose a dataset from your history by clicking on its "eye" icon. This opens the chosen dataset in a new window rather than in the main Galaxy view.
:::

#### Part 1 Naive approach

:::warning
##### ✏️ Hands-on: View file content

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q: Cannot do side-by-side.
A: What is the issue you got? You probably need to click on the square icon next to the bell (upper part).
- I did, but cannot put two documents side by side.
- Could you try to move one of them by clicking on its top bar? You will need to drag the top window and place it on the other side to see the window opened beneath it.
- Thanks, I finally did it. The steps are: click on the square icon, then click on the eye button of each file that I want to see, then drag the window to another position.

Q: What does "tail" mean in the tool names?
A: Tail refers to the Unix tool "tail", which retrieves the end of files.
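
To illustrate the answer above, here is a minimal Python sketch of what a "tail"-style operation does: it keeps only the last n lines and leaves the input unchanged. This is not the code Galaxy runs; the function name `last_lines` and the example rows are invented for illustration.

```python
from collections import deque


def last_lines(lines, n=10):
    """Return the last n items of an iterable of lines (similar to Unix `tail -n`)."""
    # A deque with maxlen=n drops older items as new ones arrive,
    # so only the final n lines are kept in memory.
    return list(deque(lines, maxlen=n))


# Hypothetical rows of a tab-separated peak file (values invented for this example)
example = [
    "1\t100\t200",
    "2\t300\t400",
    "20\t500\t600",
    "21\t700\t800",
]

for row in last_lines(example, n=2):
    print(row)
```

Looking at the end of a file this way is only an inspection step: the original list (or dataset) still contains all rows.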
::: :::success ##### ❓ While the file from UCSC has labels for the columns, the peak file does not. Can you guess what the columns stand for? - chromosome, start, end positions, length of the peaks on the + strand and -strand - chromosome, start, end, peak lengths - chromosome no., start pos, end pos - first three columns. >> lengths on fourth column - Chromosome, start, end - chromosome, start, end positions and length. - chromosome, start coordinate, end coordinate, length, location ::: :::warning ##### ✏️ Hands-on: View end of file ##### ❓ Are you finished with this section? Add a '+' below + - Yes: ++++++++++++++++++++++ - Waiting for the job to be done: - Need help: ##### Your questions Q1:Why do we use the select last tool, means why do we want to cut the ends? Ok thanks! A: Select last tool is only to see how the end of the file looks like - it is not cutting something ::: :::success ##### ❓ How are the chromosomes named? - by number - by number - by number ##### ❓ How are the chromosomes X and Y named? - 20/21 - 20 and 21 - 20/21 - by number 20 and 21 ::: :::warning ##### ✏️ Hands-on: Adjust chromosome names ##### ❓ Are you finished with this section? Add a '+' below - Yes: +++++++++++++++ - Waiting for the job to be done: - Need help: ##### Your questions Q1:is there any other way to go to replace text? how to go to the replace-text gui? A: You can find it in the tool search bar under the name "Replace text in a specific column". Alternatively, you can click on the tool name icon, if you opened the tutorial inside the Galaxy interface. Q3:what is the purpose of &? A: `&` is a placeholder for the find result of the pattern search ::: :::success ##### ❓ How many regions are in our output file? You can click the name of the output to expand it and see the number. - 48647 - 48647 - 48647 - 48,647 - 48647 - 48,647 regions ::: Let's come back at 11:50 (CEST) :::success ##### ❓ Are you back? 
- Yes
- No
:::

#### Analysis

:::warning
##### ✏️ Hands-on: Add promoter region to gene records

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++++++++
- Waiting for the job to be done: +
- Need help:

##### Your questions
Q1: How do we familiarize ourselves with the tools?
A: To gain experience and identify the most important tools for each type of analysis, probably the best approach is to follow the trainings hosted in the Galaxy Training Network that relate to your topic of interest. If there is no training available for your specific research field, you can always request it from the Galaxy community :) You can post your request in the Gitter channel (https://matrix.to/#/#Galaxy-Training-Network_Lobby:gitter.im). Additionally, you can type your query in the tool search box in Galaxy to check whether the tools you are looking for are available, and you can use the tool recommendation feature hosted on Galaxy Europe to find further tools for extending your analysis. An overview of all available tools is here: https://usegalaxy-eu.github.io/tools.html
:::

:::warning
##### ✏️ Hands-on: Change format and database

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++++
- Waiting for the job to be done: ++
- Need help:

##### Do you need help? Please describe your issue
- It is unfortunately taking long for me.
  - Sorry, it can happen. Did you join the TIaaS?
  - Yes, it is done now :)
-
:::

:::warning
##### ✏️ Hands-on: Find Overlaps

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q1: The number of regions from intersect is quite high. Is that a usual result for a ChIP-seq analysis?
A: It depends strongly on the protein of interest and the experimental conditions of your analysis.

##### Do you need help?
Please describe your issue
- The table contains nothing when I view the results. There are table heads, but no rows below them.
  - A: Hi, did you check the parameters you chose against the tutorial? You can also share your history with Bérénice or Teresa.
  - Hi, I sent my history to the email you provided earlier.
- It is the same problem for me.
- I have the same issue.
  - Could you check if you are using the same genome assembly version/tool parameters?
:::

:::warning
##### ✏️ Hands-on: Count genes on different chromosomes

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++
- Waiting for the job to be done: +
- Need help:
:::

:::success
##### ❓ Which chromosome contained the highest number of target genes?
- chr11 2164
- chr11: 2164
- chr11 2164
- chr11: 2164
- chr11: 2164
- 11
- 11 with 2164
- chr11 2164
:::

#### Visualization

:::warning
##### ✏️ Hands-on: Fix sort order of gene counts table

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q1: My work is completed and fine, but I am wondering what exactly we sorted?
A: The gene counts, in descending order.

##### Do you need help? Please describe your issue
- I have more region overlaps than the tutorial (likely 34896 overlapping regions), and the grouping will show more on chr2 (3380).
:::

:::warning
##### ✏️ Hands-on: Draw barchart

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q1: What is the chromosome named Zero?
A: The chr0 in the mm10 genome assembly refers to a placeholder chromosome, which corresponds to unfinished or unplaced sequences.

Q2: How can I have all the chromosome names on the x axis?
A: You may need to extract a subset of the data to show gene names, as printing the names for a dataset with over 10,000 rows would make the plot look cumbersome.
You can extract a subset of data by filtering out on the basis of gene counts for example and then work on smaller subset. To have gene names, use Jupyter notebooks that might need a bit of programming to recalibrate/customize your plot. As alternative, you could make use of https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ggplot2_histogram/ggplot2_histogram/3.4.0+galaxy0 Q3:where can i find my saved visualisations? A: check in the top panel under "user" -> "visualizations" ::: #### Extracting workflow :::warning ##### ✏️ Hands-on: Extract workflow ##### ❓ Are you finished with this section? Add a '+' below - Yes: ++++++++++++++ - Waiting for the job to be done: - Need help: ::: #### Share your work :::warning ##### ✏️ Hands-on: Share history and workflow ##### ❓ Are you finished with this section? Add a '+' below - Yes: +++++++++ - Waiting for the job to be done: - Need help: ##### Your questions Q1 If we share the history with a user, can the person run any parts or the workflow? A: Could you provide some more details about the question? - After importing the history the user can use all datasets and run tools/workflows from it. - Yes, it is possible. When sharing a history, the user get rights for modifying the analysis as required.Thanks :) ::: #### General questions :::success ##### ❓ General questions regarding today Q1: Is it possible to get the questions and answers from this document? It maybe usefull also for future reference! A: Yes, it will be made available. Q2: Will we be able to access this HedgeDoc file after the course ends? Or maybe save it as a PDF file? A: We will share a document with everyting at the end Q3:After the workshop, will we have the same access to galaxy platform/tools even if we are not affiliated to any research institution/SME right now? 
A: Yes ::: ### Summary - first analysis in Galaxy - created a workflow - shared your results and methods with others ### Feedback :::success ##### ❓ One thing that was good about today - Nice tutorial! everyone could follow i guess - speed was okay - nice speed to follow all the steps - clear explanations of performed steps, real-time interactions :-) - Tutorial was easy to follow and perform steps on Galaxy - Very nice tutorial. Good idea to use HedgeDoc (I don't use this tool before). - HedgeDog is super cool! - nice tutorial and easy to follow - Glad to have successfully followed and practised albeit the instructor being fast - Excellent tutorial ##### ❓ One thing to improve - Some steps, require more time. - sometimes, it is not clear why a specific tool or step is needed - Sometimes my Workflow took longer than yours .. so i had to hurry up, but it was all in a good way - maybe some information on how to choose the tool like would there be alternatives or is there only the one we used ##### ❓ Any other comments? - Can you remember the member of group of teachers, who works at zbmed like me - Can you provide more details so that we can understand this question? Thank you! - In the partner staff for this course - , there is ZBMED and, I saw, at the beginning a person comes from there. - Yes ZB MED is a partner and I (Silvia) I am working in BioNT - Excellent, nice to contact to you. My name is Leonardo, and I am working with MAK Collection. - Glad you join as particant :) - Where can I find the recording from zoom session? - We will make them available as soon as possible. But they won't be available in the next days ::: ## Day 2 - Tuesday ### Table of Contents 1. [Welcome](#Welcome) 2. 4. [Repetition of the day before](#Repetition-of-the-day-before) 5. [Slides: Quality Control](#Slides-Quality-Control) 6. 
[Tutorial: Quality Control](#Tutorial-Quality-Control) - [Inspect a raw sequence file](#Inspect-a-raw-sequence-file) - [Assess quality with FASTQE 🧬😎 - short reads only](#Assess-quality-with-FASTQE:dna::sunglasses:-short-reads-only) - [Assess quality with FastQC - short & long reads](#Assess-quality-with-FastQC---short-&-long-reads) - [Trim and filter - short reads](#Trim-and-filter-short-reads) - [Processing multiple datasets](#Processing-multiple-datasets) - [Assess quality with Nanoplot - Long reads only](#Assess-quality-with-Nanoplot---Long-reads-only) - [Assess quality with PycoQC - Nanopore only](#Assess-quality-with-PycoQC---Nanopore-only) 7. [Slides: Mapping](#Slides-Mapping) 8. [Tutorial: Mapping](#Tutorial:Mapping) - [Prepare the data](#Prepare-the-data) - [Map reads on a reference genome](#Map-reads-on-a-reference-genome) - [Inspection of a BAM file](#Inspection-of-a-BAM-file) - [Visualization using a Genome Browser (IGV)](#Visualization-using-a-Genome-Browser-(IGV)) - [Visualization using a Genome Browser (JBrowse)](#Visualization-using-a-Genome-Browser-(JBrowse)) 9. [Summary](#Summary) 10. [Feedback](#Feedback) ### Welcome Today about quality control and mapping (foundation of HTS analysis) **Location**: Online ([Zoom](https://epfl.zoom.us/j/63351787229?pwd=RmM5c2RzSzVrTmswb2ludHpJOUptQT09)) ### Repetition of the day before :::success ##### ❓ What do you remember from yesterday? - Galaxy allows reproducible analysis as you can rerun and follow back your workflows and also share them with others - working with galaxy: upload data, some analysis tools, extract workflow and change it if needed, share history - uploading data to Galaxy,some usefull tools for data preparation and analysis,running workflow - Uplaoding data, some basic analysis tool, how to extract and edit workflows - One can use the naive approach inplace of UCSC Main - How to upload document, change properties and apply some operations to the file. 
##### ❓ Do you have a question from the day before?
- Is there a way to speed up the analysis on the Galaxy platform after the workshop?
  - Yes, you can use the generated workflows to re-run the analysis on a different set of inputs (the tools are then executed automatically, one after the other).
  - Thanks!
:::

### Slides: [Quality Control](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#1)

**Disclaimer**: We will not go through the full slide deck

:::success
##### ❓ Any questions regarding this introduction to Quality Control?
Q1: Should the number of characters in line 4 (quality) be the same as in line 2?
A: Yes, the number of characters in the DNA sequence (line 2) should equal the number of quality-encoding characters (line 4): one quality character for each nucleotide.

Q2: On this slide, the quality is very bad. Do we need to discard this data?
A: No. In the case of Oxford Nanopore reads, the average quality is usually quite low due to technical limitations. An alternative that you can use for Nanopore data is Nanoplot.

Q3: Where can I find more examples of good and bad average quality?
A: It depends strongly on the technology used for generating the data. I would recommend reading about the different sequencing technologies to learn the pros and cons of each of them. Apart from that, most GTN trainings include a QC step which provides meaningful examples. In addition to the GTN tutorials, you can also find examples in the FastQC [documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
:::

### Tutorial: [Quality Control](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html)

:::warning
##### ✏️ Hands-on: Open the tutorial
1. Go to [Galaxy](https://usegalaxy.eu/) (done in previous Hands-on)
2. Click on the hat in the top bar
3. Navigate to the topic "Sequence analysis"
4. 
Click on the tutorial "Quality Control"

##### ❓ Have you found the tutorial? (Add a + when done)
- Yes: +++++++++++++++++++++++++++++++++++
- No:

##### Your questions
Q1: I can only find the slides and data from yesterday, what am I doing wrong?
A: Please use this link to the QC tutorial: https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html
![](https://biont.biobyte.de/uploads/56f3eac8-0ff7-4ace-9513-d2bee7516df1.png)
- If you navigated to the 'Sequence analysis' section, you should see the line displayed above. Here you find the slide deck as well as the hands-on section. By clicking on the 'laptop' icon you open the tutorial.
:::

#### Inspect a raw sequence file

:::warning
##### ✏️ Hands-on: Data upload

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++ ++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q1: Is the third line (the one beginning with +) always empty?
A: Usually, yes. This line carries no additional information (it may optionally repeat the sequence identifier after the `+`), but it is required by the FASTQ format.
:::

:::warning
##### ✏️ Hands-on: Inspect the FASTQ file

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q1: Could you please explain again how you found the Phred score of 38 for the G? Thanks
A: Here you can find the equivalences: [Quality scores](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm). In short, the character G has ASCII code 71, and with the Phred+33 encoding the score is 71 - 33 = 38.
:::

:::success
##### ❓ Which ASCII character corresponds to the worst Phred score for Illumina 1.8+?
- 13 !
- !
- !
- !
- !
- !
- !
- !
- !
- !

##### ❓ What is the Phred quality score of the 3rd nucleotide of the 1st sequence?
- 38
- G
- G
- G
- G
- G
- G
- G
- G:39
- 39
- 38
- Ascii G with Phred or Q score 38

##### ❓ What is the accuracy of this 3rd nucleotide?
- almost 99.99%
- 99.9
- 99.9
- 99.99
- 99.99
- 99.99
- 99.99%
- almost 99.99%
- 99.99%
- 100- 0.00016 = 99,99984
:::

#### Assess quality with FASTQE 🧬😎 - short reads only

:::warning
##### ✏️ Hands-on: Quality check (FASTQE)

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++++++++
- Waiting for the job to be done: +
- Need help:
:::

:::success
##### ❓ What is the lowest mean score in this dataset?
- 13
- 13
- 13
- 13
- 13
- 13
- 13
- 13
- 😿
- 13
- 😿(13)
- 13
- 13
- 19/13
- 13
- 13 .
- 13
:::

#### Assess quality with FastQC - short & long reads

:::warning
##### ✏️ Hands-on: Quality check

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++
- Waiting for the job to be done: +
- Need help:

##### Your questions
Q1: What does the triangle with ! stand for?
A: Quality score 5. You can find the full documentation [here](https://github.com/fastqe/fastqe#scale)

Q2: Please can I have the data once more?
A: Sorry, what do you mean, getting the data again on Galaxy? You need to paste [this link](https://zenodo.org/record/3977236/files/female_oral2.fastq-4143.gz) in the **Uploader tool** -> **Paste/Fetch data**, and then click **Start**.

Q3: What happens if your Galaxy storage is 100 percent full?
A: You would need to free some space. If you perform an analysis that exceeds your quota, you should ask for more space. However, this cannot always be provided.
:::

:::success
##### ❓ Are you back?
- Yes++++++++++++++++++++++
- No

##### ❓ Any questions regarding what we did until now?
Q1: Regarding the per tile sequence quality, is it our fault if there was a mistake in the process?
A: If you get a different sequencing quality than the results shown in the session, please check your input and see if you changed one of the parameters by mistake. In general, if your data has bad sequencing quality, please check with the sequencing facility and with the lab that prepared the data to find the reasons for the bad quality.
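
As a recap of the Phred questions earlier in this session, here is a small Python sketch of the conversion from a FASTQ quality character to a Phred score and an accuracy. It assumes the Sanger / Illumina 1.8+ encoding (ASCII offset 33); the function names are invented for this example and do not come from any tool used in the tutorial.

```python
def phred_from_char(ch, offset=33):
    """Phred score of one quality character, assuming a Phred+33 encoding."""
    return ord(ch) - offset


def error_probability(q):
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def accuracy_percent(q):
    """Base call accuracy in percent: 100 * (1 - P)."""
    return 100 * (1 - error_probability(q))


for ch in ["!", "G"]:
    q = phred_from_char(ch)
    print(ch, q, f"{error_probability(q):.5f}", f"{accuracy_percent(q):.3f}%")
```

For the character G this gives a Phred score of 38, an error probability of about 0.00016 and therefore an accuracy of about 99.984%. The character ! gives a score of 0, the worst possible value.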
##### ❓ Is the speed fine - Yes:+++++++++++++ - Too slow: - Too fast: ++++++++++ ::: :::success ##### ❓ Which Phred encoding is used in the FASTQ file for these sequences? - Sanger / Illumina 1.9/ in the basic statistic box - Sanger / Illumina 1.9 (Encoding in basic statistics table) - Sanger Illumina 1.9 (Encoding) - Sanger / Illumina 1.9 (encoding in the basic statistics box) - Sanger / Illumina 1.9 - Encoding Sanger / Illumina 1.9 - Sanger / Illumina 1.9 - Sanger / Illumina 1.9 (Basic Statistics) - Illumina 1.9 - Sanger / Illumina 1.9 - Sanger / Illumina 1.9 - Sanger / Illumina 1.9 - Sanger / Illumina 1.9 ::: :::success ##### ❓ How does the mean quality score change along the sequence? - decreases - decreases - decreases from 110-114 bp reads - decreases from 110 - Starts to decrease somewhere from middle of the sequence - decreases - decreases from 110-114 - decreases from 110-114 on - decreases from 110-114 bp - decreases drastically around 100-110 bp - Decreases after 110-114 - start to decrease after 110-114 - decreases from base pair 110-114 ##### ❓ Is this tendency seen in all sequences? - no - from half till the end - no - no, only half of the sequence, with high variability. The end has the worst quality. - No, its seen only from middle towards the end - yes - it is half good and half medium/bad - not all but a lot as the box plots are getting wide/high - not at all - no - not all, quite some variation on the quality at the end of the sequencing (Broader bars) - Not all at the middle - Not all at the middle ::: :::success ##### ❓ Why is there a warning for the per-base sequence content graphs? - the sequence content per base is bad in the begining and the percentages (A/T and G/C) are not equal and constant over the length of the read - B/c we have 16S data so the reads with bad signals are usually expected in the begining of the sequence. Generally there should not be any change in the bases. 
- because of the high peaks in the beginning and therefore no even distribution of the bases. Bias due to 16S DNA-seq
- There is not a proportional content of bases A/T and G/C. Bias in the sequencing, probably due to the type of amplification strategy used.
- high peaks in the beginning of the sequences
- It is not clear to me
:::

:::success
##### ❓ Your questions
Q1: Not super important but out of curiosity: is there a biological/technical reason why 16S DNA has this bias compared to RNA-seq?

A: It could be caused by 5' truncated 16S rRNAs with 3' poly(A) tails (which could explain why they are enriched in adenine). A poly(A) tail is normally attached to the 3' end of an mRNA molecule and is generally believed to stabilize and protect RNA from degradation. I do not have enough information about how the samples were obtained, but it is possible that the 16S rRNA was purified using the [poly(A) tail method](https://pubmed.ncbi.nlm.nih.gov/18265239/), which could also contribute to adenine enrichment.

16S rRNA sequencing targets a specific region of the ribosomal RNA gene, the 16S subunit in the case of bacteria and archaea. This gene is highly conserved across these organisms. The bias observed in 16S rRNA sequencing is mainly due to its specificity for the 16S region. It focuses on a small portion of the genome, limiting the amount of information obtained about the whole transcriptome.

- thanks :)
:::

:::success
##### ❓ Why is there a fail for the per sequence GC content graphs?
- The beginning of the sequence content per base is not good and the percentages are not equal
- there is a big difference between the theoretical distribution, in blue, and the real one with many peaks. Probably contamination (teacher says)
:::

:::success
##### ❓ How could we find out what the overrepresented sequences are?
- Blast
- Blast
- check them in Blast
- copy and paste and blast
- Run it as a query in a nucleotide database
- Using the graphic from fastqc report Overrepresented sequences
- Through the hits
- blast the sequences given that fastqc found to be overrepresented in case there is no possible source
:::

:::success
##### ❓ Your questions
Q1: How would we know if the GC content is normal, contamination or bias? And what could we do about this? Is it acceptable up to a level?

A: The GC content depends on your organism, so you would need to know your experiment. Many factors contribute to the GC distribution; for example [Archaea show much higher GC content](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08353-7), since it is involved in the stabilization of DNA under high temperature conditions (C forms three hydrogen bonds with G, A only two with T). Other factors that can contribute include [GC microsatellite distribution, which is taxa-specific](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6387519/). Usually, a [non-normal](https://en.wikipedia.org/wiki/Normal_distribution) GC distribution indicates the presence of contaminants or some degree of degradation in the samples.

Q2: Are all the values with a percentage above 0.1% the ones that are overrepresented? Thanks!

A: Yes, but they should be checked against the list of contaminants to find out what they really are.

Q3: What is BLAST?

A: BLAST (Basic Local Alignment Search Tool) is a widely used software tool in the field of bioinformatics. It is used for comparing sequences of biological molecules such as DNA, RNA, or protein to identify similarities and potential homologies. For more details, have a look at: https://blast.ncbi.nlm.nih.gov/Blast.cgi

Q4: The per tile sequence quality graph that appears to me is different from the one that appears in the tutorial. Is that because the new data (which actually has no red at all) is "better"?
Thank you

A: Yes, it means that there is no batch effect between your samples.

Q5: Could debris in the sample affect the quality of the sequencing?

A: Yes, contamination is one of the artifacts that can affect the quality. It is possible to evaluate potential contamination with additional tools, such as Diamond (a faster alternative to BLAST).

Q6: Is 20 the common value used to trim?

A: Yes, it is usually the threshold, but there is no "objective" reason that explains why 20 and not 15 or 25. It also depends on the kind of sequences you have and on your research goals. In this experiment it has been set to 20, but in other experiments this value may differ, depending on many factors such as library preparation.
:::

#### Trim and filter - short reads

:::warning
##### ✏️ Hands-on: Improvement of sequence quality

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++
- Waiting for the job to be done: ++
- Need help:

##### Your questions
Q1: Would this workflow we are following be the same for scRNA-seq?

A: Yes, usually you follow similar steps. You can find more information about scRNA-seq analysis in the collection of [Galaxy single cell trainings](https://training.galaxyproject.org/training-material/topics/single-cell/).

##### Do you need help? Please describe your issue
- my trimmed number is different from the solution, 35.0 compared to 35.1
    - Please check the parameters of your tool execution (Cutadapt) and also the version of Cutadapt.
    - thanks, I mixed up the quality cutoff and the NextSeq trimming
    - Cool :)
- my report looks completely different than it should
    - Could you check that you have the correct input files of the tutorial? Was your FastQC report similar to the one in the tutorial? In Cutadapt, did you add the adapter sequence to `3' (End) Adapters`? And set the quality cutoff to `20`?
:::

:::success
##### ❓ What % reads contain adapter?
- 56.8%
- 56.8%
- 56.8%
- 56,8%
- (56.8%)
- 56,8
- Reads with adapters: 461 (56.8%)
- 56.8%
- 56.8
- 56.8%

##### ❓ What % reads have been trimmed because of bad quality?
- 35.1%
- 35.1%
- 35,1%
- 35,1
- 35.0%
- 84,277 bp (35.1%)
- Quality-trimmed: 84,277 bp (35.1%)
- 35.1%
- 84,277 bp (35.1%)
- 35.1%

##### ❓ What % reads have been removed because they were too short?
- 0%
- 0.0%
- 0%
- 0%
- 0 (0.0%)
- Reads that were too short: 0 (0.0%)
- 0
- 0.0%
- 0%
- 0%
:::

#### Processing multiple datasets

:::warning
##### ✏️ Hands-on: Assessing the quality of paired-end reads

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q1: How do you know which one is forward and which one reverse?

A: Usually it is stated by the sequencing facility. It is a convention to name the forward `_1` and the reverse `_2` at the end of the file name.
:::

:::success
##### ❓ What do you think about the quality of the sequences?
- The Phred score is mostly green, also the mean quality score, which is good.
- reverse reads have lower quality (in the end) and bad distribution for per base sequence content (QC failed). Also, there is a warning regarding per base content for forward sequences. As well, there is warning for GC content
- It looks fine and good enough to go on with the analysis
- They look better than our reads earlier. However, the reverse read on this set seems to fail the quality check on per base sequence content.
- one sample has lower quality, but still seems usable I guess
- Forward looks good with majority of phred scores in green but reverse is not as good as forward. But overall both look OK
- looks good
- sequences from forward sequencing look better than the reverse one
- Two samples have different qualities. In general I think both are good.

##### ❓ What should we do?
- We can trim and remove adapters for reverse
- trim/remove adapters with cutadapt
- trim and filter sequences with cutadapt
- Trim and filter bad quality regions
- Trim them
- Trim and filter
- Trim to improve overall quality
- Trim trim trim but in both files
- Trim both files together
:::

:::warning
##### ✏️ Hands-on: Improving the quality of paired-end data

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++++
- Waiting for the job to be done: ++
- Need help:
:::

:::success
##### ❓ How many basepairs have been removed from the reads because of bad quality?
- Quality-trimmed:182,802 bp (2.5%)
- 2,5
- 182,802 bp (2.5%)
- Quality-trimmed: 182,802 bp (2.5%)
- 182,802 bp (2.5%)
- 182, 802 bp (2.5%)
- 182,802 Bp (2.5 %)

##### ❓ How many sequence pairs have been removed because they were too short?
- 1,376 (1.4%)
- 1,376 (1.4%)
- 1,376 (1.4%)
- Pairs that were too short: 1,376 (1.4%)
- 1,4+
- 1376 (1.4 %)
:::

Let's come back at 12:10 (CEST)

:::success
##### ❓ Are you back?
- Yes +++++++++++++++++++
- No

##### ❓ Any questions regarding what we did until now?
Q1: If you have different single-end seq libraries that belong together (e.g. WT and a knock out mutant) would you process them together in cutadapt but leave the option single end?

A: Yes. A recommended approach would be to create a collection of all the reads, pre-process them all together and, after performing the trimming/QC evaluation, split the collection according to the different experimental conditions.

- thanks, would it be the same if you have paired-end for multiple conditions?
    - In the case of paired-end data, a better approach is to use a collection of paired lists, one for each experimental condition.

Q3: In case we have a subset of our own data and a public dataset (from different resources), would we process them together to get better batch normalization? Or do we need to do the normalization/batch effect correction step downstream?
For example, data from patients is not easy to get, so public data seems to be the best option. Any idea how to reuse this data so that it fits my research study?

You mean try to avoid differences in all factors as much as possible? What about the acceptability of data that comes from different sources, for example mapping rate, expression ratio/TPM?

This is indeed a very complex problem. You could try to check the expression of constitutive genes as a kind of control. You would expect constitutive genes to have similar expression patterns between samples, even if they belong to different data sources.

A: It is usually a bad idea to use data from different resources, at least if you intend to compare different experimental conditions. The reason is that artifacts associated with technical differences could introduce too much noise. In order to analyze data from different resources you need to check carefully that the same instruments/sequencers have been used, and also to evaluate the metadata supplied by the data providers (usually they provide additional information about the kits used to extract the samples, etc.)

##### ❓ Is the speed fine?
- Yes: ++++++
- Too slow:
- Too fast: +++
:::

### Slides: [Mapping](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/slides.html#1)

**Disclaimer**: We will not go through the full slidedeck

:::success
##### ❓ Any questions regarding this introduction to Mapping?
Q1: I understand that the alignment tells us which positions of the reference our read, probably a piece of DNA, corresponds to.

A: Yes, mapping gives you the positions of your reads on the reference genome sequence.

Q2: But which tool should be used? What are the variables to be considered? (Choosing an aligner)

A: Do you mean a tool for mapping? There are several tools, such as Bowtie2, RNA-STAR, etc., that are used for mapping.
Each of these tools has different parameters/variables to consider, depending on the kind of algorithm it uses internally. Please have a look at these tools in Galaxy to find out more about them. One such tool will be used in today's workshop for mapping.

A factor to take into account is the capacity of the mapper to handle spliced alignments (in the case of RNA-seq data, e.g. RNASTAR). On the other hand, if you are working with DNA-seq data, BWA-MEM2 can work properly, as there is no need to account for spliced alignments. You can check the [paper](https://doi.org/10.1093/bioinformatics/bts605) suggested in the presentation.

Q3: Are the SAM/BAM file formats used for the results of mapping?

A: Yes, those files store information about mapping. You can find more information [here](https://samtools.github.io/hts-specs/SAMv1.pdf).

Q4: I guess for prokaryotic data you would use different mapping tools due to the different genome structure?

A: Yes in the case of RNA-seq data, and no for DNA-seq data, where you use the same tools.
:::

### Tutorial: [Mapping](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html)

:::warning
##### ✏️ Hands-on: Open the tutorial
1. Go to [Galaxy](https://usegalaxy.eu/) (done in previous Hands-on)
2. Click on the hat in the top bar
3. Navigate to the topic "Sequence analysis"
4. Click on the tutorial "Mapping"

##### ❓ Have you found the tutorial? (Add a + when done)
- Yes: +++++++++++++++++++++
- No:
:::

#### Prepare the data

:::warning
##### ✏️ Hands-on: Data upload

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++++
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ What is a reference genome?
- representative genome for a species
- the genome that we will use to compare our read. the reference genome should be a sequenced sample of a specific organism.
- representative ("mean") genome of the study organism already fully sequenced
- a reference genome enables us to map our sequence to the corresponding organism's genome.
- A representative reference genome for the specific organisms

##### ❓ For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
- they correspond to different versions of the genome
- they are two different versions of reference human genome
- they are 2 different versions of human reference genome and hg38 is the most recent one.
- Different versions come from different sources.

##### ❓ Which reference genome should we use?
- mouse reference genome (mm10)
- mouse reference genome
- in our case - mm10, but in general maybe the last one (last update)
- mouse reference genome (mm10)
- mouse reference genome
- mouse reference genome
:::

:::warning
##### ✏️ Hands-on: Mapping with Bowtie2

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++
- Waiting for the job to be done: +
- Need help:

##### Your questions
Q1: In this example, could we use a different tool, e.g. RNA STAR?

A: An alternative to Bowtie2 is BWA. We have DNA data, so you should favor a DNA mapper over RNA ones like STAR or HISAT. You could use RNASTAR, but due to the differences in mapping algorithms, the results could differ to a certain degree. Also, RNASTAR requires much more computational resources (so it will take longer).

Q2: Is it beneficial to use more than 1 mapping tool in order to get more relevant results at the end? Thanks

A: No, it is usually unnecessary. Some tools (such as RNASTAR) allow mapping in two-step mode, which can be necessary in specific situations, such as the identification of new splicing sites. What could be more beneficial is to read up on the parameters and adjust them to your data.
:::

:::success
##### ❓ What information is provided here?
- How many reads are mapped or unmapped
- Quantitative, not qualitative. (we see amount of matches, not the quality of the sequences)
- how many reads were aligned
- Results of alignment
- Alignment results
- Alignment percentages
- results for the alignment process

##### ❓ How many reads have been mapped exactly 1 time?
- 43531 (87.06%) aligned exactly 1 time
- 42434 (84.87%) aligned concordantly exactly 1 time
- 44731 (89.46%)
- 44731 (89.46%) aligned concordantly exactly 1 time
- 84,87%
- 44731 (89.46%) aligned concordantly exactly 1 time
- 42434 (84.87%) aligned concordantly exactly 1 time
- 44731 (89.46%)
- 42434 (84.87%) aligned concordantly exactly 1 time
- 44731 (89.46%)

##### ❓ How many reads have been mapped more than 1 time? How is it possible? What should we do with them?
- 5179 (10.36%) aligned concordantly >1 times
- 4555 (9.11%) aligned >1 times
- 3389 (6.78%) aligned concordantly >1 times
- 3389 (6.78%)
- 3389 (6.78%) they fit to different parts of the ref genome
- 3389 (6.78%)
- 10,36%
- 3389 (6.78%)

##### ❓ How many pairs of reads have not been mapped? What are the causes?
- 2387 (4.77%) aligned concordantly 0 times. Causes, probably difference between the organisms.
- 1914 (3.83%) aligned 0 times
- 1880, most likely contaminations
- 1880 (3.76%)
- 4,7
- 1880
- 1880 (14.63%)
:::

#### Inspection of a BAM file

:::warning
##### ✏️ Hands-on: Inspect a BAM/SAM file

##### ❓ Are you finished with this section? Add a '+' below
- Yes:
- Waiting for the job to be done:
- Need help:
:::

#### General questions

:::success
##### ❓ Questions
Q1: I would like to help with translations of the course tutorials into Spanish, but I have only found information on how to create new tutorials. https://training.galaxyproject.org/training-material/topics/contributing/tutorials/create-new-tutorial/tutorial.html Do you know how I can do that?

A: You could try to contact Wendi Bacon (wendi.bacon@open.ac.uk); she was involved in the translations.

Excellent.
Thanks

A: We will translate the materials as a BioNT activity too, but later on. Keep an eye on our social channels to stay posted about general translations of BioNT materials (beyond this specific tutorial).

Q2: Excuse me, this is probably a basic question. Is the mapping tool the tool that we can use to test filiation (father - son)?

A: Do you want to use a tool like DESeq2 for differential evaluation, or specifically for mapping?

I do not know exactly which tool. I am thinking about two samples (not a reference) and mapping both to find the filiation between them.
:::

### Summary
- How to inspect the quality of a FASTQ file using FastQC
- How to improve the quality of your data by trimming it
- How to map sequencing data to a reference genome

### Feedback

:::success
##### ❓ One thing that was good about today
- The pace of the class, it was really easy to follow
- The speed of the class was perfect and the repetitions after loading the data or each analysis were helpful
- the explanations of QC report
- we covered a lot of material and the explanations were good
- Everything was excellent
- The class was interactive and very understandable
- The class was very practical and there was a lot of new information to be learned.
- The class was well packed with hands on skills

##### ❓ One thing to improve
- it was a little bit overwhelming with the amount of information
- a little bit faster and packed with a lot of new information than yesterday
- So much in a short period of time, reduce the info or increase time
- Maybe it would be better to modify the tutorial guides, or rather not to follow them completely, for us to cover all the agenda for the meeting.

##### ❓ Any other comments?
- Thank you and continue with the great workshops!
- Thank you for this world class workshop. I got new information today.
- Thank you for this workshop! Let's keep going! You are doing a great job!
- You are doing a great job! Thank you so much :)
- Good job.
:::

## Day 3 - Wednesday

### Table of Contents
1. [Welcome](#Welcome)
2. [Repetition of the day before](#Repetition-of-the-day-before)
3. [Slides](#Slides-Transcriptomics)
4. [Tutorial](#Tutorial-Reference-based-RNA-Seq-data-analysis)
    - [Data upload](#Data-upload)
    - [Quality control](#Quality-control)
    - [Mapping](#Mapping)
    - [Counting the number of reads per annotated gene](#Counting-the-number-of-reads-per-annotated-gene)
5. [Summary](#Summary)
6. [Feedback](#Feedback)

### Welcome

### Repetition of the day before

:::success
##### ❓ What do you remember from yesterday?
- QC is always the first step to do, mapping tools perform differently so be aware that there might be differences, quality can be improved by adapter removal and trimming
- QC can be performed using FASTQC tools. Quality of sequences depends on the sequencing technology used and the experimental conditions. To improve quality of sequences, the reads can be trimmed and filtered using cutadapt tool.
- QC evaluation, data quality improvement, mapping sequences to a reference genome
- Review the quality and mapping
- Check the quality of data and map reads to the reference genome
- We checked the quality of the data and we also mapped a sequence
- QC, fastq files, Mapping, BAM files

##### ❓ Do you have a question from the day before?
- What about the optimal percentage of mapping rate? Is a BAM file with 80% mapping good enough to continue with downstream analysis?
    - The higher the fraction of reads aligning to the reference, the better. The ideal value is 100%. Your rate could be a bit higher, e.g. >90%, but your number still sounds reasonable in practice.
:::

### Slides: [Transcriptomics](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/introduction/slides.html#1)

:::success
##### ❓ Any questions regarding this introduction to Transcriptomics?
Q1: Why are we not using gene-level values like TPM for downstream analysis? What is the meaning of the TPM calling step?
A: Because it misses some normalization, e.g. for differences in library composition. We will cover that tomorrow.

Q2: edgeR and DESeq2 have their own normalisation methods, so does that mean that you don't need to do any normalisation beforehand? I mean you need raw data! Is that correct?

A: Yes. We will see that tomorrow.

Q3: I read some suggestions that Cufflinks counting is not correct when compared with other tools like featureCounts or STAR counts. Is that right?

A: Cufflinks is not the recommended way to go nowadays. We recommend featureCounts.

Q4: If I perform variant calling using RNA-seq, do I need to exclude SNVs present in the 3' and 5' UTR regions? Will they have a higher chance of being false positives?

A: No, it is usually not required. 5' UTR mutations can, for example, impact promoter activity, so they are still "useful" from a biological point of view (e.g. https://pubmed.ncbi.nlm.nih.gov/23027126/). Mutations in the 3' UTR can also affect gene expression at different levels (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6267165/). Why do you expect more false positives?

Because they are not necessarily in the exonic region, right?

But they are still under selective pressure (e.g. 5' mutations that block expression of essential genes are expected to be removed from the population). So, I think no more false positives would be found in those regions.

Thank you for your help :)

This is interesting regarding your question I think: [Germline de novo mutation rates on exons versus introns in humans](https://www.nature.com/articles/s41467-020-17162-z#:~:text=Estimation%20of%20exonic%20and%20intronic%20de%20novo%20mutation%20rate&text=Using%20the%20largest%20dataset%20with,for%20exons%20and%20introns%2C%20respectively).
:::

### Tutorial: [Reference-based RNA-Seq data analysis](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html)

:::warning
##### ✏️ Hands-on: Open the tutorial
1. Go to [Galaxy](https://usegalaxy.eu/) (done in previous Hands-on)
2.
   Click on the hat in the top bar
3. Navigate to the topic "Transcriptomics"
4. Click on the tutorial "Reference-based RNA-Seq data analysis"

##### ❓ Have you found the tutorial? (Add a + when done)
- Yes:++++++++++++++
- No:
:::

#### Data upload

:::warning
##### ✏️ Hands-on: Data upload

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++
- Waiting for the job to be done: +++++++++
- Need help:

##### Your questions
Q1: If I search through the shared library I cannot find the data. I get the message `This folder is either empty or you do not have proper access permissions to see the contents. If you expected something to show up please consult the library security wikipage`

A: What did you search and where? (please guide me through the steps you did).

- I searched in Libraries: GTN - Material > Transcriptomics > then entered GSM461177
- Did you enter the folder `DOI...`?
- Where?
- In GTN - Material > Transcriptomics > Reference-based RNA-Seq
- Does that work now?
- No, I copied the files from the tutorial, but I cannot find them if I use shared libraries
- So it works with the URL?
- Yes
- Good :)
:::

:::success
##### ❓ How are the DNA sequences stored?
- FASTQ file
- FASTQ file
- In a file, which use a FASTQ format
- FASTQ file
- FASTQC file
- fastq file
- fastq

##### ❓ What are the other entries of the file?
- id,comment and quality score
- id and comment, quality score
- id starts with @, sequence in the second line, third + and the fourth is quality
- id, sequence, quality
- id and quality score
- id, sequence and quality scores
:::

#### Quality control

:::warning
##### ✏️ Hands-on: Quality control

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++
- Waiting for the job to be done: +++++++
- Need help: +

##### Your questions
Q1: Could you please repeat why we flatten our files? Thanks!

A: Unfortunately the current version of MultiQC (the tool we use to combine reports) does not support list of pairs collections.
So we need to transform our list of pairs into a simple list before running FastQC.

Q2: Is it normal that the data files have not uploaded since we started? It is taking longer than 15 minutes.

Yes. I tried URL- Ohh I get it, thanks.

A: Are you using TIaaS? Did you get your data using the URL or the shared data library? Using the URL can sometimes take a while if many people are doing the same. Please try using the data from the shared data library.

##### Do you need help? Please describe your issue
- I have a notice like: User does not have permission to use dataset (hg38|chr15|54546643|54546843|+.fasta) provided for input. when running the fastqc for dataset
    - here it is:
    - Where did you get that? On which file?
    - Can you share your history with berenice.batut@gmail.com?
    - Few comments:
        - You created a dataset pair instead of a paired collection
    - it works now when I tried to run the single pair :)
    - Great
- My MultiQC failed, I re-run now
    - What is the error?
    - BAM file was not found, I changed the input to FastQC and re-ran
    - now it is running
- My MultiQC gave a red flag
    - Can you check the error? Using the bug button?
    - I guess during the exercise we chose BAM file as a source of input by mistake. I corrected it to FastQC, now it is working
    - Great
    - Did it work at the end?
    - Yes!
- Hi! I think it is taking a while for my dataset to be flattened. Are there ways to troubleshoot this?
    - It should be quite instantaneous. Is it done now?
    - Not yet. Is it okay if I share the URL of the history here?
    - You can share your history with berenice.batut@gmail.com
    - Okay. Shared them atm.
    - It seems the flatten worked and that now the FastQC is grey, so waiting
    - When I expand or click the flattened lists, the reads appear to be in orange.
    - Ah true.
    - It seems to be because the input datasets are still uploading. Did you get them using the URL or the data library?
    - I did use the URL. I see now. Thank you.
:::

:::success
##### ❓ What is the read length?
- Sequence length 37 Total Bases 39.1 Mbp
- 37
- 37
- 37
- 37 for all 4 files
- 37
- All samples have sequences of a single length (37bp)
- 37
- 37
:::

:::success
##### ❓ What do you think of the quality of the sequences?
- GSM461177 (fw and rev) and GSM461180 (fw) have high percent of duplications (which is fine for RNA-seq data); GSM461180_reverse has a worse quality, especially at the end
- GSM461177_untreat_paired_forward has good quality. GSM461177_untreat_paired_reverse too but, the report Per tile sequence quality shows something wrong - red line - (majority blue, is ok)
- GSM461180_treat_paired_forward and GSM461180_treat_paired_reverse have both lower quality. But, GSM461180_treat_paired_reverse has the least. Mean quality score (blue line) drops about 16 though these sequences. Per tile shows hotter colours and this indicates worse quality.
- Treated samples have a lower quality and both relative high duplication rates, but may with trimming and filtering the quality could increase
- both forward libraries have a higher number of duplicates, treated rev sample has lower quality of the reads towards the end, no adapters were found
- Overall the quality looks good for reverse and forward untreated and forward treated but treated reverse does not seem to have good quality
- for the treated samples, trimming is necessary

##### ❓ What should we do?
- trim and filter
- filter
- Trim
- trim and filter
- trim
- trim and filter
:::

:::success
##### ❓ What is the relation between GSM461177_untreat_paired_forward and GSM461177_untreat_paired_reverse ?
- they are sequences of a paired sequencing
- They are from the same part of the RNA, from both directions read
- Since the data in the study is sequenced using pair end sequencing, the fragment is sequenced from both ends so two files are generated per fragment which are labelled as forward and reverse for each fragment.
- these are obtained by paired-end sequencing
- Sequence are made of both sides.
One is one side, the other is the other side.
- Paired_end sequence results F and R from the same sample
- from paired-end sequencing
:::

:::warning
##### ✏️ Hands-on: Trimming FASTQs

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++
- Waiting for the job to be done: +++
- Need help:

##### Your questions
Q1: Why do we not use the flattened dataset?

A: Because we want to have forward and reverse together, as in the paired collection.

Q2: When you decide the parameters, should the minimum length and the quality cutoff be the same? So if we decide on another min length, should we also change the cutoff?

A: There is no direct link between the minimum length and the quality cutoff. The minimum length is in bp and will depend on your input data; here we have quite short reads, so a small value. The quality cutoff corresponds to a Phred score, so a value between 0 and 60.

Is there any value that is kind of common when you decide that cutoff?

Usually 20 is a good value for the quality cutoff (and 50 for the minimum length, if you have sequences of at least around 100 bp), but it also depends on your sequences of interest. For example, when analyzing miRNA-seq samples, the limit should be set at around 20 nt.

Q3: Why did we choose single-end in Cutadapt? Shouldn't it be the paired-end collection?

A: Yes, in Cutadapt we need to use the paired-end collection, as you mentioned.

##### Do you need help?
- You can select with the mouse the cutadapt report on the right and then drag it into the source fetching of multiqc.
    - Thanks for the suggestion
:::

:::success
##### ❓ Why do we run the trimming tool only once on a paired-end dataset and not twice, once for each dataset?
- Because we are using a Paired-end collection file. Both files F and R should be processed together, in order not to lose information and to avoid mismatches between reads.
- because if a sequence is removed from dataset, its pair should be removed too
- We do not want to lose the pairing info
- because we select that we have paired end libraries that will be trimmed at the same time in the same way to keep the reads in the same order if one gets removed
- As we have put a collection as an input, we are running files that are related with each other. Analysing the first pair deals with both of the first two at the same time.(Paired end)
:::

:::success
##### ❓ How many sequence pairs have been removed because at least one read was shorter than the length cutoff?
- GSM461177_untreat_paired_2 1.4 % GSM461180_treat_paired_2 9 %
- GSM461177 pair: 147,810 (1.4%) GSM461180 pair: 1,101,875 (9.0%)

##### ❓ How many basepairs have been removed from the forward reads because of bad quality? And from the reverse reads?
- GSM461177 pair: Read 1: 5,072,810 bp Read 2: 8,648,619 bp GSM461180 pair: Read 1: 10,224,537 bp Read 2: 51,746,850 bp
:::

#### Mapping

:::success
##### ❓ What is a reference genome?
- We already answered these questions yesterday
    - Yes :) +
    - We want to check if you remember :) thanks :D
- genome sequence of a specific organism
- Reference genome is built with a lot of individuals to obtain a representative genome for specific organism
- the representative genome of a species

##### ❓ For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
- different versions
- two versions of the same genome
- Updates of the genome reference.
- Each version comes from different sources
- They are of different version or an improved version as we discover more about the genomes of our model organisms

##### ❓ Which reference genome should we use?
- dm6
- dm6 Drosophila melanogaster
- dm6
- Drosophila melanogaster genome
:::

:::warning
##### ✏️ Hands-on: Spliced mapping

##### ❓ Are you finished with this section?
Add a '+' below
- Yes: +++++++
- Waiting for the job to be done: +++++
- Need help:

##### Your questions

Q1: What exactly does it mean to compute coverage?
A: It computes the coverage, i.e. the number of reads mapping at each bp.

Q2: If we want to find the genome we used, as a GTF from UCSC, would we follow the same steps as yesterday? I am not sure about some of the parameters when I use UCSC.
A: Yes, you can use UCSC or other databases. You will probably need to read a bit more about the parameters to figure out which values to select.
And it doesn't matter which source we are using as long as we use the same version? E.g. Ensembl as a source of the genome

Q3: Will we get the same results if we use HISAT2 instead of RNA STAR? Which is the better choice?
A: Results would not be completely the same, but pretty similar. You can find a technical comparison here: [Evaluation of Seven Different RNA-Seq Alignment Tools Based on Experimental Data from the Model Plant Arabidopsis thaliana](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7084517). For example, according to the paper, "STAR has a higher tolerance for more soft-clipped and mismatched bases compared to HISAT2, which leads to a higher mapping rate for STAR and more unmapped reads for HISAT2".
Thanks a lot!

Q4: Why are we using both the built-in genome and the GTF file?
A: The built-in is only the FASTA sequence of the reference genome. The GTF contains information about the locations of the genes (information not found in the FASTA file). Using the GTF, we can count the number of reads mapped on each gene.

Q5: For the “Length of the genomic sequence around annotated junctions”, do you always take the read length before trimming?
A: Yes, because the trimming produces reads of different lengths, so it is hard to know the new length.

##### Do you need help? Please describe your issue
- I am waiting....for RNA Star output
- I am also waiting for rna star output and consequently the multiqc one.
And now all the rna star outputs are red and with an exclamation mark. It says the error occurred with the format bedgraph and the database dm6: `STAR --runThreadN 10 --genomeLoad NoSharedMemory --genomeDir /data/db/data_managers/rnastar/2.7.4a/dm6/dm6full/dataset_73f8ae73-a83e-42ef-93e9-4509ab6685b6_files --sjdbOverhang 36 --sjdbGTFfile /data/dnb09/galaxy_db/files/4/6/2/dataset_4628a937-ede4-4713`. Yes. I have just sent it from cincorioslau@gmail.com
  - What is the error you got? You can find it with the small
  - Can you share your history with berenice.batut@gmail.com?
  - You gave the wrong inputs to STAR: it should be the output of Cutadapt.
    Oh ok, I will check on that. Thx! Ohhh that was it, I have been trying to do RNA STAR properly and got no results. Thanks, I will change the Cutadapt settings and do it all over again. Thanks Berenice.
  - I realised you had the same as Teresa: wrong type of input for Cutadapt
- I am still waiting for RNA Star, it would be nice if we can slow down
  - Please just follow what Teresa shows. You can try later

:::

:::success
##### ❓ Which information do you find in a SAM/BAM file?
- sequence, quality, positions on chromosomes
- same information as FASTQ sequence, and quality. It is another format.
- sequence of the reads, mapping quality .. similar to fastq, but more detailed
- ID, position, sequence, quality
- CIGAR string

##### ❓ What is the additional information compared to a FASTQ file?
- mapping information (position, quality)
- positions on chromosomes, mapping quality
- M a p p i n g.
- mapping and CIGAR string
- mapping information

:::

:::success
##### ❓ Are you back?
- Yes +++++++++++++++++
- No

##### ❓ Any questions regarding what we did until now?
Q1: I may need to be shown the visualization with IGV again, if that's possible, and how to upload our BAM files there if we have IGV downloaded on our PCs
- figured
- Great ;)

Q2: I do not find Sashimi Plot in the menu
A: Is it not in the menu when you right click on the BAM file section (in IGV)?
Yes, in IGV, but where is the BAM file section? I did right click
The middle one.
Thanks, I found it

##### ❓ Is the speed fine
- Yes: ++++++
- Too slow:
- Too fast: ++
  - as a lot of us are still waiting for the jobs to run it is ok
  - Please then focus on what Teresa shows on her screen and you will try that later on your own. We understand it can be frustrating, but it is more important that you get the global idea of how things work and are connected, and where to find the information for you to dive into later
  - no it is fine! it could be a bit faster in other case.
  - Sure! We are sorry for the long running time. TIaaS gives a bit more resources than without it, but we still need to rely on an infrastructure shared with >70,000 users
  - I really understand this! No worries :)

:::

#### Counting the number of reads per annotated gene

:::success
##### ❓ Look at Fig.19. How many reads are found for the different exons?
- gene 1: exon 1 - 3 reads, exon 2 - 2 reads; gene 2: exon 1 - 3 reads, exon 2 - 4 reads, exon 3 - 3 reads
- gene 1: exon1: 3 reads, exon2: 2 reads gene2: exon1: 3, gene 2 exon2: 4, and gene2 exon3: 3
- gene 1: exon 1: 2; exon 2: 1; gene 2: exon 1: 1 exon 2: 1 exon 3: 1
- gene 1: exon 1: 3, exon 2: 2; gene 2: exon 1: 3, exon 2: 4, exon 3: 3
- Gene 1: Exon 1=3, Exon 2=2 and Gene 2: Exon 1=3, Exon 2=4, Exon3=3
- gene 1: exon 1: 3; exon 2: 2-- Gene 2: exon 1:3; exon 2: 4; exon 3: 3

##### ❓ Look at Fig.19. How many reads are found for the different genes?
- gene 1: 4 reads / gene 2: 6 reads
- Gene 1= 4 reads and Gene 2= 6 reads
- gene 1:4 ; Gene 2:6
- gene 1 - 4 reads; gene 2 - 6 reads
- gene 1: 5 gene2: 10 -- wrong

:::

#### Estimation of the strandness

#### Counting reads per gene

**Follow STAR version of the protocol**

:::warning
##### ✏️ Hands-on: Inspect STAR output

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++
- Waiting for the job to be done: +
- Need help:

:::

:::success
##### ❓ How many reads are unmapped/multi-mapped?
- Untreated: unmapped 1190042; multimapped 571204 treated: unmapped 1835969; multimapped 507391
- for GSM461177: Unmapped: 1190042 / Multimapped: 571204; for GSM461180: Unmapped 1835969 / Multimapped 507391
- 1190042 / 571204
- 1190042 unmapped / 571204 multimapped
- unmapped: 1190042, multimapped: 571204
- unmapped: 1190042 , multimapped: 571204
- untreated: 1190042 / treated 1835969
- nUnmapped: 118428 multimapped: 57393

##### ❓ At which line do the gene counts start?
- 5
- 5
- 5
- 5
- 5
- 5
- 5
- 5

##### ❓ What are the different columns?
- geneid and then counts depending on the strand - both or single
- geneID, and then counts depending on which strandness is chosen (unstranded, forward or reverse)
- Gene ID and counts (U, FStrand and SStrand)
- GeneId Counts_unstrand

##### ❓ Which columns are the most interesting for our dataset?
- Gene Id and the column of unstranded count. (Because we previously looked at an unstranded graph)
- gene Id and Counts_unstrand
- gene Ids and the counts
- gene ID and unstranded count
- id and counts unstrand
- Counts unstrand

:::

:::warning
##### ✏️ Hands-on: Reformatting STAR output

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions

Q1: How to proceed with STAR if Infer Experiment gives mixed results - some are unstranded, others - stranded?
A: A mixed result means you have an unstranded library.
Within the tutorial you can expand a section which gives you an explanation of how to interpret your Infer Experiment results.

:::

:::warning
##### ✏️ Hands-on: Getting gene length

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++
- Waiting for the job to be done:
- Need help:

:::

:::success
##### ❓ Which feature has the most counts for both samples? (Hint: Use the Sort tool)
- FBgn0284245
- FBgn0284245
- FBgn0284245
- perhaps eEF1alpha1 eukaryotic translation elongation factor 1 alpha 1
- FBgn0284245
- FBgn0284245 12869 - 12902

:::

#### General questions

:::success
##### ❓ General questions

Q1: Is there a possibility that we are sent TIaaS again?
A: Once you are registered, you are assigned for the whole week (we can share it tomorrow morning).

Q2: After the workshop, what can we do to have a faster process without TIaaS?
A: You can read our nice PhD Comics. Usually runs do not take that long. During a workshop there are many jobs run at once, which is not so much the case in day-to-day work.

Q3: Can we use STAR for lncRNA expression? Is it suitable?
A: Yes you can. If you have a specific experiment, you could also ask for feedback in the Galaxy community.

:::

### Summary

- How to check sequencing quality using FastQC
- How to pre-process reads of bad quality
- Map reads using a splice-aware mapper
- How to evaluate the quality of the mapping
- Inspect mapped reads with a genome browser
- Find the strandness of the library
- Estimate the number of reads per gene

### Feedback

:::success
##### ❓ One thing that was good about today
- Great job! Thank you both!
- really great how the complex content was explained
- really nicely explained, in detail
- well explained
- also seeing the troubles with running the jobs and how to fix mistakes

##### ❓ One thing to improve
- don't know if it is possible, but in case someone gets stuck because a job doesn't run or fails after a long waiting time, you could share a history so that people can keep on working
- For participants, maybe we can fetch the needed datasets from the data libraries (if they are available) before the workshop if possible.
- possibly suggest that we install the IGV app locally the previous day to have it ready
- extend the RNA-seq analysis days (Wed/Thu) a little bit because of the demanding calculations in Galaxy

:::

## Day 4 - Thursday

### Table of Contents

1. [Welcome](#Welcome)
2. [Repetition of the day before](#Repetition-of-the-day-before)
4. [Tutorial](#Tutorial:Reference-based-RNA-Seq-data-analysis)
    - [Analysis of the differential gene expression](#Analysis-of-the-differential-gene-expression)
    - [Functional enrichment analysis of the DE genes](#Functional-enrichment-analysis-of-the-DE-genes)
7. [Summary](#Summary)
8. [Feedback](#Feedback)

### Welcome

### Repetition of the day before

:::success
##### ❓ What do you remember from yesterday?
- how to analyse data from RNA seq (check the quality, mapping, counting)
- Upload, trimming, quality check of RNA-data, counting reads per genes.
- mapping of RNA-seq data in eukaryotes needs the consideration of exon-intron structure and to what reference you want to map it
- QC and mapping RNA-Seq data to a reference genome, reads per gene counting
- QC, mapping, counts of RNA seq data.
- in RNA seq analysis the amount of duplicated sequences is normally higher

##### ❓ Do you have a question from the day before?
- Yesterday I had a problem with my PC and the data never uploaded. Would it be possible to get the recording so I can run the tutorial again?
  - The recordings will be provided once the workshop is over.
  - Great, thanks!
- Yesterday at the end the counting files were two data collections, one from each sample, containing the counts for reverse and forward. Is that right? Do we need to unify the counts for each sample for the DESeq2 analysis today?
  - Do you mean that you end up with a count file for forward and reverse?
  - yes
  - It could be that within the mapping you did not specify that you have paired-end data, and your reads were mapped like single-end reads? You could check your parameters for RNA STAR.
  - I'll do, thanks! Yes you are right, this was the problem!

:::

### Tutorial: [Reference-based RNA-Seq data analysis](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html)

Part 2: Analysis of the differential gene expression

:::warning
##### ✏️ Hands-on: Open the tutorial

1. Go to [Galaxy](https://usegalaxy.eu/) (done in previous Hands-on)
2. Click on the hat in the top bar
3. Navigate to the topic "Transcriptomics"
4. Click on the tutorial "Reference-based RNA-Seq data analysis"

##### ❓ Have you found the tutorial? (Add a + when done)
- Yes:+++++++++++++++++++
- No:

:::

#### Identification of the differentially expressed features

:::warning
##### ✏️ Hands-on: Import all count files

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++++
- Waiting for the job to be done:
- Need help:

Shared history: https://usegalaxy.eu/u/berenice/h/ref-based-rna-seq---part-2---070923

##### Your questions

Q1: Should I add them as datasets or as a data collection?
A: Add them as datasets.

Q2: How do we automate it?
A: https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html
In the Galaxy file uploader, please choose the "collection" tab instead of "regular" for automatic creation of a collection of (to be) uploaded datasets.
For more details about data uploading in Galaxy, have a look at: https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/upload-rules/tutorial.html

Q3: If I look into my GSM files in the collection, they look different from yours. The first line is something with GeneID and the name of the GSM file.
A: That is the header of the file. The actual data starts from the second line.
--> That is clear, but why is this header not in your file? Will this create problems during the following steps?
If you have a header, then set this option in the DESeq2 section of the tutorial: “Files have header?”: Yes
--> okay

Q4: And should we also have technical replicates?
A: A biological replicate will give you the real/biological variance, therefore if you have the money go for biological replicates. With technical replicates you can test your experimental setup (technically).

:::

:::success
##### ❓ Any questions regarding normalization?

Q1: So paired-end sequencing is preferred - are there any drawbacks associated with it? (sample preparation, costs, ...)
A: The cost is the main drawback, but regarding experimental preparation, the requirements are not higher than for single-end samples.

:::

:::warning
##### ✏️ Hands-on: Add tags to your collection for each of these factors

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions

Q1: So the most important thing is to have correct names initially in the files? Or is there any way to change names automatically?
A: In general it is not important what name you give to a file. It is more important that you have the correct tags. Since we are now extracting the tags from the names, in this case it is important that the names follow the correct pattern for the pattern recognition.

Q2: Is there any other way to add tags (apart from manually)?
A: For any dataset, you can manually add tags by clicking just below its name in a history and using tag names such as #tag-name. But for adding tags to items in a collection, you will have to follow the approach mentioned in the tutorial. Otherwise, it would be difficult to add tags for 100s of items in a collection manually.

Q3: Does it matter if the tags are in a different order? Sometimes it shows single and paired first and sometimes treated and untreated.
A: No, it should not matter. Entities single-treated and treated-single mean the same thing.

Q4: These tags are the ones Galaxy will use to show me the results for the differential expression, right?
A: Yes, basically DESeq2 makes use of the tags internally in order to organize the samples in groups.

:::

:::warning
##### ✏️ Hands-on: Determine differentially expressed features

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++++
- Waiting for the job to be done: ++
- Need help: I need to restart DESeq2 again - now it worked :)

:::

:::success
##### ❓ Are you back?
- Yes ++++++++++++++
- No

##### ❓ Any questions regarding what we did until now?

Q1: I have a general question: if we have the counts in one table, is there any way to split that table into a format recognisable by DESeq2?
A: Could you provide more details? How did you generate the table?
In my case, I have the counts directly, all in one table provided by our RNA facility. They calculated them using STAR.
Probably they made use of software similar to featureCounts, but it is important to know which specific method was used for generating the table. Otherwise it is not possible to guarantee that the statistical analysis performed by DESeq2 provides meaningful information (due, for example, to different quantification measures: RPKM, FPKM, TPM, etc.). DESeq2 requires normalized counts.
Does DESeq2 require normalised counts? We used the raw counts from STAR (here also in our tutorial), or did I not understand correctly?
You are right, sorry: DESeq2 takes raw counts and performs the normalization internally. What is the format the sequencing facility provided you?
They performed quality control, mapping and counting using STAR. So they return an Excel table with the raw counts per sample. For example, in edgeR you can use this table as an input, so I was just wondering if that is possible in the DESeq2 tool.
I never performed the conversion, but I know that technically it is feasible.

Q2: Could you provide more details on why to use technical replicates?
- Is this a question?

E.g. when we have 2 technical replicates, should we put the counts together or take the mean of these two count files?
Replicates should be analyzed independently. By computing the average you lose a large amount of information, which affects the statistical power.
A: Technical replicates can help assess experimental variation. By replicating the same sample multiple times, researchers can gain a more accurate understanding of the natural variation in the data. This can help in distinguishing true biological changes from random noise in the data.

Q3: If we have multiple conditions, how are the results affected if you calculate them separately compared to having all the parameters in DESeq2 and calculating all the conditions in parallel?
A: By analyzing them together you can analyze how factors interact with each other. And by this, you find the differentially expressed genes between multiple factors.

Q4: It seems that DESeq2 with TPM can resolve the problem of different sequencing facilities and sequencing depth, so can we use it to compare the same condition from different data sources (like my own data and public data)?
Yes it is, both are the same method, same tissue sources, like cancer type, for example. I would like to know what TCGA/METABRIC or other public data did with collaboration? Do we need any tools to correct the batch effect, or is DESeq2 alone enough, please?
If yes, is it best to apply the batch effect removal (such as ComBat-seq, limma::removeBatchEffect) before or after the DESeq2 analysis?
A: You would like to compare your own results with public data, but both generated with the same method? If you have exactly the same experiment, just done in a different lab, you could use them together but enter the different 'location' of preparation as a factor. DESeq2 will factor out variability occurring because of that; it may be e.g. different people preparing the samples (batch effects).

Q5: Is it possible to change the appearance of the plots?
A: You could change the alpha value for the MA-plot. The plots are generated by DESeq2. If you feel comfortable with R, you could also contribute to the Galaxy tool wrapper https://github.com/galaxyproject/tools-iuc/blob/main/tools/deseq2/deseq2.R :)

##### ❓ Is the speed fine
- Yes: +++++++++
- Too slow:
- Too fast:

:::

:::success
##### ❓ What is the first dimension (PC1) separating?
- untreated left, treated right
- treated from untreated samples
- Treated from untreated
- treatment from untreated
- Treated vs Untreated 48 % Variance
- treatment effect
- Treatment
- treated from untreated

##### ❓ And the second dimension (PC2)?
- paired down, single up
- single-end from paired-end sequencing
- single from paired end sequencing
- Single from paired end sequencing
- Paired vs single - 33 % Variance
- single or paired end sequencing
- Sequencing type
- single from paired

##### ❓ What can we conclude about the DESeq design (factors, levels) we choose?
- The pattern looks as expected, keeping the different conditions in mind
- The graph shows the factors and the levels we selected for DESeq, which means the dataset was grouped according to the given information
- the datasets are grouped according to the factors we defined and no other factors that affect our data are observed
- The graph shows groups according to the factors defined in DESeq
- it would be better to have more biological replicates sequenced with the paired-end method
- pattern looks fine, but it might have been better to stick to one sequencing option

:::

:::success
##### ❓ How are the samples grouped?
- by treatment (untreated/treated) and, afterwards by sequencing type (single/paired)
- treated/ untreated
- By similarity (distance) between samples based on treatment and sequencing
- by how similar they are to each other
- by treatment and sequencing type
- based on the similarities between different samples, for example replicates seem to be closer
- how similar or dissimilar the samples are on the basis of treatment first and then sequencing
- Grouped by treatment and sequencing

:::

:::success
##### ❓ Is the FBgn0003360 gene differentially expressed because of the treatment? If yes, how much?
- yes. Its expression is significantly decreased by 8 FC (2^(2.99))
- Yes, it is decreased
- yes, it is significantly differentially expressed by log2FC of -2.9 = downregulated
- yes with a foldchange of 2^(-2.99)
- Expression of this gene is significantly lower in treatment compared to control.
- Yes, it is differentially expressed with p-adjusted value 2.56135287273603e-170
- Yes it is, its expression is 2.9 fold less in comparison to the untreated condition adj p value: 4.04078823111317e-178
- yes, p-adj value is below cutoff, by ~-2.99 log2FC
- yes it significantly changed (p adj<0,05) and it is downregulated by around 8 times (2^2,99)
- I think yes, the P-adj value is 2.56202717085357e-170 << 0.05
- Yes, its adjusted P-value < 0.05.
This means that there is enough evidence that the treated sample is significantly different from the untreated sample.

##### ❓ Is the Pasilla gene (ps, FBgn0261552) downregulated by the RNAi treatment?
- Yes -1.6 fold
- yes, slightly
- log2(fc) is negative so it is downregulated
- Yes it is downregulated as the log2(FC) is negative
- yes by ~-1.6
- Yes
- Yes, p-value is 1.03485683948963e-31 << 0.05
- yes, significantly by log2FC of -1.6

##### ❓ We could also hypothetically be interested in the effect of the sequencing (or other secondary factors in other cases). How would we know the differentially expressed genes because of sequencing type?
- use DESeq with different parameter - sequencing type instead of treatment
- use DESeq with the sequencing type as factor 1
- sequencing as factor 1 in DESeq2
- we should run DESeq2 with the factors switched (first factor: sequencing type, second factor: treatment)
- change primary factor to sequencing type
- Yes we can do that by simply switching the factors (from treatment to sequencing type) in DESeq (Factor 1 would be sequencing with both levels single and paired)
- change DESeq factor level
- Swap factor 1 (biological) with factor 2 (technical)

##### ❓ We would like to analyze the interaction between the treatment and the sequencing. How could we do that?
- in levels of treatment factor include both tags in each level: Treatment and type of sequencing
- by running DESeq2 with one factor and 4 levels (one for each condition)
- while inserting new factor levels during DESeq2, keep the treatment type the same but add sequencing types as different factor levels
- By using only 1 factor (treatment) and four factor levels.

:::

:::warning
##### ✏️ Hands-on: Annotation of the DESeq2 results

##### ❓ Are you finished with this section?
Add a '+' below
- Yes: ++++++++++++++
- Waiting for the job to be done:
- Need help:

##### Your questions

Q1: Is there a way to not lose the column names from the DESeq2 result table after annotating?
A: The column names will be re-added in the next step.

Q2: My order after the annotation is not exactly the same. At least the first one is a different gene.
A: Does the overall number of genes match in your dataset? Annotating is just an extension of the previous datasets generated by DESeq2.

Q3: So, over-expressed means up-regulated, those with positive fold change? And down-regulated are those with negative fold change? What about those with value zero (0)?
A: Yes, over-expressed genes are up-regulated, which means the genes are more active (+ fold change). Negative fold change shows down-regulation (less active). A value of 0 denotes no change - neither up- nor down-regulation.

:::

:::success
##### ❓ Where is the most over-expressed gene located?
- chrX: 10778953-10786907; minus strand
- FBgn0025111, chrX, reverse strand
- chrX 10778953 10786907
- chrX
- chrX - 10778953 10786907
- chr3R 30746684 30747172

##### ❓ What is the name of the gene?
- Ant2
- Protein coding Ant2 ()
- Ant2
- Ant2
- Ant2
- Ant2
- lncRNA:CR43238

##### ❓ Where is the Pasilla gene located (FBgn0261552)?
- Chr 3R
- chr3R: 9417939-9455500; + strand
- 9417939 9455500 chr3R
- chr3R 9417939 9455500
- chr3R
- Chr3R, 9417939-9455500, + strand

:::

:::warning
##### ✏️ Hands-on: Add column names

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++++++
- Waiting for the job to be done:
- Need help:
  - should we wait?

##### Your questions

Q1: The output of the annotate tool is always the same, right? So if we run DESeq2 on our data we can still add the same header?
A: Yes, you can copy the header from this tutorial and repeat the `Add column names` step for your own data.

:::

:::warning
##### ✏️ Hands-on: Extract the most differentially expressed genes

##### ❓ Are you finished with this section?
Add a '+' below
- Yes: +++++++
- Waiting for the job to be done: +
- Need help:

:::

:::success
##### ❓ How many genes have a significant change in gene expression between these conditions?
- 1091
- 1091
- 1092
- i found 966
- 955 records
- 955 lines
- 1092 genes have significant change in expression (padj<0.05)
- 966
- 967 with header
- 956 with header

:::

:::success
##### ❓ How many genes have been conserved?
- 20%
- 130
- 131 genes : Filtering with abs(c3)>1, kept 12.00% of 1092 valid lines (1092 total lines)
- 113 genes (11,79%)
- 113 Genes with significant adj p-value & abs(log2(FC)) > 1
- (204 lines or genes) 21.36% of 955 valid lines
- (205 lines) 21.44% of 956 valid lines
- 114 with header
- 113
- 204 21.36% of 955
- 130 genes significantly differentially expressed more than 2-fold
- 205 with header

##### ❓ Can the Pasilla gene (ps, FBgn0261552) be found in this table?
- yes
- yes
- yes
- yes
- yes
- Yes
- Yes

:::

:::success
##### ❓ Are you back?
- Yes +++++++++++
- No

##### ❓ Any questions regarding what we did until now?

Q1: Why do we get different results above? The same data, the same tools, I hope - the same options, but results are different...
A: Could you check your settings again? If you cannot find your error you can also share the history with teresa-m@t-online.de so we can have a look.
Thanks!
AA: Do you use p-value or p-adjusted? Maybe a potential source of confusion?
AAA: The differences between the Zenodo and shared data files could also be your issue.

Q2: Based on the question about DESeq2, I ran it again combining the tags treated and PE, untreated and PE, treated and SE and untreated and SE, but ended up with an error. Was it just because we didn't have enough replicates? Is it feasible to combine 2 tags in one factor level?
A: It is required that you have at least 1 replicate (i.e. >=2 inputs per condition). If you don't have enough observations (i.e.
separate fastqs), you can reduce the number of factors in your model so that the intra-group variation is calculable. E.g. Germans, Danish, Swedish, Chinese, Japanese, Korean -> Europeans, East Asians. This simplified model will inevitably have a lower resolving power.

Q3: I have another general question: if we want to do an analysis and we are not sure how to perform it in Galaxy, which tool or which series of tools to use, what would you suggest?
A: You can ask how to implement a specific analysis in the [Galaxy Help forum](https://help.galaxyproject.org/)

:::

:::warning
##### ✏️ Hands-on: Extract the normalized counts of the most differentially expressed genes

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++
- Waiting for the job to be done:
- Need help:

:::

:::warning
##### ✏️ Hands-on: Plot the heatmap of the normalized counts of these genes for the samples

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++
- Waiting for the job to be done:
- Need help:

:::

:::warning
##### ✏️ Hands-on: Plot the Z-score of the most differentially expressed genes

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++
- Waiting for the job to be done:
- Need help: I am reviewing my last steps but I have an error when I try to generate my heatmaps
  - Continue please
  - Could you please check whether the table in your input data is correct? Could you share the error message? You can find this in the bug report. If you do not find it you can also share the history with teresa-m@t-online.de
    (ok) Ready
  - `options(show.error.messages=F, error=function(){cat(geterrmessage(), file=stderr()); q("no",1,F)}) loc <- S`

##### Your questions

Q1: If we want to add the gene names, how can we add this information?
A: You can use the "Labeling columns and rows" option in the "heatmap2" Galaxy tool.

Q2: Is there a possibility to modify and edit the heatmaps?
If you have for example a lot of samples, the heatmap can be really crowded.
A: Yes, this is generated by an R script, and the possibilities are:
a) You know some programming, and you can perhaps edit the script in RStudio, optimised for your needs. You could try to modify the script in here and launch an interactive tool, for instance: https://github.com/galaxyproject/tools-iuc/blob/main/tools/heatmap2/heatmap2.xml
b) Some simple things can be done before feeding the data into the tool, such as filtering the list manually.
The heatmap2 output is unfortunately a static image (png) and it is not possible to manipulate it interactively.
To elaborate on a), Galaxy offers several interactive tools such as Jupyter notebooks and RStudio (such as https://usegalaxy.eu/?tool_id=interactive_tool_jupyter_notebook&version=latest). Using such tools, you can directly import your tables from a Galaxy history and create your customized plots using packages such as Bokeh, Seaborn or Matplotlib.

:::

#### Gene Ontology analysis

:::warning
##### ✏️ Hands-on: Prepare the first dataset for goseq

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++
- Waiting for the job to be done: +
- Need help:

##### Your questions

Q1: Could you repeat what float means?
A: A float is a number with a decimal part, e.g. 4.05, while an int is a whole number, e.g. 4. Float and int are data types.

Q2: We changed to uppercase based on the info of the tool, right?
A: Yes, using the "Change Case" tool. It is required by the tool to work properly.

:::

:::warning
##### ✏️ Hands-on: Prepare the gene length file

:::

:::warning
##### ✏️ Hands-on: Perform GO analysis

##### Your questions

Q1: I did not fully understand why GO term analysis requires the gene length if we directly specify whether something is significantly regulated or not?
A: Depending on the length of the gene, statistically you could have more reads mapping to it.
Therefore you would need to consider the length of the gene also for the GO term analysis.

Q2: Is it possible to generate a report that contains all the graphs from the analyses that we have done?
A: Unfortunately, there is no tool at the moment to generate all plots in a report.

Q3: Can we create heatmaps only for the DEGs? In that case we will only need to filter the normalised data based on that information?
A: Not sure if I understood your question. You can plot a heatmap with whatever data you want; you would just need to create the corresponding input correctly.

Q4: If we use the goseq tool in Galaxy and the GO analysis from the website, I imagine that we expect some differences?
A: If you use the same versions of the tools in Galaxy and of the original software on the same dataset, I would not expect differences.

:::

### Summary

### Feedback

Thanks!!

:::success
##### ❓ One thing that was good about today
- The speed of the class, it was really easy to follow, and the graph explanation. Thanks Berenice!
- Explanation was excellent
- clear explanations of steps and selected options, as well as of obtained results
- new Galaxy tools and protocols
- really detailed explanation, big help with answering also general questions
- great explanations

##### ❓ One thing to improve
- suggestion maybe for next time, if possible yesterday and today could be a little bit longer

##### ❓ Any other comments?
- Great job, great explanation! Thank you!
- Excellent work to explain as much as it was possible and repeatedly for us

:::

## Day 5 - Friday

### Table of Contents

1. [Welcome](#Welcome)
2. [Questions on the first 4 days](#Questions-on-the-first-4-days)
3. [Icebreaker](#Icebreaker)
4. [Tutorial](#Tutorial)
7. [Summary](#Summary)
8. [Feedback](#Feedback)

### Welcome

### Questions on the first 4 days

:::success
##### ❓ Do you have a question from the days before?

Q1: Can we repeat the last step from yesterday (the part on how to get the gene length file for goseq)?
That was a bit fast at the end -- yes, thank you :)
A: You can toggle the "Create gene-length file" option in featureCounts (or an equivalent tool) to get it reported as a separate file.
AA: There are two possibilities to get the gene lengths. The first one is as described above: select the output within the `featureCounts` tool if you used this one for quantification. Since we used the gene counts output by `STAR`, we had to use a tool called `Gene length and GC content`. Here the annotation (GTF file) is used to create a table containing the gene lengths, which is needed for the `goseq` analysis.
:::

### Icebreaker

:::success
##### ❓ What is your most used emoji?
Now that you have become a Markdown expert, you can use emojis too! Just go [here](https://gist.github.com/rxaviers/7360908) and copy the one you use most below. :wink:
- 😍
- :sunny:
- :wink:
- :stuck_out_tongue:
- :bowtie:
- :smiley_cat:
- :dolphin:
- :smirk_cat:
- :sparkles:
- 🙃 = :upside_down_face: (not available here)
- :relaxed:
- :smiley:
- 💉 != :syringe:
- :revolving_hearts:
- :sunglasses:
- :satisfied:
- :smiley_cat:
- 😍
- :wink:
:::

### Bioinformatics Data Types and Databases

- [Slides](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/online-resources-gene/slides.html#1)

:::success
##### ❓ Any questions regarding this introduction?
Q1: FASTQ. First line contains some metadata
A: Is this a question? Could you be more specific?+
AA: To clarify: in a .fastq file, the data is organised in 4-line blocks. The first line of each 4-line block contains some metadata describing the block (i.e. one read). Ok
Q2: Are the biological data formats unified across different databases or do different databases use their own specific formats?
A: They are portable to a certain degree, especially the commonly used ones. However, it does happen that, say, the chromosome naming conventions differ (1, chr1, chrI, Chr1, ...), while the file format per se is the same. Hiccups do happen.
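
The naming mismatch mentioned in the answer to Q2 can be sketched in a few lines of Python. This is only an illustration, not part of the tutorial: the function name `normalize_chrom` is made up for this example, and it assumes that a `chr` prefix is the target convention. Roman numerals (`chrI`) and mitochondrial names are deliberately left untouched.

```python
import re


def normalize_chrom(name: str) -> str:
    """Map spellings such as '1', 'chr1' or 'Chr1' to a single 'chr1' style."""
    # Remove an optional, case-insensitive 'chr' prefix, then add it back.
    core = re.sub(r"^chr", "", name.strip(), flags=re.IGNORECASE)
    return "chr" + core


# Hypothetical chromosome names as they might appear in different files.
names = ["1", "chr1", "Chr1", "X", "chrX"]
normalized = [normalize_chrom(n) for n in names]
print(normalized)
```

In practice it is often safer to rename chromosomes with a dedicated tool and a mapping table for the specific genome assembly, since a simple prefix rule does not cover every convention.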
Q3: Is there a difference between bed and wig/bigwig files?
A: BedGraph example: http://genome.ucsc.edu/goldenPath/help/bedgraph.html
A: Wig example: http://genome.ucsc.edu/goldenPath/help/wiggle.html
AA: https://genome.ucsc.edu/goldenPath/help/bigWig.html (seems binary, whereas bed is essentially a plain-text TSV)
A: These types of data can be converted to each other, but they "occupy a different amount of space" due to how they are designed.
Q4: Can you talk about graphical databases. No, about graph nodes. References: https://en.wikipedia.org/wiki/Graph_database I am working with Neo4j, but I am interested to know whether this or another graph database is used specifically in bioinformatics. If yes, which?
A: What type of data are you working with?
Indexing research papers. At this moment, I am indexing metadata in Solr and exporting it to Neo4j to navigate it.
A: All right, most of the graph data formats that exist in bioinformatics (e.g. the data in GO) refer to biological entities and not literature information. I am not aware of any file format developed in particular for biological literature information, but maybe it is worth having a look at Europe PMC (https://europepmc.org/)
Is GO this? https://en.wikipedia.org/wiki/Gene_Ontology
Q5: Do you know whether ChatGPT/BioGPT can already access the biological data in databases, or do you think it can be useful in the near future? (how?)
A: I don't know, but I think they don't scan raw biological data for their training. It would take a lot of resources to include all known genomes, for instance. However, I am aware of research projects that re-purpose the underlying algorithms with a training set that consists exclusively of biological datasets.
A: I think this could be relevant: https://www.sib.swiss/news/bringing-meaning-to-biological-data-knowledge-graphs-meet-chatgpt
Q6: A graph database is not only for literature; it can help to navigate through proteins and structures, but I do not know if something like this exists.
:::

#### Box to add for every break

Let's come back at 10:00 (CEST)

:::success
#### ❓Are you back?
- Yes (+++++++) +++++++++++++
- No ?

##### ❓ Is the speed fine
- Yes:+++++++++
- Too slow:
- Too fast:
:::

### Tutorial: One gene across biological resources and formats

* [Tutorial](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/online-resources-gene/tutorial.html)
* [Galaxy Europe](https://usegalaxy.eu/)

:::warning
##### ✏️ Hands-on: Open the Genome data viewer
https://www.ncbi.nlm.nih.gov/genome/gdv
:::

#### Searching Human Opsins

:::warning
##### ✏️ Hands-on: Searching Human Opsins

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++++++
- Waiting for the job to be done:
- Need help:
:::

In the [Genome Data Viewer](https://www.ncbi.nlm.nih.gov/genome/gdv/):

:::success
##### ❓ How many hits did you find in Chromosome X?
- 5
- 5
- 5
- 5
- 5
- 5
- 5
- 5
- 4
- 5
- 5

##### ❓ How many are protein coding genes?
- 4
- 4
- 4
- 4
- 4
:::

:::warning
##### ✏️ Hands-on: Open Genome Browser for gene OPN1LW
In the [Genome Browser](https://www.ncbi.nlm.nih.gov/genome/gdv/browser/genome/?id=GCF_000001405.40)

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++++
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ What is the location of the OPN1LW segment?
- NC_000023.11:154,142,764 - 154,160,511
- 154,142,764 - 154,160,511
- chromosome X:154,142,764 - 154,160,511
- chrX, 154,144,243..154,159,032
- 154,144,243..154,159,032
- NC_000023.11:154,142,764 - 154,160,511
- ChrX: 154,144,243 - 154,159,032
- 154,144,243..154,159,032
- NC_000023.11:154,142,764 - 154,160,511
- NC_000023.11: 154M..154M (17,748 nt)
- 154,142,764 - 154,160,511

##### ❓ What are introns and exons?
- Introns are part of the pre-mRNA but will be spliced out.
Only the exons make up the protein coding region that will be translated
- Exon is the part of DNA that contains information to codify a protein
- Introns are regions that are cut out in the transcription process.
- Exons are coding genes while introns are non-coding genes.
- Exons are coding regions of a gene, introns are the non-coding sections of a gene

##### ❓ How many exons and introns are in the OPN1LW gene?
- 6 Exons and 5 Introns
- 6 exons and 5 introns
- exons: 6; introns: 5
- 6 exons and 5 introns
- 6;5
- 6 exons and 5 introns
- 6 Exons, 5 Introns
- 6 Exons and 5 Introns

##### ❓ What is the length of the protein in number of amino acids?
- 364
- 364
- 364
- 364
- 364 aa
- 364
::::

:::warning
##### ✏️ Hands-on: Open Genome Browser for gene OPN1LW

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ What is the first AA of our protein product?
- M
- M
- Methionine
- M
- M
- M (methionine), which is always the first.
- M
- M (Methionine)
- M, which stands for methionine
:::

#### Finding more information about our gene

:::warning
##### ✏️ Hands-on: Go to a specific position in Sequence View
Start with the [NCBI search](https://www.ncbi.nlm.nih.gov/search/).

##### ❓ Are you finished with this section? Add a '+' below
- Yes:
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ Can you guess which type of conditions are associated to this gene?
- mutations assoc. with color vision
- related to visual color, but there are other references
- colorblindness
- colour blindness
- visual impairments with color deficiencies
- colorblindness
- Color vision defects
- Eye disorders
- colour vision disorders
:::

:::warning
##### ✏️ Hands-on: Open OMIM and Read as much as your interest dictates

##### ❓ Are you finished with this section?
Add a '+' below
- Yes: +++
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ What is the clinical significance of the rs5986963 and rs5986964? Any difference with the functional consequence of rs104894912? And what is the functional consequence of rs104894913?
- first two are benign, the rs104894912 is pathogenic, same for rs104894913
- rs5986963 and rs5986964: benign; rs104894912: stop_gained,coding_sequence_variant (pathogenic); rs104894913: missense_variant, coding_sequence_variant (pathogenic)
- rs5986963: benign; rs5986964: benign; the last two are pathogenic
- the first 2 are benign (rs5986963 and rs5986964), the 3rd and 4th are pathogenic
- rs5986963 and rs5986964 are both benign; the difference from rs104894912 is that rs104894912 is pathogenic. rs104894913 is also pathogenic, but it is a missense variant.
- rs104894912 stop_gained,coding_sequence_variant
- rs104894913 missense_variant,coding_sequence_variant
- rs5986963 and rs5986964 are benign and rs104894912 is pathogenic
- benign
- The first 2 are benign and the last 2 pathogenic
:::

:::warning
##### ✏️ Hands-on: Open Protein
Back to the [NCBI search](https://www.ncbi.nlm.nih.gov/search/all/?term=OPN1LW)

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++
- Waiting for the job to be done:
- Need help:
:::

:::warning
##### ✏️ Hands-on: Download the protein sequences

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++
- Waiting for the job to be done:
- Need help:

##### Is my pace ok?
- Yes: +++++++
- Too slow:+
- Too fast:+
:::

:::success
##### ❓ What does the folder contain?
- data report, data table, sequences
- gene, transcript and protein sequences, as well as data report and table and dataset catalog, all in data->ncbi_dataset folder and README file
- Contains a directory called ncbi
- Data table, data report
- data table with overview of the protein/gene, the fasta files, data report
- Sequences in FASTA format, a table containing info about OPN1LW in tsv format, a data report and data catalog in json format
- ncbi_dataset/data/
  ├── data_report.jsonl
  ├── dataset_catalog.json
  ├── data_table.tsv
  ├── gene.fna
  ├── protein.faa
  └── rna.fna

##### ❓ Do you think they implemented good data practices?
- yes
- yes, they include also metadata in several typical formats
- I am not sure, seems yes.
- yes
- I am not sure
- yes
:::

#### Searching by sequence

:::warning
##### ✏️ Hands-on: Search the protein sequence against all protein sequences

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++
- Waiting for the job to be done:
- Need help:

##### Your questions
Q: I copied the sequence and ran the search, but got 0 results. I will try again, but there seems to be a problem when I copy this content.
>NP_064445.2 OPN1LW [organism=Homo sapiens] [GeneID=5956]
MAQQWSLQRLAGRHPQDSYEDSTQSSIFTYTNSNSTRGPFEGPNYHIAPRWVYHLTSVWMIFVVTASVFT
NGLVLAATMKFKKLRHPLNWILVNLAVADLAETVIASTISIVNQVSGYFVLGHPMCVLEGYTVSLCGITG
LWSLAIISWERWMVVCKPFGNVRFDAKLAIVGIAFSWIWAAVWTAPPIFGWSRYWPHGLKTSCGPDVFSG
SSYPGVQSYMIVLMVTCCIIPLAIIMLCYLQVWLAIRAVAKQQKESESTQKAEKEVTRMVVVMIFAYCVC
WGPYTFFACFAAANPGYAFHPLMAALPAYFAKSATIYNPVIYVFMNRQFRNCILQLFGKKVDDGSELSSA
SKTEVSSVSSVSPA

A: I copy-pasted your sequence, and it returned many hits.
- Are you in the DNA/RNA search mode by accident? No
- Are you choosing a reference organism that has no opsin, say, E. coli?
- Any non-default scoring parameters you manually chose?

It finally ran; I followed the same steps and it returned results.
- Glad that it worked out.
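
When a pasted query returns no hits, one quick sanity check is to confirm that the copied FASTA record is intact before submitting it. The Python sketch below is only an illustration and not part of the tutorial: the helper `parse_fasta_record` and the `VALID_AA` set are names invented for this example, and the check assumes a single-record protein FASTA using the 20 standard amino-acid letters.

```python
# The 20 standard amino-acid one-letter codes (assumption: no ambiguity codes).
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta_record(text: str) -> tuple[str, str]:
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with a '>' header line")
    header = lines[0][1:]
    # Join the sequence lines and drop whitespace introduced by copy-paste.
    sequence = "".join("".join(ln.split()) for ln in lines[1:]).upper()
    return header, sequence


# The record pasted in the question above.
record = """>NP_064445.2 OPN1LW [organism=Homo sapiens] [GeneID=5956]
MAQQWSLQRLAGRHPQDSYEDSTQSSIFTYTNSNSTRGPFEGPNYHIAPRWVYHLTSVWMIFVVTASVFT
NGLVLAATMKFKKLRHPLNWILVNLAVADLAETVIASTISIVNQVSGYFVLGHPMCVLEGYTVSLCGITG
LWSLAIISWERWMVVCKPFGNVRFDAKLAIVGIAFSWIWAAVWTAPPIFGWSRYWPHGLKTSCGPDVFSG
SSYPGVQSYMIVLMVTCCIIPLAIIMLCYLQVWLAIRAVAKQQKESESTQKAEKEVTRMVVVMIFAYCVC
WGPYTFFACFAAANPGYAFHPLMAALPAYFAKSATIYNPVIYVFMNRQFRNCILQLFGKKVDDGSELSSA
SKTEVSSVSSVSPA
"""

header, seq = parse_fasta_record(record)
unexpected = set(seq) - VALID_AA  # letters that are not standard amino acids
print(header)
print(len(seq), seq[0], sorted(unexpected))
```

For this record the length comes out as 364 and the first residue as M, which matches the answers given earlier for the OPN1LW protein; an empty list of unexpected letters suggests the paste did not pick up stray characters.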
:::

In [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi):

:::warning
##### ✏️ Hands-on: Graphic Summary of the protein sequences

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ What is the first hit? Is it expected?
- long-wave-sensitive opsin 1 [Homo sapiens], yes that's the protein we used for the search
- long-wave-sensitive opsin 1 [Homo sapiens] and yes, it is expected
- long-wave-sensitive opsin 1 [Homo sapiens], yes
- long-wave-sensitive opsin 1 [Homo sapiens], yes
- long-wave-sensitive opsin 1 [Homo sapiens], sure

##### ❓ What are the other hits? For which organisms?
- homologs of this protein or very similar proteins, Pan troglodytes, Macaca mulatta, ...
- homologs of the protein in other species, Pan troglodytes (chimpanzee), Nomascus leucogenys: mainly primates
- homologs of this protein in other organisms and other versions of this protein uploaded in the system by different research teams
- hits in different organisms because the protein or gene is conserved; the organisms are mostly primates
- [Pan paniscus], [Nomascus leucogenys], [Gorilla gorilla gorilla], [Macaca mulatta], [Papio anubis], [Macaca fascicularis], [Hylobates moloch]
:::

:::warning
##### ✏️ Hands-on: Filter a BLAST Search

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++++
- Waiting for the job to be done:
- Need help:
:::

#### More information about our protein

:::warning
##### ✏️ Hands-on: Searching and Open results on UniProt

##### ❓ Are you finished with this section? Add a '+' below
- Yes: ++++++++++
- Waiting for the job to be done:
- Need help:
:::

### Summary

* You can search for genes and proteins using specific text on the NCBI genome.
* Once you find a relevant gene or protein, you can obtain its sequence and annotation in various formats from NCBI.
* You can also learn about the chromosome location and the exon-intron composition of the gene of interest.
* NCBI offers a BLAST tool to perform similarity searches with sequences.
* You can further explore the resources included in this tutorial to learn more about the gene-associated conditions and the variants.
* You can input a FASTA file containing a sequence of interest for BLAST searches.

:::success
##### ❓ Are you back?
- Yes: ++++++++++
- No

##### ❓ Any questions regarding what we did until now?
Q1: Why, if I use the filter Human, are the results different (5) instead of exactly human OPN1LW 7+
But if you use filter Human (left)
Ok, it is clear.
A: I checked for "human" in the bovine record; it does not appear anywhere visible. Maybe a hidden field?
Of course, it depends on the aggregate field configured for the search.

##### ❓ Is the speed fine
- Yes: ++++
- Too slow:
- Too fast:
:::

### Tutorial: One protein along the UniProt page

* [Tutorial](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/online-resources-protein/tutorial.html)
* [Galaxy Europe](https://usegalaxy.eu/)

:::warning
##### ✏️ Hands-on: Search for Human opsin on UniProtKB
In the [UniProt](https://www.uniprot.org/)

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++
- Waiting for the job to be done:
- Need help:
:::

:::warning
##### ✏️ Hands-on: Open a result on UniProt
In the [P04000 entry page](https://www.uniprot.org/uniprotkb/P04000/entry):

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +
- Waiting for the job to be done:
- Need help:
:::

#### Entry

:::success
##### ❓ What are the available formats in the Download drop-down menu?
- Text, Fasta (canonical), Fasta (canonical and isoform), JSON, XML, RDF/XML, GFF
- text, fasta, json, xml, gff
- Text, FASTA (canonical), FASTA (canonical & isoform), JSON, XML, RDF/XML, GFF
- xml is a general language for interactions between different programs
- Text, fasta, json, xml, gff

##### ❓ What type of information would we download through these file formats?
- I think, but I am not sure now, that JSON, XML and RDF/XML files contain the same information in different formats. FASTA, however, contains specific content (id, sequence, + and quality) and GFF is another kind of info
-
- FASTA for sequence, text for metadata about the protein
- fasta for the protein sequence, gff for the annotation information, text for metadata and sequence
- Amino acids sequence, annotation
:::

#### Names and Taxonomy

:::success
##### ❓ What is the taxonomic identifier associated with this protein?
- 9606 NCBI
- 9606
- 9606 NCBI
- 9606
- 9606 NCBI
- 9606
- 9606 NCBI

##### ❓ What is the proteome identifier associated with this protein?
- UP000005640
- UP000005640
- UP000005640
- UP000005640
- UP000005640
- UP000005640
:::

#### Subcellular location

:::success
##### ❓ Where is our protein in the cell?
- membrane
- membrane
- Membrane ; Multi-pass membrane protein
- Multi-pass membrane protein
- Membrane ; Multi-pass membrane protein

##### ❓ Is it coherent with the GO annotation observed before?
- yes, the first GO molecular function is virus receptor activity and there is also cytoskeletal motor activity, structural molecule activity
- yes, the GO annotation is photoreceptor disc membrane, photoreceptor outer segment, plasma membrane
- yes
- yes
- It is (Blue cone monochromacy (BCM)0)

##### ❓ How many Transmembrane domains and Topological domains are there?
- 7 transmembrane and 8 topological domains
- 7 transmembrane spanning helices
- 7 transmembrane and 8 topological domains
- 7 transmembrane and 8 topological domains
- 5 transmembrane and 2 shadow transmembrane - 6 topological domain and 2 shadow topological domain
- 8 Topological and 7 Transmembrane domains
:::

#### Disease & Variants

:::success
##### ❓ What types of scientific studies allow to assess the association of a genetic variant to a disease?
- Correlation studies between specific genetic variants and the presence of diseases.
- screen a bigger number of people with the disease - check their genome sequences for common SNPs compared to "healthy" people
- retrospective population study, where the sequence of the gene from human subjects is determined for healthy and patient populations and statistically evaluated
- genome association studies, case-control studies, family-cases studies
- Epigenetics
:::

#### PTM/Processing

:::success
##### ❓ What are Post-translational modifications for our protein?
- glycosylation, disulfide bond, a modified residue: N6-(retinylidene)lysine
- disulfide bond, glycosylation, modified residue (N6-(retinylidene)lysine)
- Glycosylation, Disulfide bond, Modified residue
- Glycosylation, disulfide bond and modified residue
- Glycosylation, disulfide bond, and modified residue
- Phosphorylated on some or all of the serine and threonine residues present in the C-terminal region
- glycosylation, disulfide bond and modified residue.
- what does the modified residue mean?
:::

:::warning
##### ✏️ Hands-on: Open a result on UniProt
Search for Human OPN1LW on STRING DB
In the [STRING page](https://string-db.org/network/9606.ENSP00000358967):

##### ❓ Are you finished with this section? Add a '+' below
- Yes: +++++++
- Waiting for the job to be done:
- Need help:
:::

:::success
##### ❓ How many different file formats can you download from there?
- ... as a bitmap image:
- png, svg, tsv, xml, mfa,
- 11
- png, svg, tsv, MFA (multi fasta)
- png, tsv, xml, MFA
- png, tsv, svg
- png, tsv, svg, xml (PSI-MI), mfa, csv (tab-delimited file)

##### ❓ What kind of information will be conveyed in each file?
- png: graphics, pictures, tsv: tab separated, MFA: amino acid sequences
- interaction map (graphic), node degrees, protein annotations and functions
-
- I have checked for my gene of interest and it shows that this gene interacts with itself as well. How can it be possible?
- portable network graphic: PNG; as a vector graphic: download SVG (scalable vector graphic); as short tabular text output: TSV; as an XML summary; protein node degrees; network coordinates: download a flat-file format describing the coordinates and colors of nodes in the network; protein sequences: download MFA
:::

#### Structure

Back to the [P04000 entry page](https://www.uniprot.org/uniprotkb/P04000/entry)

:::success
###### ❓ What is the variant associated to Colorblindness?
- 338
- VAR_064054 338 G>E in CBP; dbSNP:rs104894913

###### ❓ Can you find that specific amino acid in the structure?
- yes, in the linking region before the last helix
- yes
- Yes
- Sure
- yes

###### ❓ Can you formulate a guess of why this mutation is disruptive?
- maybe it disrupts the position or flexibility of this last helix
- perhaps it is an extramembrane part of the protein affecting the interaction with its partner protein in a complex (change in charge)
- maybe conformational change
- Absence of functional Red cone
:::

:::success
#### ❓ Questions
Q1: Will you be issuing certificates?
A: Please contact us if you want to have a certificate (contact@biont-training.eu)
Q2: Is it possible to receive a certificate?
A: Please contact us if you want to have a certificate (contact@biont-training.eu)
:::

### Summary

* How to navigate UniProtKB entries, accessing comprehensive details about proteins, such as their functions, taxonomy, and interactions
* The Variant and Feature viewer are your tools to visually explore protein variants, domains, modifications, and other key sequence features.
* Expand your understanding by utilizing external links to cross-reference data and uncover complex relationships.
* Explore the History tab for access to previous versions of entry annotations.

### Feedback

Q1: Your contact details (email) please.
A: contact@biont-training.eu
Or check our homepage: http://biont-training.eu/

:::success
##### ❓ One thing that was good about today
- Great job!
The pace was good, and the atmosphere - relaxed and excellent for work. Thank you, Lisanna!
- really detailed explanations, really interesting and informative, I can definitely work based on this information
- nice informative introduction to the accessibility of gene and protein related information
- overview of different data formats and also other biodata platforms (refresher including new features)
- good explanations and informative session regarding different databases and accessed data
- Great day. Thanks a lot. Very clear, Lisanna.

##### ❓ One thing to improve
- where can we get the link for the recorded video?
-
- perhaps more illustrative images beside the plain text/data in the initial presentation (from the platforms etc.), showing how they are related
-

##### ❓ Any other comments?
- Will you send this doc in our emails?
- I think we were supposed to ask again if we would like to have a certificate, right?
- Yes
- Thanks for the whole workshop, great job.
- Thanks a lot for all your time and patience! It was really great and helpful!
- Thanks!! That was a really informative workshop
- Thank you for such an informative workshop. I learnt a lot. Thank you all.
- Thank you for the workshop
- :+1: :confetti_ball:
- Thank you for this workshop! Y'all did great! We learnt a lot because of your efforts <3

Survey: https://survey.bio-it.embl.de/678593?lang=en