Takes a single DGE table and returns file with a list of unique genes.
Is useful to get the list of all genes detected in RNAseq or the significantly disregulated.
This is a way how to make a filter out of a DGE table which can then be used to filter chipseq data
2. chipseq_get-unique-gene-symbol.R
The same as above, but because the original table has Gene Nameinstead of gene_symbol as in DGE there are two scripts for this purpose.
Can be used to create filters to filter the DGE dataset.
3. DGE_read-and-filter_exact-only.R
Takes two imputs:
several filter files (with list of gene names)
single DGE_* gene expression table as input.
Filters the DGE*gene expression table.
Creates an output folder with a date in the name.
It is case insensitive, but makes exact matches.
4. DGE_plot-top*
Takes multiple files which are the outputs from DGE_read-and-filter... and creates graph of top x dge genes.
All files are in an output folder.
5. append_csvs.R
Appends csvs created by DGE_read-and-filter_exact-only.R for plotting of multiple groups of genes.
Creates a new column called gene_ontology and puts the name of the csv file from which that line came from
6. reformatting-chipseq.R
Reformats the data_Chip-seq_day-10_fixed.csv to get the binding localizations and intron numbers in nicer format
7. chipseq_read-and-filter_exact-only.R
Works the same as the DGE_read-and-filter... .R but requires the data_ChIP-seq_day-10_fixed.csv files creates the same kind of output in a folder with date.
Is useful to get the genes which are in the DGE table or in the DGE table of significant genes
8. count-binding-sites.R
Takes the reformatted chipseq table (with locations column) and counts the binding sites (all and also in the genes)
9. chipseq_gene-names_by-binding-locations.R
Splits the reformatted chipseq table by the different binding locations and outputs the list of genes into csv files
Filter file is a file which contains one gene_symbol (for differentail gene expression tables) or Gene Name (for chipseq data table) per line. Are used to filter the specified genes out of the abovementioned tables.
Gene ontologies filters
Gene onotologies filters are get from AmiGo2.
The filters are stored in _filters_gene-onotologies_specific-genes folder\
Getting the dataset from amigo2:
Select the gene ontology
Select homo-sapiens
Select the gene_label
Save with a name of ontology and the GO number as a .txt file
Other filter files
These are filters for either specific sets of gene
Custom filter hand-picked
Write the gene_symbols into a text file, one gene per line.
Gene groups with common name
For collagens myosins etc using the grep to filter out the names starting with common sequence of characters out of the differentially expressed genes.
Next open the file and remove the genes you are not intersted in.
Pipelines
1. Processing the Differential expression data to plot single gene ontologies
Create the filters.
Run the DGE_read-and-filter_exact-only.csv on selected filters and DGE_all.csv
Check the output_date.. folder for the filtered tables
Run the DGE_plot-top*.R script on the filtered tables.
Check the output graphs in the graphs_date.. folder and copy the useful ones into a _selected_graphs folder
2. Processing the Differential expression data to plot multiple gene ontologies in one graph
Create the filters.
Run the DGE_read-and-filter_exact-only.csv on selected filters and DGE_all.csv
Check the output_date.. folder for the filtered tables
Run the append_csvs.R to create on files from the previous step
Run the DGE_plot-vertical-grouped.R script on the table of appended csvs.
Check the output graphs in the graphs_date.. folder and copy the useful ones into a _selected_graphs folder
3. Counting number of genes in chipseq tables
The chipseq dataset contains multiple entries for single gene. In order to get the list of unique genes in the table or its subset use the chipseq_get-unique-gene-symbols.R to get the list of the genenames where it is easy to count the genes when opening it in excel.
4. Processing the chipseq data
run the reforrmatting-chipseq.R script on the data_ChIP-seq_day-10_fixed.csv to get additional columns location (contains the location of binding promoter, intron, exon, 3UTR etc) and intron_number (numbers of introns from 1to5) in a file called chipseq_with_locations.csv in output_date.. folder. Rename it to chipseq_with_locations_all.csv
Run the chipseq_read-and-filter_exact-only.R with filter DGE_gene-symbols_significant.csv on chipseq_with_locations_all to get only the non-significantly disregulated genes. Rename it to chipseq_with_locations_signif.csv
Run the count-binding-sites.R on the chipseq_with_locations_all.csv to count the YAP1 binding to:
locations (intron, exon, promoter etc) - count all the binding sites (can bind to more site in one location of one gene)
introns 1 to 5 -count all the sites (again can bind more times in intron 1 for example)
locations but count just the genes where it bound
introns 1 to5 count just the genes where it is bound
Run the chipseq_gene-names_by-binding-locations.R on the chipseq_with_locations_all.csv to get the gene name sets for all the locations and the introns 1 to 5 in detail in output_date... folder.