gene-expression-chipseq
How the graphs for the paper are done
Use cases
Plot genes from gene ontology(ies) one group per one graph (from DGE table)
Plot genes from gene ontology(ies) multiple groups per one graph (from DGE table)
Counting number of genes in chipseq tables
Processing the chipseq data
Data files description
DGE_source-table.xlsx: the place where I did the filtering to get the significant genes; keeps all data in Excel
DGE_all.csv: table of differential gene expression, both significant and non-significant
DGE_significant.csv: table of only the significant genes, coming from DGE_all.xlsx
DGE_gene-symbols_all.csv: just the gene symbols of the DGE.csv
DGE_gene-symbols_significant.csv: just the gene symbols of the DGE_significant.csv
data_ChIP-seq_day-10_fixed.csv: the chipseq table containing all the hits
Data files links
Script files
1. DGE_get-unique-gene-symbol.R
DGE_get-unique-gene-symbol.R takes a single DGE table and returns a file with a list of unique genes.
It is useful for getting the list of all genes detected in RNAseq, or of the significantly dysregulated ones.
This is a way to make a filter out of a DGE table, which can then be used to filter the chipseq data.
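The idea of turning a table into a filter can be sketched on the command line. This is only an illustration, not the R script itself: the example table, its values and the output file name are made up, and it assumes a plain comma-separated file without quoted commas and with a gene_symbol header.

```shell
# Hypothetical example table; the real DGE tables have more columns.
workdir=$(mktemp -d)
cat > "$workdir/DGE_example.csv" <<'EOF'
gene_symbol,log2FoldChange,padj
MYL9,2.1,0.001
COL1A1,-1.4,0.03
MYL9,1.9,0.002
ACTA2,0.2,0.8
EOF

# Locate the gene_symbol column from the header, print that column
# for every data row, then keep each symbol only once.
awk -F',' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "gene_symbol") col = i; next }
  col     { print $col }
' "$workdir/DGE_example.csv" | LC_ALL=C sort -u > "$workdir/gene-symbols_example.csv"

cat "$workdir/gene-symbols_example.csv"
```

The result has one gene symbol per line, which is the shape a filter file needs.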
2. chipseq_get-unique-gene-symbol.R
chipseq_get-unique-gene-symbol.R does the same as above for the chipseq table. Because the original chipseq table has a Gene Name column instead of gene_symbol as in DGE, there are two scripts for this purpose.
It can be used to create filters for the DGE dataset.
3. DGE_read-and-filter_exact-only.R
DGE_read-and-filter_exact-only.R takes two inputs:
several filter files (each with a list of gene names)
a single DGE_* gene expression table
It filters the DGE_* gene expression table and creates an output folder with a date in the name.
Matching is case insensitive but exact: the whole gene name has to match, not just a part of it.
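What "case insensitive but exact" means can be shown with a small awk sketch. It is not the R script; the file names and values are invented for the illustration, and it assumes the gene symbol is the first column of a plain comma-separated table.

```shell
workdir=$(mktemp -d)

# Hypothetical filter file: one gene name per line.
cat > "$workdir/filt_example.txt" <<'EOF'
myl9
COL1A1
EOF

# Hypothetical DGE table with the gene symbol in the first column.
cat > "$workdir/DGE_example.csv" <<'EOF'
gene_symbol,log2FoldChange,padj
MYL9,2.1,0.001
MYL9B,0.5,0.4
Col1a1,-1.4,0.03
ACTA2,0.2,0.8
EOF

awk -F',' '
  NR == FNR { wanted[tolower($1)] = 1; next }  # first file: remember the filter names
  FNR == 1  { print; next }                    # second file: keep the header row
  (tolower($1) in wanted)                      # keep rows whose whole name matches
' "$workdir/filt_example.txt" "$workdir/DGE_example.csv" > "$workdir/filtered_example.csv"

cat "$workdir/filtered_example.csv"
```

MYL9 and Col1a1 are kept despite the different letter case, while MYL9B is dropped because only a part of its name matches.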
4. DGE_plot-top*
DGE_plot-top* takes multiple files which are the outputs from DGE_read-and-filter... and creates a graph of the top x DGE genes.
All files are in an output folder.
5. append_csvs.R
append_csvs.R appends the csvs created by DGE_read-and-filter_exact-only.R for plotting multiple groups of genes.
It creates a new column called gene_ontology and puts into it the name of the csv file each line came from.
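The appended table has the following shape, sketched here with awk instead of R. The two input files, their names and their values are made up; whether the real script stores the file name with or without its extension is not specified, so keeping the full file name is an assumption.

```shell
workdir=$(mktemp -d)

# Two hypothetical filtered tables, one per gene group.
cat > "$workdir/muscle-contraction.csv" <<'EOF'
gene_symbol,log2FoldChange
MYL9,2.1
EOF
cat > "$workdir/collagen-fibril.csv" <<'EOF'
gene_symbol,log2FoldChange
COL1A1,-1.4
EOF

awk -F',' -v OFS=',' '
  FNR == 1 { if (NR == 1) print $0, "gene_ontology"; next }  # write the header once
  {
    name = FILENAME
    sub(/.*\//, "", name)   # drop the directory part of the path
    print $0, name          # append the source file name as a new column
  }
' "$workdir/muscle-contraction.csv" "$workdir/collagen-fibril.csv" > "$workdir/appended_example.csv"

cat "$workdir/appended_example.csv"
```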
6. reformatting-chipseq.R
reformatting-chipseq.R reformats the data_ChIP-seq_day-10_fixed.csv to get the binding locations and intron numbers in a nicer format.
7. chipseq_read-and-filter_exact-only.R
chipseq_read-and-filter_exact-only.R works the same as the DGE_read-and-filter... .R but requires the data_ChIP-seq_day-10_fixed.csv file. It creates the same kind of output in a folder with a date.
It is useful for getting the chipseq genes which are in the DGE table or in the DGE table of significant genes.
8. count-binding-sites.R
count-binding-sites.R takes the reformatted chipseq table (with the location column) and counts the binding sites (all sites, and also just the genes where binding occurs).
9. chipseq_gene-names_by-binding-locations.R
chipseq_gene-names_by-binding-locations.R splits the reformatted chipseq table by the different binding locations and outputs the lists of genes into csv files.
Specifications
Filter files
A filter file contains one gene_symbol (for differential gene expression tables) or one Gene Name (for the chipseq data table) per line. Filter files are used to select the specified genes from the above-mentioned tables.
Gene ontologies filters
Gene ontology filters are obtained from AmiGO 2.
The filters are stored in the _filters_gene-onotologies_specific-genes folder.
Getting the dataset from AmiGO 2:
Select the gene ontology
Select homo-sapiens
Select the gene_label
Save as a .txt file named after the ontology and its GO number
Other filter files
These are filters for specific sets of genes, either hand-picked or grouped by a common name.
Custom filter hand-picked
Write the gene_symbols into a text file, one gene per line.
Gene groups with common name
For collagens, myosins, etc., use grep to pull the names starting with a common sequence of characters out of the differentially expressed genes:
grep -ie '^myl' DGE_gene-symbols_all.csv > filt_myl.csv
Next, open the file and remove the genes you are not interested in.
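A small run on a made-up symbol list shows what the pattern does and why the manual clean-up is needed. The input file and the XMYL1 entry are hypothetical; the grep call is the same as above.

```shell
workdir=$(mktemp -d)

# Hypothetical list of gene symbols, one per line (XMYL1 is a made-up name).
printf '%s\n' MYL9 MYLK Myl12a MYH11 XMYL1 COL1A1 > "$workdir/gene-symbols_example.csv"

# -i ignores letter case, ^ anchors the match to the start of the line.
grep -ie '^myl' "$workdir/gene-symbols_example.csv" > "$workdir/filt_myl.csv"

cat "$workdir/filt_myl.csv"
```

The anchor keeps out XMYL1, where "myl" is not at the start, but MYLK (myosin light chain kinase) still gets through although it is not a myosin light chain, so it would have to be removed by hand.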
Pipelines
1. Processing the Differential expression data to plot single gene ontologies
Create the filters.
Run the DGE_read-and-filter_exact-only.R on the selected filters and DGE_all.csv
Check the output_date.. folder for the filtered tables
Run the DGE_plot-top*.R script on the filtered tables.
Check the output graphs in the graphs_date.. folder and copy the useful ones into a _selected_graphs folder
2. Processing the Differential expression data to plot multiple gene ontologies in one graph
Create the filters.
Run the DGE_read-and-filter_exact-only.R on the selected filters and DGE_all.csv
Check the output_date.. folder for the filtered tables
Run the append_csvs.R to create one file from the files of the previous step
Run the DGE_plot-vertical-grouped.R script on the table of appended csvs.
Check the output graphs in the graphs_date.. folder and copy the useful ones into a _selected_graphs folder
3. Counting number of genes in chipseq tables
The chipseq dataset contains multiple entries for a single gene. To get the list of unique genes in the table or in a subset of it, use chipseq_get-unique-gene-symbol.R; the resulting list of gene names is easy to count when opened in Excel.
4. Processing the chipseq data
Run the reformatting-chipseq.R script on the data_ChIP-seq_day-10_fixed.csv to get the additional columns location (the location of binding: promoter, intron, exon, 3UTR etc.) and intron_number (numbers of introns from 1 to 5) in a file called chipseq_with_locations.csv in the output_date.. folder. Rename it to chipseq_with_locations_all.csv
Run the chipseq_read-and-filter_exact-only.R with the filter DGE_gene-symbols_significant.csv on chipseq_with_locations_all.csv to get only the significantly dysregulated genes. Rename the result to chipseq_with_locations_signif.csv
Run the count-binding-sites.R on the chipseq_with_locations_all.csv to count the YAP1 binding to:
  locations (intron, exon, promoter etc.), counting all the binding sites (YAP1 can bind to more than one site in one location of one gene)
  introns 1 to 5, counting all the sites (again, it can bind more than once in intron 1, for example)
  locations, counting just the genes where it is bound
  introns 1 to 5, counting just the genes where it is bound
Run the chipseq_gene-names_by-binding-locations.R on the chipseq_with_locations_all.csv to get the gene name sets for all the locations, and for introns 1 to 5 in detail, in the output_date... folder.
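The difference between counting all binding sites and counting just the genes can be sketched with awk. This is an illustration only, not what count-binding-sites.R does internally: the rows are invented and the table is reduced to the Gene Name and location columns, assumed here to be the first and second column.

```shell
workdir=$(mktemp -d)

# Hypothetical reformatted chipseq table: one row per binding site.
cat > "$workdir/chipseq_example.csv" <<'EOF'
Gene Name,location
CTGF,promoter
CTGF,promoter
CTGF,intron
ANKRD1,intron
MYL9,intron
EOF

awk -F',' '
  NR == 1 { next }                            # skip the header
  {
    sites[$2]++                               # every row is one binding site
    if (!seen[$2 SUBSEP $1]++) genes[$2]++    # count each gene once per location
  }
  END { for (loc in sites) print loc "," sites[loc] "," genes[loc] }
' "$workdir/chipseq_example.csv" | LC_ALL=C sort > "$workdir/counts_example.csv"

# Columns: location, number of binding sites, number of genes bound
cat "$workdir/counts_example.csv"
```

In this made-up table the promoter has two binding sites but only one bound gene, because both sites belong to the same gene.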