Commit 74207cf0 authored by María Morales Martínez

README updated

parent dd85ee8c
# Histone marks distribution in chromatin beds (IN CONSTRUCTION)
# Histone marks distribution in chromatin beds
<div style="text-align: justify">
This pipeline estimates the distribution of histone marks (or, potentially, of transcription factors) across defined regions of the nuclei. The chromatin is divided into "artificial" radial regions running from the periphery (Lamin) to the center of the nuclei, and statistics (mean and std) are calculated for the distribution of each acetylation/methylation mark in those regions. These regions are defined by their distance from the Lamin B1 protein.
This pipeline estimates the distribution of histone marks (or, potentially, of transcription factors) across defined regions of the nucleus. The chromatin is divided into "artificial" radial regions running from the periphery (chromatin bins associated with the Lamin protein, LADs) to the center of the nucleus, and statistics (mean and std) are calculated for the distribution of each acetylation/methylation mark in those regions. These regions are defined by their distance from the Lamin B1 protein.
</div>
## Types of analysis:
- **Genomic Distance Analysis** (*gd_analysis.py*): defines the nucleus sections by extending the genomic coordinates of LADs, as provided by ChIP-seq/DamID-seq, and using those extensions as the distance measure.
- **3D Distance Analysis** (*hic_analysis.py*): defines the nucleus sections using the interactions between chromatin bins, as provided by Hi-C experiments, as the distance measure.
## Requirements:
Before running the pipeline, the following steps are necessary:
### Data preparation
### Input data preparation
#### Histone ChIP-seq
Zipped ChIP-seq data (".bed.gz" format) must be stored in */data/marks/all*.
Zipped or unzipped ChIP-seq data (".bed.gz" or ".bed" format) must be stored in */data/hisone_marks/all*. If it does not already exist, create a folder named after the cell line, containing another folder named after the source of the marks. This last folder holds the histone marks data.
List of ChIP-seq bed files that will be used to check the chromatin. Thanks to this list, it does not matter if you have several replicates of the same sample or more ChIP-seq files than required, because only the files in the list will be selected.
List of ChIP-seq bed files that will be used to check the chromatin. Thanks to this list, it does not matter if you have several replicates of the same sample or more ChIP-seq files than required, because only the files in the "list of marks" will be selected.
#### Input Chromatin Bed
The chromatin input bed must be stored in */data/target_bed/[category]/* (created by you).
Chromatin regions of interest must be stored in ".bed" format in */data/target_bed/*. If it does not already exist, create a folder named after the type of chromatin region to study and store the file inside it.
Establishing different categories avoids mixing different types of bed files.
### Input Hi-C
#### Input Hi-C
Processed and normalized Hi-C data by chromosome (23 files) must be stored in ".abc" format in */data/Hi-C/*. If it does not already exist, create a folder named after the cell line, containing another folder named after the project. This last folder holds the Hi-C data.
### Defining compartments to check
By default, this tool uses 6 compartments for Genomic Distance Analysis and 10 for 3D Distance Analysis. The parameters that modify these sections are available in the *ranges.py* script.
By default, the Genomic Distance Analysis creates the *InsideLAD* (periphery) and center areas. The intermediate limits of the sections are established by the following variables:
- *LAD_ranges*: establishes the area limits.
- *LAD_names*: indicates the names of the sections.
```
LAD_ranges = [0, 250000, 1000000, 2500000, 5000000]
LAD_names = ["0kb", "250kb", "1000kb", "2500kb", "5000kb"]
```
** The 0 must be included to mark the beginning of the extension sections.
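As a rough illustration of how these limits partition the genome, the sketch below maps a distance from the nearest LAD border to a section label. The function `section_for_distance`, the label format, and the half-open interval convention (lower limit included, upper limit excluded) are assumptions made for this example; they are not taken from *ranges.py* or *gd_analysis.py*.

```python
from bisect import bisect_right

# Default values shown above (from ranges.py).
LAD_ranges = [0, 250000, 1000000, 2500000, 5000000]
LAD_names = ["0kb", "250kb", "1000kb", "2500kb", "5000kb"]


def section_for_distance(distance_bp, inside_lad=False):
    """Hypothetical helper: return a section label for a distance (in bp)
    measured from the nearest LAD border."""
    if inside_lad:
        return "InsideLAD"
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    # Number of limits that are <= distance_bp.
    idx = bisect_right(LAD_ranges, distance_bp)
    if idx == len(LAD_ranges):
        # Beyond the last limit: treated here as the center area.
        return "center"
    return f"{LAD_names[idx - 1]}-{LAD_names[idx]}"


# Four intermediate sections plus InsideLAD and center give 6 compartments.
n_sections = (len(LAD_ranges) - 1) + 2
```

With the default limits this yields the 6 compartments mentioned above; adding a value to *LAD_ranges* (and its name to *LAD_names*) would add one intermediate section.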
The 3D Distance Analysis sections are defined by percentiles. Defining the nucleus sections this way yields a similar number of regions of interest in each section. You can set the number of groups by changing the following parameters, which generate the percentile list with the range function:
- *start*: position to start.
- *stop*: stop position.
- *step*: increment.
```
start = 10
stop = 100
step = 10
```
** Change *start* and *step* to the result of rounding the division of 100 by the number of groups.
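The following sketch shows the percentile list that Python's built-in `range` produces from these parameters, and applies the rounding rule above for a different number of groups. The helper `percentile_cut_points` is a hypothetical name introduced only for this illustration.

```python
# Default values shown above (from ranges.py).
start = 10
stop = 100
step = 10

# Nine cut points split the data into ten groups.
percentiles = list(range(start, stop, step))


def percentile_cut_points(n_groups):
    """Hypothetical helper: apply the rule above, setting start and step
    to round(100 / n_groups), and return the resulting cut points."""
    value = round(100 / n_groups)
    return list(range(value, 100, value))


four_groups = percentile_cut_points(4)  # cut points for 4 groups
```

When 100 is not evenly divisible by the number of groups, the rounded step may not produce exactly that many groups (for example, 3 groups gives the cut points 33, 66 and 99), so it is worth checking the generated list.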
### Generating LADs files
These files are required by both types of analysis. For Genomic Distance Analysis, it creates the files with the extended LAD genome coordinates that define the areas; for 3D Distance Analysis, it normalizes the LAD coordinates so that they are comparable with Hi-C bins.
Once the nucleus sections are defined in *ranges.py*, define in the same script the variable *norm*, the unit used to convert Hi-C bins in contact into bp:
```
norm = 100000
```
** For instance: 54 Hi-C bins => 5400000 bp
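The conversion is a plain multiplication by *norm*. The sketch below shows it in both directions; the function names `bins_to_bp` and `bp_to_bin` are hypothetical, and the use of integer division to place a coordinate in a bin is an assumption, not necessarily what *LAD_interval.py* does.

```python
norm = 100000  # bp per Hi-C bin, as defined in ranges.py


def bins_to_bp(n_bins, norm=norm):
    """Convert a number of Hi-C bins into base pairs."""
    return n_bins * norm


def bp_to_bin(position_bp, norm=norm):
    """Hypothetical: index of the Hi-C bin containing a genomic position."""
    return position_bp // norm


distance_bp = bins_to_bp(54)  # the example from the note above
```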
Then launch *LAD_interval.py* with the cell line to which the LADs belong and the LAD file in *LADS/source* that you are going to use. For example:
**Not included in this example**
```
python LAD_interval.py H1-hESC Lamin_B1
```
### Installation
This tool is programmed under python==3.7.8.
This tool is programmed under python(3.7.8).
To ensure that all packages and dependencies install successfully, it is necessary to activate the virtual environment.
......@@ -50,34 +102,49 @@ Execute the following command adding the required information for the arguments.
python run_smk.py [-h] -c CONFIG -s SMK
```
- **Configuration (CONFIG)**: Input file with all the parameters required to run the pipeline: type of chromatin regions to contrast (target_category), bed inside this category to be analysed, histone ChIP-seq dataset with all samples, list of histone ChIP-seq files to use in the analysis. Save the new config file (".json") in the config folder.
- **Configuration (CONFIG)**: Input file with all the parameters required to run the pipeline: cell line of the histone marks and LAD samples (cell_type), type of chromatin regions to contrast (target_category), bed inside this category to be analysed (target_bed), histone ChIP-seq dataset folder with all samples ("marks_dataset"), list of histone ChIP-seq files to use in the analysis ("marks_list"), name of the LADs file ("LAD_bed"). Save the new config file (".json") in the config folder.
- **Snakemake workflow (SMK)**: snake file with the description of the workflow, followed by the rules that determine the order of data processing. Placed in the snakemake folder.
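Before launching the pipeline, it can help to check that a configuration file contains every key named in the CONFIG description. The sketch below is one possible check using only the standard `json` module; `REQUIRED_KEYS` and `load_config` are hypothetical names for this illustration and are not part of the pipeline scripts.

```python
import json

# Keys named in the CONFIG description (updated version of the README).
REQUIRED_KEYS = {
    "cell_type",
    "target_category",
    "target_bed",
    "marks_dataset",
    "marks_list",
    "LAD_bed",
}


def load_config(text):
    """Hypothetical helper: parse a JSON config and fail early on missing keys."""
    config = json.loads(text)
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        raise KeyError(f"missing config keys: {', '.join(missing)}")
    return config


# Values taken from the README's own config model.
example = """
{
  "cell_type": "H1-hESC",
  "target_category": "genes",
  "target_bed": "genes_IDR800_10kb",
  "marks_dataset": "Encode",
  "marks_list": "H1_Encode.txt",
  "LAD_bed": "Lamin_B1"
}
"""
config = load_config(example)
```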
*Quick summary:*
Config file model:
```
python run_config.py -h
```
{
"cell_type":"H1-hESC",
For more detailed information, consult the attached documentation in *pipelines/documentation*.
"target_category":"genes",
"target_bed":"genes_IDR800_10kb",
## Example
"marks_dataset":"Encode",
"marks_list":"H1_Encode.txt",
"LAD_bed":"Lamin_B1"
}
```
The tool provides an example that the user can run to check that the pipeline executes correctly.
- **Snakemake workflow (SMK)**: snake file with the description of the workflow, followed by the rules that determine the order of data processing. Placed in the snakemake folder.
The following command example indicates the configuration file and the script that collects the workflow instructions to run the analysis.
*Quick summary:*
```
python run_config.py -c encode_nre.json -s analysis.smk
python run_config.py -h
```
There is another configuration file, *encode_transcription.json*, for the top 3000 most highly expressed genes, which can be run in the same way.
For more detailed information, consult the attached documentation in *pipelines/documentation*. **TO UPDATE: only old version available.**
## Example
## Next updates:
As an example, the tool provides the data from the case study used to validate it: histones as a possible acetylation reservoir in genes and neutral regions.
- Add graphics that support the understanding of the internal analysis processes.
So that the user can run it end to end, this example includes all the input data described above.
- Add Hi-C bed to evaluate the real 3D chromatin distribution of histone marks.
The following command examples indicate the configuration file and the script that collects the workflow instructions to run the analysis.
- Add test to find bias among histone marks in different LAD regions.
**Genomic Distance Analysis**
```
python run_config.py -c encode_genes.json -s gd_analysis.smk
```
**3D Distance Analysis**
```
python run_config.py -c encode_nre.json -s hic_analysis.smk
```
......@@ -10,7 +10,7 @@ from ranges import * # LAD_ranges LAD_names
LAD_cell = sys.argv[1] # H1-hESC
LAD_bed = sys.argv[2] # Lamin_B1
norm = int(sys.argv[3]) # norm=100000
#norm = int(sys.argv[3]) # norm=100000
# Create dirs to save bed files for gd and hi-c analysis
......