Overview of MaRe pipeline

overall_scheme

Input parameters

Single line textfields (except date field)

Different entries may be separated by 'space' and/or 'tab' symbols. Several separators following one after another without any term between them are treated as a single separator.

'&', '|', 'AND', 'OR', 'and' and 'or' are treated as logical operators (unless placed within quotes "...").

'(' and ')' can be used to group terms and build complex queries (unless placed within quotes "...").

Operators have to be separated from terms by white spaces, for example: 'word & (word | word)', but not 'word&(word|word)'.

Quote strings with double quotes when searching on terms composed of several words and/or containing at least one of the following symbols: .  \  |  (  )  [  ]  {  }  ^  $  *  +  ? In particular, quote authors if you use initials.
E.g.: 'brain AND ("immature neurons" OR "CBA/CaH (CBA) mice")'
E.g.: 'Hoffman OR "Joshi V"'

Single quotes are treated as apostrophes. E.g.: 'Dupuytren's Disease'

Unquoted terms composed of several words will be automatically split into several terms connected to each other with "AND". E.g.: 'smooth muscle OR "skeletal muscle"' => '(smooth AND muscle) OR "skeletal muscle"'

Wildcards (*) can be used to extend terms in a query. When quoted, a wildcard does not extend a term but is searched for as a literal '*' symbol.
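The rules above can be illustrated with a short Python sketch. It is not MaRe's own parser: the names `tokenize` and `normalize`, the treatment of unquoted parentheses as separate tokens, and the spacing of the output are assumptions made for illustration only.

```python
import re

# A token is a double-quoted string, a single parenthesis, or a run of
# characters that contains no white space, parentheses or double quotes.
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

# Spellings of the logical operators listed above, mapped to one form.
OPERATORS = {"&": "AND", "|": "OR", "AND": "AND", "OR": "OR",
             "and": "AND", "or": "OR"}


def tokenize(query):
    """Split a query on spaces/tabs; repeated separators count as one."""
    return TOKEN_RE.findall(query)


def normalize(query):
    """Join adjacent terms with AND and unify the operator spelling."""
    out, run = [], []

    def flush():
        # Several terms in a row become one parenthesised AND group.
        if len(run) > 1:
            out.append("(" + " AND ".join(run) + ")")
        elif run:
            out.append(run[0])
        run.clear()

    for tok in tokenize(query):
        if tok in OPERATORS:
            flush()
            out.append(OPERATORS[tok])
        elif tok in ("(", ")"):
            flush()
            out.append(tok)
        else:
            # Quoted strings (including quoted operators) are plain terms.
            run.append(tok)
    flush()
    return " ".join(out)


print(normalize('smooth muscle OR "skeletal muscle"'))
```

Under these assumptions the last line prints '(smooth AND muscle) OR "skeletal muscle"', which matches the example given above.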


Multiline textareas (accessions and platforms fields)

Queries can be written or pasted directly in the text areas or uploaded as a .txt file. Different entries may be separated by 'new line', 'space' and/or 'tab' symbols. Several separators following one after another without any term between them are treated as a single separator. Lists are made non-redundant before execution of the query.

Two lists of terms, one from a textarea and one from an uploaded file, are merged in one non-redundant list.
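As a minimal Python sketch of this merging step (the function name `merge_term_lists` is hypothetical, and keeping the order of first occurrence is an assumption; the text above only states that the merged list is non-redundant):

```python
def merge_term_lists(textarea_text, file_text=""):
    """Merge terms from the textarea and an uploaded file without repeats."""
    # str.split() without arguments splits on new lines, spaces and tabs
    # and treats consecutive separators as a single separator.
    terms = (textarea_text + "\n" + file_text).split()
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


print(merge_term_lists("GPL91 A-MEXP-174\nGPL92", "GPL92\t\tGPL91\nGPL570"))
```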


Box A

The following terms can be used as input:

    - GEO experiment accession numbers (e.g. GSExxx, GDSxxx),
    - ArrayExpress experiment accession numbers (e.g. E-XXXX-123),
    - PubMed IDs of papers that refer to microarray data submitted to GEO and ArrayExpress (plain numbers are treated as PMIDs, e.g. PMID11823445 and 11823445 are treated equally).

Order of types of terms is not important.

Maximal length: unlimited.

Example: 'GSE8678 GDS2262 11823445 12620965 E-GEOD-1925 E-MEXP-1148 GDS2598'
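How the mixed terms of box A could be told apart is sketched below in Python. The regular expressions are assumptions derived from the accession examples above (for instance, four capital letters in the middle part of an ArrayExpress accession, and case-sensitive matching); MaRe's actual checks may differ.

```python
import re

# Patterns are tried in order; the first full match decides the type.
PATTERNS = [
    ("GEO series", re.compile(r"GSE\d+")),
    ("GEO dataset", re.compile(r"GDS\d+")),
    ("ArrayExpress experiment", re.compile(r"E-[A-Z]{4}-\d+")),
    # Plain numbers and numbers prefixed with 'PMID' are both PubMed IDs.
    ("PubMed ID", re.compile(r"(?:PMID)?\d+")),
]


def classify(term):
    """Return the assumed type of a single box A term."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(term):
            return label
    return "unrecognized"


def classify_all(text):
    """Classify every whitespace-separated term; order does not matter."""
    return [(term, classify(term)) for term in text.split()]


for term, label in classify_all("GSE8678 GDS2262 11823445 E-GEOD-1925"):
    print(term, "->", label)
```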


Box B

Authors and keyword searches can be performed both in the textual annotation of the microarray repositories (to get experiments directly) and in PubMed. In the latter case, the retrieved abstracts are subsequently linked to the associated data in the microarray repositories.

Search in PubMed often provides relevant experiments which are not found by searching in experiment annotations only.


Authors

Quote authors with double quotes if you use initials.

In GEO, names of authors are in a unique format: 'Surname I' and 'Surname IN', where 'I' and 'N' are initials. It is usually effective to search GEO with 'Surname' or '"Surname I"' as a term.

In ArrayExpress, name format for the same author may vary from entry to entry (for example, 'Name N Surname', 'Surname N.N.', 'Name Surname', etc). 'Surname' can be recommended as a term to avoid missing relevant data.

Use single quote as an apostrophe.

Maximal length: 3000 symbols.

Example: 'Hoffman OR "Joshi V"'


Keywords

Quote keywords with double quotes when searching on terms composed of several words.

For search in GEO, Entrez limits to individual terms can be specified in square brackets. E.g.: brain[Title] - for search in annotations, Science[Journal] - for search in PubMed. These limits will be ignored in AE search.

Maximal length: 3000 symbols.

Example: 'brain[Title] AND ("immature neurons" OR "CBA/CaH (CBA) mice")'


Box C

Query logics: ( Species AND Date ) AND ( Platform keywords AND/OR Platform accessions )


Species

Full species names should be quoted.

Genus names can be used, e.g. 'Xenopus' finds Xenopus laevis, Xenopus tropicalis and other species of the genus.

Maximal length: 3000 symbols.

Example: '"Homo sapiens" AND ("Mus musculus" OR "Rattus norvegicus")'


Date of entry

A search on the date of submission to the databases (not on the date of publication of a paper) is performed. The following formats are accepted:

'2006' - whole year 2006
'2006/01' - whole January, 2006
'2006/01/04' - 4 January, 2006

'2005:2006' - interval from the beginning of 2005 until the end of 2006
'2006/01:2006/06' - interval from the beginning of January, 2006 until the end of June, 2006
'2006/06/01:2006/06/03' - interval from 1 June, 2006 until 3 June, 2006

'2006/06:2006/06/03' - interval from 1 June, 2006 until 3 June, 2006
'2006/06/01:2006' - interval from 1 June, 2006 until 31 December, 2006

Only "|", "OR" and "or" can be used as logical operators.

"(", ")" and other symbols which don't agree with the described format are not allowed.

Maximal length: 3000 symbols.

Example: '2005:2006 OR 2007/01:2007/06 | 2006/01/04'
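The accepted date formats can be read as pairs of first and last days. The following Python sketch shows one possible interpretation; the names `parse_dates` and `_bound` are hypothetical, and the sketch does not check that operators are placed correctly between terms.

```python
import calendar
import re
from datetime import date

# A point in time is YYYY, YYYY/MM or YYYY/MM/DD.
POINT = r"\d{4}(?:/\d{2}(?:/\d{2})?)?"
# A term is a single point or an interval 'point:point'.
TERM_RE = re.compile(rf"({POINT})(?::({POINT}))?")
OR_OPERATORS = {"|", "OR", "or"}


def _bound(point, upper):
    """Expand a partial date to its first (or last, if upper) day."""
    parts = [int(p) for p in point.split("/")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else (12 if upper else 1)
    if len(parts) > 2:
        day = parts[2]
    else:
        day = calendar.monthrange(year, month)[1] if upper else 1
    return date(year, month, day)


def parse_dates(query):
    """Return a list of (first_day, last_day) pairs, one per date term."""
    ranges = []
    for token in query.split():
        if token in OR_OPERATORS:
            continue
        match = TERM_RE.fullmatch(token)
        if match is None:
            # '(', ')', 'AND' and anything else outside the format.
            raise ValueError(f"unsupported date term: {token!r}")
        start = match.group(1)
        end = match.group(2) or start
        ranges.append((_bound(start, upper=False), _bound(end, upper=True)))
    return ranges


for first, last in parse_dates("2005:2006 OR 2007/01:2007/06 | 2006/01/04"):
    print(first, "to", last)
```

In this reading an entry matches the query if its submission date falls into at least one of the returned ranges.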


Platform keywords

For search in GEO, Entrez limits to individual terms can be specified in square brackets, e.g. 'Affymetrix[Title]'. These limits will be ignored in AE search.

Maximal length: 6000 symbols.

Example: 'Illumina OR (Affymetrix AND GeneChip)'


Platform accessions

GEO platform accessions (GPLxxx) and ArrayExpress platform accessions (A-XXXX-123) can be used.

Maximal length: unlimited.

Example: 'GPL91 A-MEXP-174 GPL92'


Query logics

query_box

Choose how the results found separately for A, B and C boxes should be combined.

Select the logical operators and select the brackets.

User-defined 'AND' operators can be switched to 'OR' operators during the check of input if some of the boxes were left empty.


Search options

Options

Choose whether to search for 'Experiments and platforms' or for 'Platforms only'.

Note that when experiments are searched for based on input in the 'Platforms' section of box C, the search proceeds in three steps: first, platforms meeting the terms are found; second, the list of experiments in which at least one of these platforms has been used is retrieved; finally, the list of ALL platforms for these experiments is retrieved (including platforms which do not meet the input 'Platforms' terms). Platforms which were found at the first step but have no experiments in the repository will be ignored in this case. When searching on 'Platforms only', all terms except 'Platforms' in box C are ignored and only platforms which themselves meet the terms are retrieved (unlike in the search for 'Experiments and platforms').
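The difference between the two modes can be shown on toy data. In the Python sketch below, all accessions and both function names are hypothetical, and the set of platforms meeting the terms is passed in directly instead of being searched for.

```python
# Hypothetical repository content: experiment -> platforms used in it.
EXPERIMENT_PLATFORMS = {
    "EXP1": {"PL_A", "PL_B"},
    "EXP2": {"PL_B"},
    "EXP3": {"PL_C"},
}


def experiments_and_platforms(matching_platforms):
    """'Experiments and platforms' mode, starting from matching platforms."""
    # Experiments that used at least one of the matching platforms.
    experiments = {
        exp for exp, used in EXPERIMENT_PLATFORMS.items()
        if used & matching_platforms
    }
    # ALL platforms of these experiments, matching the terms or not.
    platforms = set()
    for exp in experiments:
        platforms |= EXPERIMENT_PLATFORMS[exp]
    return experiments, platforms


def platforms_only(matching_platforms):
    """'Platforms only' mode: just the platforms that meet the terms."""
    return set(matching_platforms)


# PL_D meets the terms but is not used in any experiment.
found = {"PL_A", "PL_D"}
print(experiments_and_platforms(found))
print(platforms_only(found))
```

Here 'PL_D' is dropped in the first mode because no experiment uses it, while 'PL_B' is added because it shares an experiment with 'PL_A'.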

Select the database (GEO and/or ArrayExpress) to search in.

When 'Retrieve only GSE' is chosen for GEO, only GEO Series which meet the query terms are retrieved.

When 'Retrieve only GDS' is chosen for GEO, only GEO DataSets which meet the query terms are retrieved.

When 'Retrieve GSE and GDS' is chosen for GEO, GEO Series and GEO DataSets which meet the query terms are retrieved. GEO Series that do not meet the terms themselves but have GEO DataSets which do meet them are also retrieved. GEO DataSets that do not meet the terms themselves but refer to GEO Series which do meet them are retrieved as well. This system is consistent with the one on the GEO web-site.

ArrayExpress options make it possible to disregard ArrayExpress entries for which duplicate GEO entries have already been found in the same search. Please note that if an entry is present in both GEO and AE but has been found only in AE in the given search, it will be retrieved from AE, not from GEO. There are several reasons why an entry present in both databases may be found in only one of them. First, the annotations in GEO and AE can differ from each other (e.g., GSE934 in GEO is associated with PubMed ID 15867358, whereas E-GEOD-934 in AE is associated with PubMed ID 15992546; GSE2531 does not contain the word "cancer" in its annotation whereas E-GEOD-2531 does). Second, there are slight differences in the sets of annotation fields used for searching by GEO and AE; for example, GEO Entrez does not take into account the citation information on microarray experiments whereas the AE search on the EBI server does. To provide results consistent with those that can be obtained on the web-sites of the databases, MaRe retains these slight differences in the search. Finally, limits for the MaRe 'Keywords' input field are passed only to GEO Entrez and PubMed but not to AE (e.g., the "brain[Title]" MaRe query results in a search in the titles of GEO entries but in the whole of each AE entry).
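The duplicate filter can be sketched in Python as follows. The sketch assumes that an AE accession of the form 'E-GEOD-n' mirrors the GEO Series 'GSEn', as in the pairs GSE934 / E-GEOD-934 and GSE2531 / E-GEOD-2531 mentioned above; the function names are hypothetical and MaRe may detect duplicates differently.

```python
import re


def geo_counterpart(ae_accession):
    """Return the assumed GEO Series accession for an AE entry, or None."""
    match = re.fullmatch(r"E-GEOD-(\d+)", ae_accession)
    return f"GSE{match.group(1)}" if match else None


def drop_ae_duplicates(geo_hits, ae_hits):
    """Keep AE hits whose GEO counterpart was NOT found in this search."""
    geo_found = set(geo_hits)
    return [acc for acc in ae_hits if geo_counterpart(acc) not in geo_found]


# GSE934 was found in GEO, so E-GEOD-934 is disregarded. E-GEOD-2531 is
# kept and retrieved from AE because GSE2531 was not found in GEO.
print(drop_ae_duplicates(["GSE934"],
                         ["E-GEOD-934", "E-GEOD-2531", "E-MEXP-1148"]))
```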

The 'Retrieve raw data' checkbox can be used to download not only processed but also raw data from GEO and ArrayExpress. Raw data are usually much bulkier than processed data and are therefore not downloaded by default.


Start search

email

An e-mail address has to be entered to start the search.


Differences of MaRe search from search on the Entrez GEO web site

In MaRe unquoted terms composed of several words are automatically split into several terms connected to each other with "AND". After such a processing the query is forwarded to Entrez GEO, PubMed and/or AE.
In contrast, on the Entrez GEO web site unquoted terms composed of several words may be processed in a context dependent manner.
E.g.: 'breast cancer' --> (MaRe) 'breast AND cancer' ==> (Entrez) '("breast"[MeSH Terms] OR breast[All Fields]) AND ("neoplasms"[MeSH Terms] OR cancer[All Fields])'
E.g.: 'breast cancer' ==> (Entrez) '"breast neoplasms"[MeSH Terms] OR breast cancer[All Fields]'

Terms composed of several words and quoted with double quotes are processed identically by MaRe and the Entrez GEO web site.
E.g.: '"breast cancer"' --> (MaRe) '"breast cancer"' ==> (Entrez) '"breast cancer"[All Fields]'
E.g.: '"breast cancer"' ==> (Entrez) '"breast cancer"[All Fields]'


Differences of MaRe search from search on the AE web site

Double quotes can be used in MaRe for terms composed of several words but are considered invalid characters in queries on the AE web site.
E.g.: '"breast cancer"' --> (MaRe) '"breast cancer"'
E.g.: '"breast cancer"' --> (AE) 'Invalid character found in text search field!'

Unquoted terms composed of several words are processed identically by MaRe and the AE web site, i.e. split into parts connected to each other with "AND".
E.g.: 'breast cancer' --> (MaRe) 'breast AND cancer'
E.g.: 'breast cancer' --> (AE) 'breast AND cancer'


The following search schemes have been implemented:

For search in GEO

GEO

Comments to the 'Search in GEO' scheme:

        'GEO' - annotation of entries in the GEO database
        'PubMed' - abstracts in PubMed
        'GSE' - Series entries in GEO
        'GDS' - DataSets entries in GEO
        'GPL' - Platforms entries in GEO

Search in GEO is performed remotely on the NCBI server using E-Utilities.
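For orientation, a request to the E-Utilities ESearch service for the GEO database ('gds' in Entrez) could be built as in the Python sketch below. Which E-Utilities calls and parameters MaRe actually uses is not stated here, so the function `build_esearch_url` and its defaults are assumptions; the sketch only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-Utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="gds", retmax=20, email=None):
    """Build an ESearch URL for an already prepared Entrez query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    if email:
        # NCBI asks automated clients to identify themselves.
        params["email"] = email
    # urlencode escapes quotes, brackets and spaces in the query.
    return ESEARCH_URL + "?" + urlencode(params)


print(build_esearch_url('brain[Title] AND "immature neurons"'))
```

The same function with db="pubmed" would address PubMed instead of GEO.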


For search in ArrayExpress


AE

Comments to the 'Search in ArrayExpress' scheme:

        'AE' - annotation of entries in the ArrayExpress database
        'PubMed' - abstracts in PubMed
        'EXP ACCN' - Experiment entries in ArrayExpress
        'PL ACCN' - Platforms entries in ArrayExpress

Search in ArrayExpress is performed locally using the ArrayExpress annotation XML file. The file is downloaded by MaRe from EBI during the search if the previous version of the file is older than one hour.
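The one-hour rule amounts to a simple age check on the local copy of the file. A minimal Python sketch, in which the function name `needs_refresh` and the use of the file modification time are assumptions:

```python
import os
import time

MAX_AGE_SECONDS = 60 * 60  # one hour


def needs_refresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the local annotation file is missing or too old."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    # Age is measured from the last modification of the local copy.
    return now - os.path.getmtime(path) > max_age


# Hypothetical file name; a missing file always needs a download.
print(needs_refresh("ae_annotation.xml"))
```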

Choosing entries to download

After the search is complete, the results of the search are displayed as an interactive html page.

Choose entries and press 'Download'.

An e-mail message is immediately generated to inform the user that the downloading job has been started.

Files for the chosen entries are downloaded from GEO and ArrayExpress to the MaRe server, archived files are unpacked, and all files are placed into a hierarchical system of folders, which is finally packed into a single TAR ball.

After the job is ready (this can take several hours for large jobs), an e-mail is sent to you with the URL link to the TAR ball on the MaRe server. Now you can get the downloaded files all together and untar them locally on your computer.
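Packing a folder hierarchy into a TAR ball and unpacking it again can be sketched with Python's standard `tarfile` module. The folder and file names in the demo are hypothetical and do not reflect the real layout of a MaRe job; on the command line, `tar -xf` on the downloaded file does the same as `unpack_archive`.

```python
import os
import tarfile
import tempfile


def pack_folder(folder, archive_path):
    """Pack a folder and everything below it into one uncompressed tar."""
    with tarfile.open(archive_path, "w") as tar:
        tar.add(folder, arcname=os.path.basename(folder))


def unpack_archive(archive_path, destination):
    """Extract a tar archive into the destination folder."""
    os.makedirs(destination, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        # Only extract archives that come from a source you trust.
        tar.extractall(destination)


def demo():
    """Round trip on a temporary, made-up job folder."""
    with tempfile.TemporaryDirectory() as work:
        job = os.path.join(work, "job")
        os.makedirs(os.path.join(job, "GSE8678"))
        with open(os.path.join(job, "GSE8678", "readme.txt"), "w") as fh:
            fh.write("placeholder\n")
        archive = os.path.join(work, "job.tar")
        pack_folder(job, archive)
        target = os.path.join(work, "unpacked")
        unpack_archive(archive, target)
        return os.path.exists(
            os.path.join(target, "job", "GSE8678", "readme.txt"))


print(demo())
```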

Example of a downloaded job

output

Marked with '*' are files and folders which are kept in GEO and ArrayExpress as archives and which were unpacked during the downloading process.

Disk space remarks

Currently, the disk space available on the MaRe server is limited to 150 GB. Downloaded files are deleted from the server daily. If the server runs out of disk space because of a large number of jobs within a day, it will have space available again after 24 hours.

Downloading from MaRe server

Data from the MaRe server should be downloaded to the user's local machine no later than 24 hours after the link to the completed job has been sent to the user. Whether or not they have been downloaded by the user, the data will be removed from the MaRe server 24 hours after the link has been sent.

Other remarks

Sometimes E-Utilities work slowly because of problems on the NCBI server, and MaRe then also works slower than usual. When this occurs (for example, when a search for a single entry in GEO takes more than 1 minute), the problems at NCBI are usually solved within a few days, after which MaRe works quickly again.