ezPAARSE jobs can be configured using HTTP headers. The available headers are listed below.
Encoding of the data sent. (supported: gzip, deflate)
Encoding of the data sent back by the server. (supported: gzip, deflate)
Output format. Supported:
- text/csv (by default)
- text/tab-separated-values (for a TSV output: as CSV but tab-delimited)
- application/jsonstream (one JSON object per line)
Format of the input log lines; it depends on the proxy (`xxx`) used. See the available formats.
Date format used in the logs sent. Default is: 'DD/MMM/YYYY:HH:mm:ss Z'.
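The default format string uses moment.js-style tokens. As an illustration, the Python equivalent of 'DD/MMM/YYYY:HH:mm:ss Z' is the strptime pattern below (the sample timestamp is invented for the example):

```python
from datetime import datetime

# Python strptime equivalent of the moment.js tokens 'DD/MMM/YYYY:HH:mm:ss Z'
STRPTIME_EQUIVALENT = "%d/%b/%Y:%H:%M:%S %z"

def parse_log_date(raw: str) -> datetime:
    """Parse a timestamp in the default ezPAARSE date format."""
    return datetime.strptime(raw, STRPTIME_EQUIVALENT)

parsed = parse_log_date("12/Nov/2023:14:05:59 +0100")
print(parsed.isoformat())  # 2023-11-12T14:05:59+01:00
```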
Comma-separated list of fields that will be encrypted in the results, or `none` to disable encryption. A default list of fields applies when this header is omitted.
Caution: each job uses a random salt for encryption, so encrypted values for the same access event are not identical across distinct jobs. Use the Crypting-Salt header to change this behavior.
A specific salt to use if you want fields to be encrypted the same way across different jobs.
The algorithm used to encrypt fields. It must be supported by the version of OpenSSL installed on the platform. On recent OpenSSL releases, `openssl list -digest-algorithms` displays the available algorithms.
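The effect of the salt can be sketched in a few lines of Python. This is an illustrative sketch only, not ezPAARSE's actual implementation: it shows why a random per-job salt yields different digests for the same value across jobs, while a shared salt (cf. Crypting-Salt) keeps them comparable.

```python
import hashlib
import os

def crypt_field(value: str, salt: bytes, algorithm: str = "sha1") -> str:
    """Hash a field value with a salt (illustrative sketch only)."""
    h = hashlib.new(algorithm)
    h.update(salt + value.encode("utf-8"))
    return h.hexdigest()

# Two jobs with random salts: same host value, different digests.
a = crypt_field("proxy.example.org", os.urandom(16))
b = crypt_field("proxy.example.org", os.urandom(16))
print(a != b)  # True (with overwhelming probability)

# A shared salt makes digests comparable across jobs.
shared = b"my-shared-salt"
print(crypt_field("proxy.example.org", shared) == crypt_field("proxy.example.org", shared))  # True
```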
To specify the fields to include in the output (if the format allows it). (More information)
To specify the verbosity level of ezPAARSE's feedback. Higher levels include the lower ones.
- error: blocking errors, abnormal processing termination.
- warn: non-fatal errors.
- info: general information (requested format, end-of-job notification, number of access events generated...).
- verbose: more detailed than info; reports on each stage of processing.
- silly: every detail of the processing (parser not found, line ignored, unsuccessful search in a PKB...).
List of the reject files to create, separated by commas.
Possible values are:
none by default.
We recommend setting it to `all` when you start using ezPAARSE, to fully understand the filtering and exclusion system.
Parameters used for deduplication. (More information).
Character encoding used for input (see supported encodings).
Character encoding used for output (see supported encodings).
Maximum number of lines that ezPAARSE will attempt to parse in order to check the log format.
If set to `true`, ezPAARSE will only filter out the lines that are certainly irrelevant and output the relevant ones. The goal of this parameter is to reduce the size of the log file if you need to store it for further processing.
# Video Demonstration
This screencast demonstrates the usage of the Clean-Only parameter (i.e. cleaning a log file for size reduction and ease of storage).
If URLs don't have a domain part, use this parameter to force the right parser to be used. Useful for Open Access log analysis, where URLs don't have a domain part (all URLs come from the same domain).
Can be used in conjunction with Force-ECField-Publisher.
List of the geolocation information to be added to the results. Defaults to geoip-longitude, geoip-latitude, geoip-country. Use `all` to include every available field, or `none` to deactivate geolocation altogether. (More information)
List of notifications to send when processing is done, written as `action<target>` and separated by commas. Currently available:
Inserts a list of middlewares that are not present in the base configuration (`EZPAARSE_MIDDLEWARES`). The value must be a comma-separated list of middleware names, in order of use. By default, they are inserted at the end of the chain, just before `qualifier`. You can prefix the list with `(before <middleware name>)` or `(after <middleware name>)` to insert them at a more specific place, or with `(only)` to use only the middlewares you specify.
[v3.7.0 and above] If you need to insert middlewares at different places, you can declare multiple lists separated with `|` (see the example below).
'ezPAARSE-Middlewares': 'user-agent-parser, sudoc'
'ezPAARSE-Middlewares': '(before istex) user-agent-parser'
'ezPAARSE-Middlewares': '(after sudoc) hal, istex'
'ezPAARSE-Middlewares': '(only) crossref'
'ezPAARSE-Middlewares': '(after deduplicator) crossref | (before geolocalizer) host-chain'
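The header examples above can be emulated with a short Python sketch. This is purely illustrative (not ezPAARSE's actual parsing code) and assumes a simplified base chain; it shows how each clause decides the insertion point.

```python
import re

def apply_middlewares(chain, header_value, default_anchor="qualifier"):
    """Illustrative interpretation of the ezPAARSE-Middlewares header.

    Each '|'-separated clause inserts a comma-separated list of
    middleware names into the chain, at the position given by an
    optional '(before X)', '(after X)' or '(only)' prefix.
    """
    chain = list(chain)
    for clause in header_value.split("|"):
        clause = clause.strip()
        m = re.match(r"^\((only|before\s+(\S+)|after\s+(\S+))\)\s*(.*)$", clause)
        if m:
            names = [n.strip() for n in m.group(4).split(",") if n.strip()]
            if m.group(1) == "only":
                return names                       # replace the whole chain
            if m.group(2):                         # (before <name>)
                idx = chain.index(m.group(2))
            else:                                  # (after <name>)
                idx = chain.index(m.group(3)) + 1
        else:
            names = [n.strip() for n in clause.split(",") if n.strip()]
            idx = chain.index(default_anchor)      # default: just before qualifier
        chain[idx:idx] = names
    return chain

base = ["filter", "deduplicator", "geolocalizer", "qualifier"]
print(apply_middlewares(base, "(after deduplicator) crossref | (before geolocalizer) host-chain"))
# ['filter', 'deduplicator', 'crossref', 'host-chain', 'geolocalizer', 'qualifier']
```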
Set to `false` to deactivate data enrichment (geoip and knowledge bases). Any other value leaves data enrichment active.
Tells ezPAARSE to use a predefined set of parameters. For example: `inist` for INIST-CNRS parameters.
Set to `false` to prevent lines with HTTP status codes 301 and 302 from being filtered out and discarded.
Set to `false` to disable filtering on status codes, or provide a comma-separated list of status codes that should be kept.
If you provide your own list, ECs with a status of `403` won't be marked as denied and will appear in the main result file.
Only keep status 200, 201 and 403
Disable filters applying to robots or arbitrary hosts/domains (by default, no filter is disabled).
Possible values (separated by commas):
`all` to disable all of the above filters.
NB: when robots are not filtered, add the `robot` field to the output in order to know which consultations were made by robots.
Set the publisher_name field to a predefined value. For example: `Force-ECField-Publisher: 'IRevues'`.
Change the fields used to generate session IDs and user IDs. By default, the generator uses either `cookie`, or a combination of `host` and `user-agent`, and stores the generated IDs in `session_id` and `user_id`. You can customize those fields by providing a mapping separated by commas.
Default mapping:
user: login, cookie: cookie, host: host, useragent: user-agent, session: session_id, userid: user_id
If your user login is in the `user_login` field:
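The default behavior described above can be sketched as follows. This is an illustrative simplification, not ezPAARSE's real generator: it prefers the cookie when present and otherwise derives a stable ID from host and user-agent, using the default mapping.

```python
import hashlib

DEFAULT_MAPPING = {"cookie": "cookie", "host": "host", "useragent": "user-agent"}

def make_session_id(ec: dict, mapping: dict = DEFAULT_MAPPING) -> str:
    """Derive a session ID from an access event (illustrative sketch only)."""
    cookie = ec.get(mapping["cookie"])
    if cookie:
        # A cookie identifies the session directly.
        basis = cookie
    else:
        # Fall back to a combination of host and user-agent.
        basis = f"{ec.get(mapping['host'], '')}|{ec.get(mapping['useragent'], '')}"
    return hashlib.sha1(basis.encode("utf-8")).hexdigest()[:12]

ec = {"host": "140.128.66.1", "user-agent": "Mozilla/5.0"}
print(make_session_id(ec))
```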
Extract values from a field and dispatch them into new fields. The syntax is as follows:
source_field => extract_expression => destination_fields
The following examples assume we have a login field with the value THEODORE_MCCLURE. Here are multiple ways to create a firstname field containing THEODORE and a lastname field containing MCCLURE.
# Extracting with a regular expression:
If the extract expression is a regular expression (between slashes, with optional flags after the closing slash), it's applied to the source field and the captured groups are stored in the destination fields.
The following expression applies the regular expression /^([a-z]+)_([a-z]+)$/i to the login field, and puts the captured groups in the firstname and lastname fields.
login => /^([a-z]+)_([a-z]+)$/i => firstname,lastname
# Splitting over an expression:
If the extract expression is split(), the source field is split according to the expression between the parentheses.
The following splits the login field on the character `_` and puts the parts in the firstname and lastname fields.
'Extract': 'login => split(_) => firstname,lastname'
The following splits the login field with the regular expression /[_]+/ and puts the parts in the firstname and lastname fields.
'Extract': 'login => split(/[_]+/) => firstname,lastname'
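For illustration, the three forms above behave like the following Python equivalents (a sketch with the same field names as the examples, not ezPAARSE's own code):

```python
import re

ec = {"login": "THEODORE_MCCLURE"}

# Regex form: login => /^([a-z]+)_([a-z]+)$/i => firstname,lastname
match = re.match(r"^([a-z]+)_([a-z]+)$", ec["login"], re.IGNORECASE)
ec["firstname"], ec["lastname"] = match.groups()

# Split form: login => split(_) => firstname,lastname
firstname, lastname = ec["login"].split("_")

# Split form with a regex: login => split(/[_]+/) => firstname,lastname
firstname, lastname = re.split(r"[_]+", ec["login"])

print(ec["firstname"], ec["lastname"])  # THEODORE MCCLURE
```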
# Metadata enrichment
The use of middlewares to enrich access events with metadata coming from external APIs is controlled by headers.