# Ecosystem

# platform-init

This interactive command-line (CLI) utility creates the structure of a platform's parser. It asks a series of questions and generates the repository structure for the parser, with a manifest.json file, a parser skeleton and an empty test file. The command is interactive and doesn't take any parameters.

Example:

cd ezpaarse/
. ./bin/env
platform-init
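
The generated layout looks roughly like this (a sketch only: "myplatform" is a placeholder and the exact file names depend on the answers given, but a parser directory conventionally holds a manifest.json, a parser.js and a test/ directory):

platforms/myplatform/
├── manifest.json
├── parser.js
└── test/
    └── myplatform.2014-01-01.csv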

# pkb-cleaner

Detects and deletes duplicates in the knowledge bases.

Usage: pkb-cleaner [-nvp] [DIR_TO_CLEAN]

Options:
  --platform, -p   Name of a platform whose PKB should be cleaned (if provided, the directory path is ignored).
  --norewrite, -n  If provided, do not rewrite files once the check is complete.
  --verbose, -v    Print all duplicated entries.

Example:

pkb-cleaner ./path/to/some/directory
pkb-cleaner --platform=sd
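
To preview duplicates without modifying any file, the --norewrite and --verbose flags can be combined for a dry run:

pkb-cleaner -nv ./path/to/some/directory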

# scrape

Launches the scrapers for one or more platforms. The scrapers are small utility programs that assemble a knowledge base by scraping a publisher's website.

Usage: scrape [-alvfc] [Platform] [Platform] ...

Options:
  --all, -a      Execute all scrapers.
  --list, -l     Only list scrapers without executing them.
  --clean, -c    Clean PKB files when all scrapers have been executed.
  --force, -f    Overwrite PKB files if they already exist.
  --verbose, -v  Print scrapers output into the console.

Example:

scrape sd cbo # launches the scrapers for SD (ScienceDirect) and CBO
scrape -al    # lists all the existing scrapers without launching them
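
The flags can also be combined for a full refresh; for instance, the following re-runs every scraper, overwrites existing PKB files and cleans them afterwards:

scrape -afc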

# loginjector

Streams a log file to a local instance of ezPAARSE.

Example:

zcat monezproxy.log.gz | ./bin/loginjector

Usage:

Injects data into ezPAARSE and gets the response
Usage: node ./loginjector

Options:
  --input, -i     a file to inject into ezPAARSE (default: stdin)
  --output, -o    a file to send the result to (default: stdout)
  --server, -s    the server to send the request to (ex: http://ezpaarse.com:80); if omitted, sends to a local instance
  --proxy, -p     the proxy which generated the log file
  --format, -f    the format of log lines (ex: %h %u [%t] "%r")
  --encoding, -e  encoding of sent data (gzip, deflate)
  --accept, -a    wanted type for the response (text/csv, application/json)

This command makes it easier to send log files to an ezPAARSE instance than using the cURL utility directly.
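
For comparison, a roughly equivalent cURL invocation might look like the following (a sketch assuming a local instance on ezPAARSE's default port 59599; the port and headers are assumptions, not taken from this page):

zcat monezproxy.log.gz | curl -X POST --data-binary @- -H "Accept: text/csv" http://127.0.0.1:59599 > result.csv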

# loganonymizer

Anonymizes a log file. Sensitive elements, such as the login, machine name or IP address, are replaced with random values. The log file should be sent to the standard input (stdin) of the command.

Example:

zcat monezproxy.log.gz | ./bin/loganonymizer

Usage:

Anonymize critical data in a log file
Usage: node ./loganonymizer --input=[string] --output=[string] --proxy=[string] --format=[string]

Options:
  --input, -i   the input data to clean
  --output, -o  the destination where to send the result to
  --proxy, -p   the proxy which generated the log file
  --format, -f  the format of log lines (ex: %h %u [%t] "%r")

This is useful for generating test files while removing sensitive items (in line with the protection of personal data). Each distinct value is always replaced by the same random value, so associations between lines are preserved and deduplication remains possible.
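
As an illustration (all values hypothetical), two input lines sharing the same IP and login still share one after anonymization:

# input:  192.168.12.34 jdupont [18/Nov/2012:10:23:38 +0100] "GET /article/1 HTTP/1.1"
# input:  192.168.12.34 jdupont [18/Nov/2012:10:24:02 +0100] "GET /article/2 HTTP/1.1"
# output: 83.94.211.7 ab3xk9 [18/Nov/2012:10:23:38 +0100] "GET /article/1 HTTP/1.1"
# output: 83.94.211.7 ab3xk9 [18/Nov/2012:10:24:02 +0100] "GET /article/2 HTTP/1.1"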

# logextractor

Retrieves one or more fields from a log file. The log file should be sent to the standard input (stdin) of the command.

Examples:

zcat monezproxy.log.gz | ./bin/logextractor --fields=url
zcat monezproxy.log.gz | ./bin/logextractor --fields=login,url --separator="|"

Usage:

Extract specific fields from a log stream
Usage: node ./logextractor --fields=[string] --separator=";"

Options:
  --fields, -f            fields to extract from log lines (ex: url,login,host)  [required]
  --separator, --sep, -s  character to use between each field                    [required]  [default: "\t"]
  --input, -i             a file to extract the fields from (default: stdin)
  --output, -o            a file to write the result into (default: stdout)
  --proxy, -p             the proxy which generated the log file
  --format, -t            the format of log lines (ex: %h %u [%t] "%r")

This is useful for manipulating log files. A common use is extracting the URLs from a log file in order to analyze a publisher's platform. For example, here's how to get the URLs for the ScienceDirect platform, sorted alphabetically and deduplicated:

zcat monezproxy.log.gz | ./bin/logextractor --fields=url | grep "sciencedirect" | sort | uniq

# csvextractor

Extracts content from a CSV file. The CSV file must be sent to the standard input (stdin) of the command.

Example:

cat monfichier.csv | ./bin/csvextractor

Usage:

Parse a csv source into json.
  Usage: csvextractor [-sc] [-f string | -d string | -k string] [--noheader]

Options:
  --file, -f          A csv file to parse. If absent, will read from standard input.
  --fields, -d        A list of fields to extract. By default, all fields are extracted. (Ex: --fields issn,pid)
  --key, -k           If provided, the matching field will be used as a key in the resulting json.
  --silent, -s        If provided, empty values or nonexistent fields won't be shown in the results.
  --csv, -c           If provided, the result will be a csv.
  --json, -j          If provided, the result will be a JSON.
  --jsonstream, --js  If provided, the result will be a JSON stream (one JSON per line).
  --noheader          If provided, the result won't have a header line (csv output only).

This command is useful for testing the parser directly from the test file by extracting the URL column of the file.

Example (parser test):

cat ./test/npg.2013-01-16.csv | ../../bin/csvextractor --fields='url' -c --noheader | ./parser.js
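
The --key option can be combined with a JSON output to index rows by one of their columns; for instance (the issn column name is only illustrative):

cat monfichier.csv | ./bin/csvextractor --key=issn --json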

# csvtotalizer

Produces a summary of the content of a CSV file resulting from an ezPAARSE processing. The CSV file must be sent to the standard input (stdin) of the command.

Example:

cat monresultat.csv | ./bin/csvtotalizer

Usage:

Summarize fields from a CSV stream
Usage: node ./bin/csvtotalizer --fields=[string] --output="text|json"

Options:
  --output, -o  output: text or json                                         [required]  [default: "text"]
  --sort, -s    sort: asc or desc in text mode                               [required]  [default: "desc"]
  --fields, -f  fields to compute from the CSV (ex: domain;host;login;type)  [required]  [default: "domain;host;login;type"]

This is useful for getting a quick overview of the outcome of processing a log file with ezPAARSE. By default, the domain, host, login and type fields are summarized in text format. Here is how to find out how many different consultation events were recognized in a sample file:

cat ./test/dataset/sd.2012-11-30.300.log | ./bin/loginjector | ./bin/csvtotalizer

# logfaker

Generates, on stdout, a stream of log lines matching a given platform.

Example:

./logfaker | ./loginjector

Usage:

Usage: node ./logfaker --platform=[string] --nb=[num] --rate=[num] --duration=[num]

Options:
  --platform      the publisher platform code used as a source for generating url  [required]  [default: "sd"]
  --nb, -n        number of lines of log to generate                               [required]  [default: "nolimit"]
  --rate, -r      number of lines of log to generate per second (max 1000)         [required]  [default: 10]
  --duration, -d  stop log generation after a specific number of seconds           [required]  [default: "nolimit"]

Useful for testing the performance of ezPAARSE.
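
For instance, the following stresses a local instance for 30 seconds at the maximum rate, discarding the results:

./bin/logfaker --rate=1000 --duration=30 | ./bin/loginjector > /dev/null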

# pkbvalidator

Checks the validity of a knowledge base file for a publisher's platform. The file must conform to the KBART format.

This command checks the following:

  • The presence of the .txt extension
  • Uniqueness of title_id
  • Minimal identification information available
  • Syntax check of standardized identifiers (ISSN, ISBN, DOI)

Usage:

Check a platform knowledge base file.
  Usage: node ./bin/pkbvalidator [-cfsv] pkb_file1.txt [pkb_file2.txt]

Options:
  --silent, -s   If provided, no output is generated.
  --csv, -c      If provided, the error output will be a csv.
  --verbose, -v  Show statistics of the check.
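
Example (the PKB path is an assumption based on the usual platforms/<name>/pkb layout):

./bin/pkbvalidator -v platforms/sd/pkb/*.txt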

# ezp process

Lets you process one or more files with an instance of ezPAARSE. If no files are provided, the command will read from stdin. The results are printed to stdout, unless you set an output file with --out.

Options:
  --output, --out, -o       Output file
  --header, --headers, -H   Add a header to the request (ex: "Reject-Files: all")
  --download, -d            Download a file from the job directory
  --verbose, -v             Shows detailed operations.
  --settings, -s            Set a predefined setting.

Examples of use:

# Simple case, process ezproxy.log and write results to result.csv
ezp process ezproxy.log --out result.csv

# Same as above, and download the report file
ezp process ezproxy.log --out result.csv --download job-report.html

# Download the report file with a custom path
ezp process ezproxy.log --out result.csv --download job-report.html:./reports/report.html

# Reading from stdin and redirecting stdout to file
cat ezproxy.log | ezp process > result.csv
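
# Selecting output fields via a request header (Output-Fields is an ezPAARSE
# header; the chosen fields below are only illustrative)
ezp process ezproxy.log -H "Output-Fields: +datetime,-host" --out result.csv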

# ezp bulk

Processes the files in sourceDir and saves the results in destDir. If destDir is not provided, results are stored in sourceDir, alongside the source files. When processing files recursively with the -r option, destDir mimics the structure of sourceDir. Result files keep the name of their source file, with an .ec.csv extension. Files with existing results are skipped, unless the --force flag is set. By default, the result file and the job report are downloaded, but you can get additional files from the job directory by using the --download option.

Options:
  --header, --headers, -H   Add a header to the request (ex: "Reject-Files: all")
  --settings, -s            Set a predefined setting.
  --recursive, -r           Look for log files into subdirectories
  --download, -d            Download a file from the job directory
  --overwrite, --force, -f  Overwrite existing files
  --verbose, -v             Shows detailed operations.
  --list, -l                Only list log files in the directory

Examples of use:

# Simple case, processing files recursively from ezproxy-logs and storing results in ezproxy-results
ezp bulk -r ezproxy-logs/ ezproxy-results/

# Activating reject files and downloading unqualified log lines along results
ezp bulk -r ezproxy-logs/ ezproxy-results/ -H "Reject-Files: all" --download lines-unqualified-ecs.log
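
# Previewing the log files that would be processed, without launching any job
ezp bulk -rl ezproxy-logs/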

A result file (.ec.csv extension) and a report in HTML format (.report.html extension) are generated in the output directory for each log file. If the destination directory is not specified, they are generated in the same directory as the file being processed. If an error occurs while processing a file, the incomplete result file is given the .ko extension. Reject files are not retained by ezPAARSE.

Usage:

Inject files to ezPAARSE (for batch purposes)
  Usage: ./bin/ecbulkmaker [-rflvH] SOURCE_DIR [RESULT_DIR]

Options:
  --recursive, -r  If provided, files in subdirectories will be processed. (preserves the file tree)
  --list, -l       If provided, only list files.
  --force, -f      Override existing results (default: false).
  --header, -H     header parameter to use.
  --verbose, -v    Shows detailed operations.

# Video Demonstration

This screencast demonstrates the usage of ecbulkmaker (i.e. processing a directory containing log files and outputting a mirror directory with the results).

# hostlocalize

Enriches a CSV result file containing a host name with the geolocation of the IP address.

Example:

./hostlocalize -f ezpaarsedata.csv > ezpaarsedatalocalised.csv

The input file is assumed to contain a field with the IP address to geolocate.

Usage:

Enrich a csv with geolocation from the host ip.
  Usage: node ./bin/hostlocalize [-s] [-f string | -k string]

Options:
  --hostkey, -k  The field name containing the host IP (default: "host").
  --file, -f     A csv file to parse. If absent, will read from standard input.
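
Example (the "ip" field name is hypothetical; use the name of the column that actually holds the address):

cat ezpaarsedata.csv | ./bin/hostlocalize --hostkey=ip > ezpaarsedatalocalised.csv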