From 4212e063092473c2486469305fa5828e8dbde725 Mon Sep 17 00:00:00 2001
From: Ed Morley <edmorley@users.noreply.github.com>
Date: Tue, 5 Sep 2017 14:32:04 +0100
Subject: [PATCH] NLTK support: Fix passing of multiple corpora identifiers
 (#460)

* NLTK support: Update test to use multiple corpora

So that the incorrect handling of multiple IDs seen in #444 would
have been caught.

Also switches to some of the smaller corpora, to reduce time spent
downloading during tests (see sizes on http://www.nltk.org/nltk_data/).

* NLTK support: Fix passing of multiple corpora identifiers

As part of fixing the shellcheck warnings in #438, double quotes had
been placed around `$nltk_packages` passed to the `nltk.downloader`,
which causes multiple identifiers to be treated as though they were
a single identifier that contains spaces.

The docs for the shellcheck warning in question recommend using arrays
if the intended behaviour really is to split on spaces:
https://github.com/koalaman/shellcheck/wiki/SC2086#exceptions

As such, `readarray` has been used, which is present in bash >=4.
The `[*]` array form is used in the log message, to prevent shellcheck
warning SC2145, whereas `[@]` is used when passed to `nltk.downloader`
to ensure the array elements are unpacked as required.
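
A minimal sketch of the difference, assuming bash >=4. The temp file
stands in for `nltk.txt` and `count_args` is a hypothetical stand-in
for `nltk.downloader` that only reports how many arguments it received:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for nltk.txt: one identifier per line, unix line endings.
definition_file="$(mktemp)"
printf 'city_database\nstopwords\n' > "$definition_file"

# Hypothetical stand-in for `python -m nltk.downloader`: prints its argument count.
count_args() { echo "$#"; }

# Old approach: newlines replaced by spaces, then passed as one quoted string.
joined=$(tr "\n" " " < "$definition_file")
old_count=$(count_args "$joined")

# New approach: one array element per line, unpacked with [@].
readarray -t nltk_packages < "$definition_file"
new_count=$(count_args "${nltk_packages[@]}")

# [*] joins the elements into a single word, which suits a log message.
log_line="Downloading NLTK packages: ${nltk_packages[*]}"

echo "old argument count: $old_count"
echo "new argument count: $new_count"
echo "$log_line"

rm -f "$definition_file"
```

With the quoted string the stand-in sees one argument, whereas the
array form passes each identifier separately.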

Note: Both before and after this fix, using anything but unix line
endings in `nltk.txt` will also cause breakage.
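
A rough illustration of the line-ending issue for the `readarray`
form, assuming bash >=4 (the temp file is a hypothetical `nltk.txt`
saved with CRLF line endings):

```shell
#!/usr/bin/env bash
# Hypothetical nltk.txt written with Windows (CRLF) line endings.
crlf_file="$(mktemp)"
printf 'city_database\r\nstopwords\r\n' > "$crlf_file"

# -t strips only the trailing newline, so the carriage return stays in each element.
readarray -t crlf_packages < "$crlf_file"

first="${crlf_packages[0]}"
last_char="${first: -1}"

# %q prints the value in a shell-quoted form that makes the stray character visible.
printf 'first element: %q\n' "$first"

rm -f "$crlf_file"
```

Each identifier then carries a trailing carriage return, so it no
longer matches the name that was written in the file.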
---
 bin/steps/nltk              | 7 +++----
 test/fixtures/nltk/nltk.txt | 3 ++-
 test/run                    | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/bin/steps/nltk b/bin/steps/nltk
index 92f3fff7..b2ded512 100755
--- a/bin/steps/nltk
+++ b/bin/steps/nltk
@@ -21,10 +21,10 @@ if sp-grep -s nltk; then
 
     if [ -f "$nltk_packages_definition" ]; then
 
-        nltk_packages=$(tr "\n" " " < "$nltk_packages_definition")
-        puts-step "Downloading NLTK packages: $nltk_packages"
+        readarray -t nltk_packages < "$nltk_packages_definition"
+        puts-step "Downloading NLTK packages: ${nltk_packages[*]}"
 
-        python -m nltk.downloader -d "$BUILD_DIR/.heroku/python/nltk_data" "$nltk_packages"  | indent
+        python -m nltk.downloader -d "$BUILD_DIR/.heroku/python/nltk_data" "${nltk_packages[@]}" | indent
         set_env NLTK_DATA "/app/.heroku/python/nltk_data"
 
     else
@@ -32,4 +32,3 @@ if sp-grep -s nltk; then
         puts-warn "Learn more: https://devcenter.heroku.com/articles/python-nltk"
     fi
 fi
-
diff --git a/test/fixtures/nltk/nltk.txt b/test/fixtures/nltk/nltk.txt
index c4f6bba8..1e578bfa 100644
--- a/test/fixtures/nltk/nltk.txt
+++ b/test/fixtures/nltk/nltk.txt
@@ -1 +1,2 @@
-wordnet
\ No newline at end of file
+city_database
+stopwords
diff --git a/test/run b/test/run
index 130b0e7b..3064ef91 100755
--- a/test/run
+++ b/test/run
@@ -24,7 +24,7 @@ testGEOS() {
 
 testNLTK() {
   compile "nltk"
-  assertCaptured "wordnet"
+  assertCaptured "Downloading NLTK packages: city_database stopwords"
   assertCapturedSuccess
 }
 
-- 
GitLab