forager docs

This page contains more information about the key components of forager: data, the default policy for handling out-of-vocabulary words, switch methods, models, and outputs.

data

To use forager on one’s own data, the user needs to upload a single text/CSV file of fluency lists with two columns (with headers): one for the participant identifier and one for the item they produced. For example, if participant 1 produced 30 response items, there would be 30 rows with “1” in the first column, one for each item in their fluency list. Rows should be separated by a newline, and columns should be separated by the same delimiter throughout (such as a tab, space, or comma). Most spreadsheet tools (e.g., Excel) can save files as a CSV, which can also be converted to a text file. Below is an example of a file that forager will be able to process:

  SID entry
  1 bat
  1 cat
  1 horse
  2 dog
  2 hamster
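
As a rough illustration of this layout, the sketch below reads such a file into one ordered list per participant using only the Python standard library. The function name `read_fluency_lists` is hypothetical and is not part of forager; the package has its own loading code.

```python
import csv
import io
from collections import defaultdict

def read_fluency_lists(text, delimiter="\t"):
    """Group items by participant, keeping the order in which they were produced.

    Hypothetical helper for illustration only (not a forager function).
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)  # skip the header row (e.g., "SID entry")
    lists = defaultdict(list)
    for row in reader:
        if not row:  # ignore blank lines
            continue
        sid, item = row[0], row[1]
        lists[sid].append(item.strip())
    return dict(lists)

# The example file from above, with a single space as the delimiter.
example = "SID entry\n1 bat\n1 cat\n1 horse\n2 dog\n2 hamster\n"
fluency_lists = read_fluency_lists(example, delimiter=" ")
print(fluency_lists)
```

Each participant identifier maps to that participant's items in production order, which is the unit that the later analyses operate on.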

Please note: The current version of forager already contains the necessary lexical data to process English-language VFT data for the “animals” category, and the web interface works for this category. However, those who wish to analyze data from a different category should use the Python package directly: upload their own lexicon of acceptable words and corresponding semantic embeddings for that category, and derive the necessary frequency and similarity data using the functions provided in the package or via the Colab interface.

Please note: If your file has a third column, forager will automatically assume that it is a timepoint, and will split each participant's data into separate lists by timepoint. For example, suppose the file has the following format:

  SID entry time
  1 bat 1
  1 cat 1
  1 horse 2
  2 dog 1
  2 hamster 2

In this case, forager will treat the data from participant 1 as two separate lists, one for timepoint 1 and one for timepoint 2. If you do not want this behavior, please remove the time column from your file before uploading it to forager.
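
The grouping described here can be sketched as follows, assuming a space-delimited file. The helper `split_by_timepoint` is a hypothetical name used only for illustration and is not forager's own implementation.

```python
import csv
import io
from collections import defaultdict

def split_by_timepoint(text, delimiter=" "):
    """Return one list per (participant, timepoint) pair.

    Hypothetical helper: when no third column is present, the timepoint is None.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)  # skip the header row
    lists = defaultdict(list)
    for row in reader:
        fields = [f for f in row if f != ""]  # drop empties left by stray delimiters
        if not fields:
            continue
        sid, item = fields[0], fields[1]
        time = fields[2] if len(fields) > 2 else None
        lists[(sid, time)].append(item)
    return dict(lists)

timed = "SID entry time\n1 bat 1\n1 cat 1\n1 horse 2\n2 dog 1\n2 hamster 2\n"
by_timepoint = split_by_timepoint(timed)
print(by_timepoint[("1", "1")])  # participant 1, timepoint 1
print(by_timepoint[("1", "2")])  # participant 1, timepoint 2
```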

handling out-of-vocabulary (OOV) words

All search-related functions in forager rely on the lexical measures mentioned above. Therefore, the items in the fluency lists must be in the stored lexicon (i.e., in the files containing the embeddings, frequencies, and similarity matrices). An out-of-vocabulary (OOV) item is automatically replaced by a close match in the lexicon, if forager finds a reasonable replacement. forager classifies a replacement as reasonable if the Levenshtein edit distance between the OOV item and its closest match in the lexicon is two or less (e.g., horses would be replaced by horse). This policy allows correcting for minor variants, plurals, and spelling errors within the fluency lists.
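
To make the replacement rule concrete, here is a minimal sketch of a Levenshtein-based lookup with a threshold of two edits. The functions `levenshtein` and `closest_match` and the toy lexicon are assumptions made for illustration; forager's internal implementation may differ (for example, in how ties are broken).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def closest_match(item, lexicon, max_distance=2):
    """Return the nearest lexicon entry, or None if it is more than max_distance edits away."""
    if item in lexicon:
        return item
    best = min(lexicon, key=lambda word: levenshtein(item, word))  # first entry wins ties
    return best if levenshtein(item, best) <= max_distance else None

toy_lexicon = ["horse", "cat", "dog", "hamster"]  # illustrative, not the real lexicon
print(closest_match("horses", toy_lexicon))   # one edit away from "horse"
print(closest_match("unicorn", toy_lexicon))  # no entry within two edits
```

Items for which the lookup finds no match within two edits are the ones that fall under the user-selected OOV policy.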

If the edit distance between the OOV item and the closest match found in the lexicon is more than two, then forager provides three options to the user: exclude all occurrences of the OOV item, truncate the list(s) at the first occurrence of the OOV item, or replace the OOV item with a placeholder label (UNK) that represents the mean of the lexicon. The semantic vector of UNK is the centroid of all other vectors in the lexicon, its phonological similarity is the average phonological similarity across all other items in the vocabulary, and its frequency is set to 0.0001.
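
The three options can be sketched in a few lines. The policy names (`"exclude"`, `"truncate"`, `"replace"`), the function names, and the two-dimensional toy embeddings are hypothetical and chosen only for this example; they are not forager's actual argument values or data.

```python
UNK = "UNK"
UNK_FREQUENCY = 0.0001  # frequency assigned to the placeholder, as stated above

def apply_oov_policy(items, lexicon, policy):
    """Apply one of the three OOV options to a single fluency list."""
    if policy == "exclude":
        return [word for word in items if word in lexicon]
    if policy == "truncate":
        kept = []
        for word in items:
            if word not in lexicon:
                break  # stop at the first OOV item
            kept.append(word)
        return kept
    if policy == "replace":
        return [word if word in lexicon else UNK for word in items]
    raise ValueError(f"unknown policy: {policy}")

def centroid(vectors):
    """Component-wise mean of equal-length vectors (used here for the UNK embedding)."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}  # illustrative values
responses = ["cat", "xylograph", "dog"]

print(apply_oov_policy(responses, toy_embeddings, "exclude"))
print(apply_oov_policy(responses, toy_embeddings, "truncate"))
print(apply_oov_policy(responses, toy_embeddings, "replace"))
print(centroid(list(toy_embeddings.values())))
```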

After providing the file containing the fluency lists and specifying an OOV policy, the web interface of forager runs through the data and provides the evaluation results in a .zip file, which contains three .csv files (see outputs for details).

switch methods

forager provides five different methods for determining clusters and switches in a fluency list (two of them norm-based):

  1. Norm-based, which relies on the hand-coded norms of animal subcategories (e.g., pets, aquatic animals) created by Troyer et al. (1997) and subsequently extended by Lundin et al. (2022) and Zemla et al. (2020). Two norm-based methods are provided (based on Hills et al. 2015).
  2. Similarity-drop, based on the heuristic used by Hills et al. (2012), where a switch is predicted if there is a drop in semantic similarity between consecutive items followed by an immediate rise in semantic similarity.
  3. Delta similarity, based on Lundin et al. (2022), where switches and clusters depend on whether a rise or drop in semantic similarity exceeds specific thresholds.
  4. Multimodal similarity drop, where the similarity between consecutive items is a weighted sum of the semantic and phonological similarity, and switches correspond to drops in semantic-phonological similarity.

The file switch_results.csv will contain the item-level cluster/switch designations for each method (see outputs for details).
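
The similarity-drop heuristic and the multimodal weighting can be sketched as below. This is one reading of the descriptions above, not forager's code: the function names, the weight `alpha`, the use of `None` for items that cannot be classified, and the similarity values are all assumptions made for illustration.

```python
def similarity_drop_switches(consecutive):
    """Label each item 1 (switch) or 0 (cluster) using a drop-then-rise rule.

    consecutive[k] is the similarity between item k and item k + 1.
    Item i is a switch when the similarity leading into it is lower than both
    the preceding and the following consecutive similarity. Items without
    both neighbours are left as None.
    """
    n_items = len(consecutive) + 1
    labels = [None] * n_items
    for i in range(2, n_items - 1):
        before, here, after = consecutive[i - 2], consecutive[i - 1], consecutive[i]
        labels[i] = 1 if (here < before and here < after) else 0
    return labels

def multimodal_similarity(semantic, phonological, alpha):
    """Weighted sum of semantic and phonological similarity (alpha weights semantics)."""
    return [alpha * s + (1 - alpha) * p for s, p in zip(semantic, phonological)]

# Illustrative similarities for: cat, dog, hamster, shark, whale
semantic = [0.8, 0.7, 0.2, 0.6]
phonological = [0.1, 0.2, 0.1, 0.3]

print(similarity_drop_switches(semantic))
combined = multimodal_similarity(semantic, phonological, alpha=0.5)
print(similarity_drop_switches(combined))
```

In this toy list the only predicted switch is at "shark", where similarity drops after "hamster" and rises again going into "whale".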

models

forager comes with several foraging models that can be fit to VFT data (static, dynamic, etc.). The models differ in their use of three lexical sources (semantic similarity, phonological similarity, and frequency) during cluster and switch transitions. Details of the computational models are provided in Hills et al. (2012) and Kumar, Lundin, & Jones (2022), as well as in the package documentation; brief descriptions follow below. Users can run a single model, a subset of models, or all models for comparison. Each model will calculate the overall negative log-likelihood (NLL) of the data, as well as participant- and item-level NLLs. A lower NLL indicates a better-fitting model.
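
As a brief illustration of how the three levels of NLL relate, the sketch below assumes that a model assigns a probability to each produced item and that participant-level and overall NLLs are sums of item-level values. The probabilities are made up for the example and the helper name `item_nlls` is hypothetical.

```python
import math

def item_nlls(probabilities):
    """Negative log-likelihood of each item given its model probability."""
    return [-math.log(p) for p in probabilities]

# Made-up item probabilities for two participants.
item_level = {"1": item_nlls([0.5, 0.25]), "2": item_nlls([0.1])}

# Assumption: participant-level NLL is the sum over that participant's items,
# and the overall NLL is the sum over participants.
participant_level = {sid: sum(values) for sid, values in item_level.items()}
overall = sum(participant_level.values())

print(participant_level)
print(overall)
```

Because each item contributes the negative log of its probability, items that a model finds more likely add less to the total, which is why lower totals indicate a better fit.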

Note: If users wish to run foraging models via the web interface, they are directed to a Colab notebook, where they can upload their data and run the models.

  1. Static foraging model. The static model (Hills et al., 2012) uses semantic similarity and word frequency to calculate the probability of retrieving an item without consideration of transitions between clusters.
  2. Dynamic foraging model. The dynamic model (Hills et al., 2012) uses different cues to determine an item’s likelihood based on whether the item belongs to a cluster or signifies a switch event. For items within a cluster, the model is identical to the static model and uses semantic similarity and word frequency to make local transitions. When items are designated as switches, the likelihood is computed based on frequency alone.
  3. Phonology-based models. In addition to the classic foraging models described above, we also introduce and release a range of experimental models that explore the influence of phonology in local (within-cluster) and global (between-cluster) transitions. Specifically, we adapted the static and dynamic models from Hills et al. (2012) to incorporate phonological similarity cues, based on recent work by Kumar, Lundin, and Jones (2022). The static phonology model is identical to the static model above, except that the product of the frequencies and semantic similarities is also multiplied by phonological similarities. Like the frequency and semantic similarity cues, the phonological similarity is also weighted by a saliency parameter. The dynamic phonology model has an additional argument specifying the type of phonological cue used from the following options: “local,” “global,” or “switch.” The “local” dynamic model incorporates phonological similarity as an additional cue for within-cluster transitions. The “global” dynamic model incorporates phonological similarity in both switch and cluster transitions. Finally, the “switch” dynamic model computes the likelihood of an item based on phonological similarity and frequency for switch transitions and based on semantic similarity and frequency for cluster transitions.

The file model_results.csv will contain the model-based negative log-likelihoods for the selected models (see outputs for details).
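
The difference between the static and dynamic models can be sketched with a toy lexicon. This sketch assumes that the saliency parameters act as exponents on each cue and that probabilities are normalized over the whole lexicon, following the general form in Hills et al. (2012). The function name, parameter names (`beta_f`, `beta_s`), and all numbers are hypothetical and do not reflect forager's API or data.

```python
def retrieval_probability(target, previous, frequency, similarity,
                          beta_f, beta_s, is_switch=False):
    """Probability of producing `target` after `previous`.

    Cluster transitions use frequency and semantic similarity; when
    is_switch is True (dynamic model only), frequency alone is used.
    """
    def strength(word):
        value = frequency[word] ** beta_f
        if not is_switch:
            value *= similarity[previous][word] ** beta_s
        return value

    total = sum(strength(word) for word in frequency)
    return strength(target) / total

# Toy lexical data (illustrative values only).
frequency = {"cat": 0.5, "dog": 0.3, "shark": 0.2}
similarity = {
    "cat": {"cat": 1.0, "dog": 0.8, "shark": 0.2},
    "dog": {"cat": 0.8, "dog": 1.0, "shark": 0.3},
    "shark": {"cat": 0.2, "dog": 0.3, "shark": 1.0},
}

# Static model: every transition uses both cues.
print(retrieval_probability("shark", "cat", frequency, similarity, 1.0, 1.0))
# Dynamic model: the same transition, designated as a switch, uses frequency only.
print(retrieval_probability("shark", "cat", frequency, similarity, 1.0, 1.0, is_switch=True))
```

Under this sketch, a phonology-based variant would multiply in a third term, phonological similarity raised to its own saliency parameter, in the cluster transitions, the switch transitions, or both, depending on the “local,” “global,” or “switch” option.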

outputs

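Different output files are generated when forager is run, based on the use case. As noted above, switch_results.csv holds the item-level cluster/switch designations for each switch method, and model_results.csv holds the model-based negative log-likelihoods for the selected models.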

Possible analyses with forager include finding the best-fitting model for a set of fluency lists, evaluating the performance of a specific model, obtaining metrics of semantic and/or phonological similarity as well as different cluster/switch designations, and comparing the model performance for different groups. Users are encouraged to read the paper for demonstrations of these analyses.

citation

If you use forager, please use the guidelines on the cite page to cite our work!