This page contains more information about the key components of forager: data, the default policy for handling out-of-vocabulary words, switch methods, models, and outputs.
To use forager on one’s own data, the user needs to upload a single text/CSV file of fluency lists with two columns (with headers): one for the participant identifier and one for the item they produced. For example, if participant 1 produced 30 response items, there would be 30 rows with “1” in the first column, one for each item in their fluency list. The rows should be separated by newlines, and the columns should be separated by the same delimiter throughout (such as a tab, space, or comma). Most spreadsheet tools (e.g., Excel) can save files as a CSV, which can also be converted to a text file. Below is an example of a file that forager will be able to process:
SID entry
1 bat
1 cat
1 horse
2 dog
2 hamster
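A file like the one above can be parsed into per-participant lists with a few lines of Python. This is a minimal sketch, not part of forager itself; the tab delimiter and the column names `SID` and `entry` are assumptions matching the sample file.

```python
import csv
import io

# Hypothetical example: parse a forager-style fluency file
# (tab-delimited, with "SID" and "entry" headers) into
# one ordered list of items per participant.
raw = "SID\tentry\n1\tbat\n1\tcat\n1\thorse\n2\tdog\n2\thamster\n"

lists = {}
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    lists.setdefault(row["SID"], []).append(row["entry"])

print(lists)  # {'1': ['bat', 'cat', 'horse'], '2': ['dog', 'hamster']}
```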
Please note: The current version of forager already contains the necessary lexical data to process English-language VFT data for the “animals” category, and the web interface works for this category. However, those who wish to analyze data from a different category should use the Python package directly, upload their own lexicon of acceptable words and corresponding semantic embeddings for that category, and derive the necessary frequency and similarity data using the functions provided in the package or via the Colab interface.
Please note: If your file has a third column, forager will automatically assume that it is a timepoint, and will split the data from a given participant into separate lists by timepoint. For example, if the file has the following format:
SID entry time
1 bat 1
1 cat 1
1 horse 2
2 dog 1
2 hamster 2

In this case, forager will treat the data from participant 1 as two separate lists, one for timepoint 1 and one for timepoint 2. If you do not want this behavior, please remove the time column from your file before uploading it to forager.
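The splitting behavior described above amounts to keying each list on a (participant, timepoint) pair instead of the participant alone. A small sketch, again with assumed tab delimiter and column names:

```python
import csv
import io

# Sketch of the timepoint behavior: with a third column present,
# each (participant, timepoint) pair becomes its own fluency list.
raw = ("SID\tentry\ttime\n"
       "1\tbat\t1\n1\tcat\t1\n1\thorse\t2\n"
       "2\tdog\t1\n2\thamster\t2\n")

lists = {}
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    lists.setdefault((row["SID"], row["time"]), []).append(row["entry"])

print(lists[("1", "1")])  # ['bat', 'cat']
print(lists[("1", "2")])  # ['horse']
```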
All search-related functions in forager rely on the lexical measures mentioned above. Therefore, the items in the fluency lists must be in the stored lexicon (i.e., in the files containing the embeddings, frequencies, and similarity matrices). An out-of-vocabulary (OOV) item is automatically replaced by a close match in the lexicon if forager finds a reasonable replacement. forager classifies a replacement as reasonable if the Levenshtein edit distance between the OOV item and its closest match in the lexicon is two or less (e.g., horses would be replaced by horse). This policy allows forager to correct for minor variants, plurals, and spelling errors within the fluency lists.
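The replacement rule can be sketched directly: compute the Levenshtein edit distance from the OOV item to every lexicon entry and accept the nearest one only if the distance is two or less. The function names below are illustrative, not forager's API.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insertions,
    # deletions, and substitutions each cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_match(item, lexicon, max_dist=2):
    # Mirror the policy above: accept the nearest lexicon entry
    # only if its edit distance from the OOV item is two or less.
    best = min(lexicon, key=lambda w: levenshtein(item, w))
    return best if levenshtein(item, best) <= max_dist else None

print(closest_match("horses", ["horse", "cat", "dog"]))    # horse
print(closest_match("elephant", ["horse", "cat", "dog"]))  # None
```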
If the edit distance between the OOV item and the closest match found in the lexicon is more than two, then forager provides three options to the user: exclude all occurrences of the OOV item, truncate the list(s) at the first occurrence of the OOV item, or replace the OOV item with a placeholder label (UNK) from the lexicon. The UNK label's semantic vector is the centroid of all other vectors in the lexicon, its phonological similarity is the average phonological similarity across all other items in the vocabulary, and its frequency is set to 0.0001.
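A sketch of how such a placeholder's semantic vector could be constructed as a centroid. The toy two-dimensional vectors are invented for illustration; only the 0.0001 frequency value comes from the description above.

```python
# Hypothetical sketch of the UNK placeholder: its semantic vector is
# the centroid (per-dimension mean) of all lexicon vectors, and its
# frequency is fixed at 0.0001.
lexicon_vectors = {
    "horse": [1.0, 0.0],
    "cat":   [0.0, 1.0],
    "dog":   [1.0, 1.0],
}

dim = 2
unk_vector = [sum(vec[d] for vec in lexicon_vectors.values()) / len(lexicon_vectors)
              for d in range(dim)]
unk_frequency = 0.0001

print(unk_vector)  # each component is 2/3, the centroid of the toy vectors
```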
After providing the file containing the fluency lists and specifying an OOV policy, the web interface of forager runs through the data and provides the evaluation results in a .zip file, which contains three .csv files (see outputs for details).
forager provides five different methods for determining clusters and switches in a fluency list. The switch_results.csv file will contain the item-level cluster/switch designations for each method (see outputs for details).
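To give an intuition for what a cluster/switch method computes, here is a sketch of a simple "similarity drop" heuristic: mark a switch when the similarity to the preceding item dips and then recovers. This is a hypothetical simplification for illustration, not forager's implementation of any of its five methods; the output coding (0 = cluster, 1 = switch, 2 = no prediction) mirrors the convention used in switch_results.csv.

```python
def similarity_drop_switches(sims):
    # sims[i] = similarity between item i-1 and item i (sims[0] is unused).
    # Returns one code per item: 2 = no prediction possible (list edges),
    # 1 = switch, 0 = within-cluster continuation.
    n = len(sims)
    codes = [2] * n
    for i in range(2, n - 1):
        # A switch at item i: similarity dropped relative to the
        # previous transition and rises again on the next one.
        codes[i] = 1 if (sims[i] < sims[i - 1] and sims[i + 1] > sims[i]) else 0
    return codes

print(similarity_drop_switches([0.0, 0.9, 0.2, 0.8, 0.7]))  # [2, 2, 1, 0, 2]
```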
forager comes with several foraging models that can be fit to VFT data (static, dynamic, etc.). The models differ in their use of three lexical sources (semantic similarity, phonological similarity, and frequency) during cluster and switch transitions. Details of the computational models are provided in Hills et al. (2012) and Kumar, Lundin, & Jones (2022), as well as in the package documentation, although we provide brief descriptions below. Users can run a single model, a subset of models, or all models for comparison. Each model will calculate the overall negative log-likelihood (NLL) of the data, as well as participant- and item-level NLLs. Lower NLLs indicate a better model fit.
Note: If users wish to run foraging models via the web interface, they are directed to a Colab notebook, where they can upload their data and run the models. The model_results.csv file will contain the model-based negative log-likelihoods for the selected models (see outputs for details).
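To make the NLL comparison concrete, here is a sketch of a likelihood computation for a toy frequency-only choice rule, loosely in the spirit of the static models in Hills et al. (2012). The sampling-without-replacement rule and the beta parameter here are assumptions for illustration, not forager's exact likelihood.

```python
import math

def static_model_nll(responses, frequencies, beta=1.0):
    # Toy choice rule: P(next item) is proportional to frequency**beta
    # over the lexicon items not yet produced; the NLL sums -log P over
    # the observed responses (lower NLL = better fit).
    remaining = set(frequencies)
    nll = 0.0
    for item in responses:
        weights = {w: frequencies[w] ** beta for w in remaining}
        nll += -math.log(weights[item] / sum(weights.values()))
        remaining.discard(item)
    return nll

freqs = {"cat": 2.0, "dog": 1.0, "horse": 1.0}
fitted = static_model_nll(["cat", "dog"], freqs, beta=1.0)
baseline = static_model_nll(["cat", "dog"], freqs, beta=0.0)  # equal likelihoods
print(fitted < baseline)  # True: the frequency-sensitive model fits better here
```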
Different output files are generated when forager is run, based on the use case:
Data evaluation .zip file:

evaluation_results.csv: This file will contain the results from forager's evaluation of all the items in the data against its own vocabulary. The evaluation column in the file will describe whether an exact match to the item was found (FOUND) or a reasonable replacement was made (REPLACE), and how the OOV items were handled based on the user-specified policy (EXCLUDE/TRUNCATE/UNK). The replacement column in the file will describe the specific replacements made to the items.

processed_data.csv: This file will contain the final data that will be submitted for further analyses to forager. Users should carefully inspect this file to ensure that it aligns with their expectations, and make changes if needed.

forager_vocab.csv: This file contains the vocabulary used by forager to conduct the evaluation. We provide this file so that users can make changes to their data if needed before they proceed with other analyses.

Get Lexical Values .zip file:
lexical_results.csv: This file contains item-wise lexical metrics (semantic similarity, phonological similarity, and word frequency). The semantic and phonological similarities indicate the similarity between the previous item and the current item (the first item will have an arbitrary value of .0001), whereas the frequency values indicate the frequency of the current item in the English language (obtained via Google N-grams).

individual_descriptive_stats.csv: This file contains aggregate metrics at the participant level, such as the total number of items produced, as well as means/SDs of semantic similarity, phonological similarity, and word frequency.

Get Switches .zip file:
switch_results.csv: This file contains the item-level cluster/switch designations for each method. A switch is indicated by a 1, and a cluster is indicated by a 0. A value of 2 denotes either the first item in the list or the last item(s) for switch methods that rely on previous/subsequent items (i.e., no switch/cluster prediction can be made).

lexical_results.csv: This file will be identical to the one generated in the Get Lexical Values option.

individual_descriptive_stats.csv: In addition to the metrics available from lexical results (mean/SD of lexical values and number of items), this file will also contain the total number of switches and mean cluster size for each switch method.

aggregate_descriptive_stats.csv: This file will contain the mean/SD for the number of switches (aggregated across all participants) for each switch method.

Model fitting .zip file:
model_results.csv: This file will contain the model-based negative log-likelihoods for the selected models (see models for details), as well as the best-fitting parameter values for semantic, phonological, and frequency cues for each model, at the subject level. A random baseline model, which assigns an equal likelihood to every item in the fluency list, is also included for comparison purposes.

aggregate_descriptive_stats.csv: In addition to the metrics available from the Get Switches option, this file will also contain the mean/SD values of the parameters (aggregated across all participants) for each model and switch method.

If you use forager, please follow the guidelines on the cite page to cite our work!