This page contains more information about the key components of forager: data, the default policy for handling out-of-vocabulary words, switch methods, models, and outputs.
To use forager on one’s own data, the user needs to upload a single text/CSV file of fluency lists with two columns (with headers): one for the participant identifier and one for the item they produced. For example, if participant 1 produced 30 response items, there would be 30 rows with “1” in the first column, one for each item in their fluency list. The rows should be separated by newlines, and the columns should be separated by the same delimiter throughout (such as a tab, space, or comma). Most spreadsheet tools (e.g., Excel) can save files as a CSV, which can also be converted to a text file. Below is an example of a file that forager will be able to process:
SID entry
1 bat
1 cat
1 horse
2 dog
2 hamster
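A file like the one above can be parsed into per-participant lists with a few lines of Python. This is a minimal sketch, not part of forager itself; the tab delimiter and the column names `SID` and `entry` are assumptions matching the sample file.

```python
import csv
import io

# Hypothetical example: parse a forager-style fluency file
# (tab-delimited, with "SID" and "entry" headers) into
# one ordered list of items per participant.
raw = "SID\tentry\n1\tbat\n1\tcat\n1\thorse\n2\tdog\n2\thamster\n"

lists = {}
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    lists.setdefault(row["SID"], []).append(row["entry"])

print(lists)  # {'1': ['bat', 'cat', 'horse'], '2': ['dog', 'hamster']}
```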
Please note: The current version of forager already contains the necessary lexical data to process English-language VFT data for the “animals” category, and the web interface works for this category. However, those who wish to analyze data from a different category should use the Python package directly, upload their own lexicon of acceptable words and corresponding semantic embeddings for that category, and derive the necessary frequency and similarity data using the functions provided in the package or via the Colab interface.
Please note: If your file has a third column, forager will automatically assume that it is a timepoint, and will split the data from a given participant into separate lists by timepoint. For example, if the file has the following format:
SID entry time
1 bat 1
1 cat 1
1 horse 2
2 dog 1
2 hamster 2

In this case, forager will treat the data from participant 1 as two separate lists, one for timepoint 1 and one for timepoint 2. If you do not want this behavior, please remove the time column from your file before uploading it to forager.
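The splitting behavior described above amounts to keying each list on a (participant, timepoint) pair instead of the participant alone. A small sketch, again with assumed tab delimiter and column names:

```python
import csv
import io

# Sketch of the timepoint behavior: with a third column present,
# each (participant, timepoint) pair becomes its own fluency list.
raw = ("SID\tentry\ttime\n"
       "1\tbat\t1\n1\tcat\t1\n1\thorse\t2\n"
       "2\tdog\t1\n2\thamster\t2\n")

lists = {}
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    lists.setdefault((row["SID"], row["time"]), []).append(row["entry"])

print(lists[("1", "1")])  # ['bat', 'cat']
print(lists[("1", "2")])  # ['horse']
```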
All search-related functions in forager rely on the lexical measures mentioned above. Therefore, the items in the fluency lists must be in the stored lexicon (i.e., in the files containing the embeddings, frequencies, and similarity matrices). An out-of-vocabulary (OOV) item is automatically replaced by a close match in the lexicon if forager finds a reasonable replacement. forager classifies a replacement as reasonable if the Levenshtein edit distance between the OOV item and its closest match in the lexicon is two or less (e.g., horses would be replaced by horse). This policy allows forager to correct for minor variants, plurals, and spelling errors within the fluency lists.
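The replacement rule can be sketched directly: compute the Levenshtein edit distance from the OOV item to every lexicon entry and accept the nearest one only if the distance is two or less. The function names below are illustrative, not forager's API.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insertions,
    # deletions, and substitutions each cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_match(item, lexicon, max_dist=2):
    # Mirror the policy above: accept the nearest lexicon entry
    # only if its edit distance from the OOV item is two or less.
    best = min(lexicon, key=lambda w: levenshtein(item, w))
    return best if levenshtein(item, best) <= max_dist else None

print(closest_match("horses", ["horse", "cat", "dog"]))    # horse
print(closest_match("elephant", ["horse", "cat", "dog"]))  # None
```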
If the edit distance between the OOV item and the closest match found in the lexicon is more than two, then forager provides three options to the user: exclude all occurrences of the OOV item, truncate the list(s) at the first occurrence of the OOV item, or replace the OOV item with a placeholder label (UNK) from the lexicon. The UNK label's semantic vector is the centroid of all other vectors in the lexicon, its phonological similarity is the average phonological similarity across all other items in the vocabulary, and its frequency is set to 0.0001.
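A sketch of how such a placeholder's semantic vector could be constructed as a centroid. The toy two-dimensional vectors are invented for illustration; only the 0.0001 frequency value comes from the description above.

```python
# Hypothetical sketch of the UNK placeholder: its semantic vector is
# the centroid (per-dimension mean) of all lexicon vectors, and its
# frequency is fixed at 0.0001.
lexicon_vectors = {
    "horse": [1.0, 0.0],
    "cat":   [0.0, 1.0],
    "dog":   [1.0, 1.0],
}

dim = 2
unk_vector = [sum(vec[d] for vec in lexicon_vectors.values()) / len(lexicon_vectors)
              for d in range(dim)]
unk_frequency = 0.0001

print(unk_vector)  # each component is 2/3, the centroid of the toy vectors
```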
After providing the file containing the fluency lists and specifying an OOV policy, the web interface of forager runs through the data and provides the evaluation results in a .zip file, which contains three .csv files (see outputs for details).
forager provides five different methods for determining clusters and switches in a fluency list. The switch_results.csv file will contain the item-level cluster/switch designations for each method (see outputs for details).
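To give an intuition for what a cluster/switch method computes, here is a sketch of a simple "similarity drop" heuristic: mark a switch when the similarity to the preceding item dips and then recovers. This is a hypothetical simplification for illustration, not forager's implementation of any of its five methods; the output coding (0 = cluster, 1 = switch, 2 = no prediction) mirrors the convention used in switch_results.csv.

```python
def similarity_drop_switches(sims):
    # sims[i] = similarity between item i-1 and item i (sims[0] is unused).
    # Returns one code per item: 2 = no prediction possible (list edges),
    # 1 = switch, 0 = within-cluster continuation.
    n = len(sims)
    codes = [2] * n
    for i in range(2, n - 1):
        # A switch at item i: similarity dropped relative to the
        # previous transition and rises again on the next one.
        codes[i] = 1 if (sims[i] < sims[i - 1] and sims[i + 1] > sims[i]) else 0
    return codes

print(similarity_drop_switches([0.0, 0.9, 0.2, 0.8, 0.7]))  # [2, 2, 1, 0, 2]
```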
forager comes with several foraging models that can be fit to VFT data (static, dynamic, etc.). The models differ in their use of three lexical sources (semantic similarity, phonological similarity, and frequency) during cluster and switch transitions. Details of the computational models are provided in Hills et al. (2012) and Kumar, Lundin, & Jones (2022), as well as in the package documentation, although we provide brief descriptions below. Users can run a single model, a subset of models, or all models for comparison. Each model will calculate the overall negative log-likelihood (NLL) of the data, as well as participant- and item-level NLLs. Lower NLLs indicate a better model fit.
Note: If users wish to run foraging models via the web interface, they are directed to a Colab notebook, where they can upload their data and run the models. The model_results.csv file will contain the model-based negative log-likelihoods for the selected models (see outputs for details).
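To make the NLL comparison concrete, here is a sketch of a likelihood computation for a toy frequency-only choice rule, loosely in the spirit of the static models in Hills et al. (2012). The sampling-without-replacement rule and the beta parameter here are assumptions for illustration, not forager's exact likelihood.

```python
import math

def static_model_nll(responses, frequencies, beta=1.0):
    # Toy choice rule: P(next item) is proportional to frequency**beta
    # over the lexicon items not yet produced; the NLL sums -log P over
    # the observed responses (lower NLL = better fit).
    remaining = set(frequencies)
    nll = 0.0
    for item in responses:
        weights = {w: frequencies[w] ** beta for w in remaining}
        nll += -math.log(weights[item] / sum(weights.values()))
        remaining.discard(item)
    return nll

freqs = {"cat": 2.0, "dog": 1.0, "horse": 1.0}
fitted = static_model_nll(["cat", "dog"], freqs, beta=1.0)
baseline = static_model_nll(["cat", "dog"], freqs, beta=0.0)  # equal likelihoods
print(fitted < baseline)  # True: the frequency-sensitive model fits better here
```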
Different output files are generated when forager is run, based on the use case:
Data evaluation .zip file:

evaluation_results.csv: This file will contain the results from forager's evaluation of all the items in the data against its own vocabulary. The evaluation column in the file will describe whether an exact match to the item was found (FOUND) or a reasonable replacement was made (REPLACE), and how the OOV items were handled based on the user-specified policy (EXCLUDE/TRUNCATE/UNK). The replacement column in the file will describe the specific replacements made to the items.

processed_data.csv: This file will contain the final data that will be submitted for further analyses to forager. Users should carefully inspect this file to ensure that it aligns with their expectations, and make changes if needed.

forager_vocab.csv: This file contains the vocabulary used by forager to conduct the evaluation. We provide this file so that users can make changes to their data if needed before they proceed with other analyses.

Get Lexical Values .zip file:
lexical_results.csv: This file contains item-wise lexical metrics (semantic similarity, phonological similarity, and word frequency). The semantic and phonological similarities indicate the similarity between the previous item and the current item (the first item will have an arbitrary value of .0001), whereas the frequency values indicate the frequency of the current item in the English language (obtained via Google N-grams).

individual_descriptive_stats.csv: This file contains aggregate metrics at the participant level, such as the total number of items produced, as well as means/SDs of semantic similarity, phonological similarity, and word frequency.

Get Switches .zip file:
switch_results.csv: This file contains the item-level cluster/switch designations for each method. A switch is indicated by a 1, and a cluster is indicated by a 0. A value of 2 denotes either the first item in the list or the last item(s) for switch methods that rely on previous/subsequent items (i.e., no switch/cluster prediction can be made).

lexical_results.csv: This file will be identical to the one generated in the Get Lexical Values option.

individual_descriptive_stats.csv: In addition to the metrics available from lexical results (mean/SD of lexical values and number of items), this file will also contain the total number of switches and mean cluster size for each switch method.

aggregate_descriptive_stats.csv: This file will contain the mean/SD for the number of switches (aggregated across all participants) for each switch method.

Model fitting .zip file:
model_results.csv: This file will contain the model-based negative log-likelihoods for the selected models (see models for details), as well as the best-fitting parameter values for semantic, phonological, and frequency cues for each model, at the subject level. A random baseline model, which assigns an equal likelihood to every item in the fluency list, is also included for comparison purposes.

aggregate_descriptive_stats.csv: In addition to the metrics available from the Get Switches option, this file will also contain the mean/SD values of the parameters (aggregated across all participants) for each model and switch method.

If you use forager, please follow the guidelines on the cite page to cite our work!