Yatsko's Linguistic Informatics
Y-Sets application
Y-Sets is an application for statistical text processing. Unlike existing concordancers, it lets the user intersect two or more texts to find the words they share and to get statistical information about the distribution of those words. With Y-Sets you can also get statistics about words that occur in only one text and do not appear in the other texts under analysis (the differentiation function).
The application supports English and Russian and processes .txt files in UTF-8 encoding on Windows machines. Statistical information obtained with Y-Sets can then be used for text classification, authorship attribution, information retrieval, and plagiarism detection.
Y-Sets has two modes: single file processing and batch processing. To launch the application, open its directory and double-click the Y-Sets.exe file.
Single file processing
In single file processing you can open a file or import a previously created word list saved in .txt format. A word list is a ranked list in which each word of a text is assigned a rank and a frequency of occurrence. The list is sorted in descending order of frequency, so ranks run in ascending order.
To open and process a file, click Create word list and choose Open/Add file(s) in the drop-down list. Find and open the file in Windows Explorer. The application generates a word list with words, ranks, and frequencies. The File data section displays the number of unique words and the number of tokens.
You can modify the word list by applying Filter 1 and/or Filter 2. You can save the results by exporting either the current word list or the source word list.
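To make the notion of a word list concrete, here is a minimal Python sketch of how such a ranked list could be built. It is not Y-Sets code: the tokenization rule (lowercased runs of letters), the alphabetical tie-breaking, and the names tokenize and word_list are assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a token is a run of letters, compared case-insensitively.
    # The pattern matches Unicode letters, so it covers English and Russian.
    return re.findall(r"[^\W\d_]+", text.lower())

def word_list(text):
    """Return (word, rank, frequency) rows sorted by descending frequency."""
    counts = Counter(tokenize(text))
    # Assumption: ties are broken alphabetically; Y-Sets may order them differently.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq) for rank, (word, freq) in enumerate(ordered, start=1)]

sample = "The cat sat on the mat. The dog sat too."
rows = word_list(sample)
unique_words = len(rows)                      # number of distinct words
tokens = sum(freq for _, _, freq in rows)     # total number of word occurrences

for row in rows:
    print(row)
print("unique words:", unique_words, "tokens:", tokens)
```

For the sample sentence the first row is ('the', 1, 3), and the text has 7 unique words and 10 tokens.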
Filter 1 returns statistics about the distribution of only those words that the user has supplied in a file or entered manually in the bottom right field of the application. All other words are ignored. For example, suppose you want statistics about the distribution in a given text of stop words that are stored in a file. Click Filter 1, choose Add words from file(s), and open the file with these words. The application displays the following statistics.
Current rank is the rank of a word in the current word list (after filtering); original rank is its rank in the source text. File data shows the number of unique stop words and the number of tokens. To look up the filter words, click the Show filters menu item.
Alternatively, instead of loading a file you can enter the words manually and then choose the Add and use words below option.
IMPORTANT: You cannot use file filtering and manual filtering simultaneously. Use the Clear filters function before switching from one mode to the other.
Note that a filter works only with the list of filter words it was applied with. To change that list, use the Clear filters function first. For example, suppose you applied the filter to obtain statistics about "the", "a", and "and". Now you want statistical information about "in". First apply Clear filters, then enter the new word or words and apply the filter again. In short, each time you change the list of words to be filtered, apply Clear filters before using the modified list.
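The behaviour of Filter 1, including the difference between current rank and original rank, can be sketched in Python as follows. This is an illustration under assumptions, not the application's implementation: the helper names ranked and filter_keep, the tokenization, and the alphabetical tie-breaking are hypothetical.

```python
import re
from collections import Counter

def ranked(text):
    # Assumption: tokens are lowercased runs of letters; ties sort alphabetically.
    words = re.findall(r"[^\W\d_]+", text.lower())
    ordered = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, rank, freq) for rank, (word, freq) in enumerate(ordered, start=1)]

def filter_keep(source_rows, keep_words):
    """Keep only keep_words; rows become (word, current_rank, original_rank, freq)."""
    keep = {w.lower() for w in keep_words}
    kept = [row for row in source_rows if row[0] in keep]
    # Current rank is the position in the filtered list; original rank is preserved.
    return [(w, cur, orig, f) for cur, (w, orig, f) in enumerate(kept, start=1)]

source = ranked("The cat sat on the mat and a dog sat on a rug.")
filtered = filter_keep(source, ["the", "a", "and"])
for row in filtered:
    print(row)
```

In this sample "the" has original rank 4 in the source list but current rank 2 after filtering, because only three words remain in the current list.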
Filter 2 returns statistics about the distribution of all words in the text except those that the user has supplied in a file or entered manually in the bottom right field of the application. The supplied words are ignored. For example, suppose you want statistics about all words in the text except stop words. Choose a file with stop words or enter them manually, and the application ignores them.
Both filters can be used at the same time by adding some words to Filter 1 and other words to Filter 2.
You can save the results by exporting them to a .txt, .csv, or .xlsx file, or by copying them to the clipboard and pasting them into a spreadsheet or mathematical software. You can export either the current word list generated after filtering or the word list of the source text.
When you have finished processing one file, you can proceed to another by applying the Clear all tools and files option. If you add a new file without clearing the results, the application switches to batch processing mode.
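Filter 2 is the complement of a keep-list: the supplied words are dropped and everything else is re-ranked. A minimal Python sketch, assuming the same simplified tokenization as above and hypothetical helper names ranked and filter_exclude (whether Y-Sets reports both ranks for Filter 2 is also an assumption):

```python
import re
from collections import Counter

def ranked(text):
    # Assumption: tokens are lowercased runs of letters; ties sort alphabetically.
    words = re.findall(r"[^\W\d_]+", text.lower())
    ordered = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, rank, freq) for rank, (word, freq) in enumerate(ordered, start=1)]

def filter_exclude(source_rows, ignored_words):
    """Drop ignored_words; rows become (word, current_rank, original_rank, freq)."""
    ignored = {w.lower() for w in ignored_words}
    kept = [row for row in source_rows if row[0] not in ignored]
    return [(w, cur, orig, f) for cur, (w, orig, f) in enumerate(kept, start=1)]

source = ranked("The cat sat on the mat and a dog sat on a rug.")
content_rows = filter_exclude(source, ["the", "a", "and"])
unique_words = len(content_rows)                    # distinct words left after filtering
tokens = sum(freq for _, _, _, freq in content_rows)  # their total occurrences
print(content_rows[0], unique_words, tokens)
```

With the stop words removed, 6 unique words and 8 tokens remain in the sample, and "on" moves from original rank 2 to current rank 1.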
Batch processing
Batch processing means processing two or more files and getting statistics about the distribution of words in them. To process several files, add one or more files to an already opened file, or choose a directory using Open/Add directory in the Create word list drop-down list.
As soon as you add files, you will get the following statistics.
"Frequency" is the sum of a word's frequencies of occurrence in all the files added by the user. "Rank" is the rank assigned according to that frequency. The remaining columns give the frequencies and ranks in the individual files.
You can apply Filter 1 and Filter 2 in the same way as for single files (see above). Note that to apply a filter you should first open at least one text. The filter is then applied to all other texts that you add.
Batch processing makes it possible to perform intersection and differentiation. With intersection, you can get statistics about the words that occur in all added files. With differentiation, you can get information about the words that are specific to a given text document. To perform either operation, click the Intersect or Differentiate button.
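The batch table described above can be approximated in Python as follows. The file names and contents are invented for illustration, and the helper names counts and rank_map, the tokenization, and the handling of words absent from a file (frequency 0, no rank) are assumptions rather than documented Y-Sets behaviour.

```python
import re
from collections import Counter

def counts(text):
    # Assumption: tokens are lowercased runs of letters.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def rank_map(counter):
    # Rank by descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}

files = {  # hypothetical file names and contents
    "a.txt": "the cat and the dog",
    "b.txt": "the dog and the bird and the fish",
}
per_file = {name: counts(text) for name, text in files.items()}
total = sum(per_file.values(), Counter())   # summed frequency over all files
overall_rank = rank_map(total)
file_ranks = {name: rank_map(c) for name, c in per_file.items()}

table = []
for word in sorted(total, key=lambda w: overall_rank[w]):
    row = [word, total[word], overall_rank[word]]
    for name in files:
        # A word absent from a file gets frequency 0 and rank None in this sketch.
        row += [per_file[name][word], file_ranks[name].get(word)]
    table.append(row)

for row in table:
    print(row)
```

Each row holds the word, its summed frequency and overall rank, and then a frequency and rank pair for every file.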
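Intersection and differentiation correspond to ordinary set operations on the vocabularies of the files. The following Python sketch illustrates the idea; the function names intersect and differentiate, the sample files, and the tokenization are assumptions for illustration, not the application's own code.

```python
import re
from collections import Counter

def counts(text):
    # Assumption: tokens are lowercased runs of letters.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def intersect(per_file):
    """Words present in every file, with their per-file frequencies."""
    shared = set.intersection(*(set(c) for c in per_file.values()))
    return {w: {name: c[w] for name, c in per_file.items()} for w in sorted(shared)}

def differentiate(per_file):
    """For each file, the words that occur in it and in no other file."""
    result = {}
    for name, c in per_file.items():
        others = set().union(*(set(o) for n, o in per_file.items() if n != name))
        result[name] = {w: c[w] for w in sorted(set(c) - others)}
    return result

per_file = {  # hypothetical file names and contents
    "a.txt": counts("the cat and the dog"),
    "b.txt": counts("the dog and the bird and the fish"),
}
common = intersect(per_file)
specific = differentiate(per_file)
print(common)
print(specific)
```

Here "the", "and", and "dog" form the intersection, while "cat" is specific to a.txt and "bird" and "fish" are specific to b.txt.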
You can undo intersection or differentiation using the corresponding buttons. You can save the results in the same way as the results of single file processing.
The Y-Sets application is distributed as freeware. Go to the Downloads section to get it. The application folder contains a User Guide with screenshots illustrating its functionality.