topik.intermediaries package
Submodules
topik.intermediaries.digested_document_collection module
topik.intermediaries.persistence module
This module handles the storage of data from the loading and analysis steps. More precisely, the files it writes and reads describe how to read/write the actual data, so that the format of that data need not be tightly defined.
topik.intermediaries.raw_data module
This module provides a simple interface to data stored in Elasticsearch. The classes defined here are fed into the preprocessing step.
class topik.intermediaries.raw_data.CorpusInterface
Bases: object
append_to_record(record_id, field_name, field_value)
Used to store preprocessed output alongside input data. field_name is the destination field; field_value is the processed value.
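As a rough illustration (assuming corpus is an existing DictionaryCorpus or ElasticSearchCorpus instance; the record id, field name, and value below are placeholders):

    # store a tokenized version of a document next to its raw content;
    # "doc_1" stands in for whatever id the corpus assigned to the record
    corpus.append_to_record("doc_1", "tokenized_text",
                            ["topic", "modeling", "is", "fun"])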
classmethod class_key()
Implement this method to return the string ID with which to store your class.
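A minimal sketch of how a custom backend might satisfy this (the subclass name and key string are invented for illustration):

    from topik.intermediaries.raw_data import CorpusInterface

    class MyCorpus(CorpusInterface):
        @classmethod
        def class_key(cls):
            # the string recorded in saved output so that loading code
            # knows which class to re-instantiate
            return "my_corpus"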
filter_string
get_generator_without_id(field=None)
Returns a generator that yields field content without the associated doc_id.
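For example (assuming corpus is an existing corpus instance):

    # yields only each document's content, with no doc_id attached
    for text in corpus.get_generator_without_id():
        print(text)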
save(filename, saved_data=None)
Persist this object to disk somehow. You can save your data in any number of files in any format, but at a minimum you need one JSON file that describes enough to bootstrap the loading process. Namely, you must have a key called 'class' so that upon loading the output, the correct class can be instantiated and used to load any other data. You don't have to implement anything for saved_data, but it is stored as a key next to 'class'.
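A hedged sketch of what a call and the resulting bootstrap file might look like (the filename and the exact file contents are illustrative, not a guaranteed format):

    # write out enough metadata to reload this corpus later
    corpus.save("my_corpus.json")

    # conceptually, the bootstrap JSON records at least which class to
    # re-instantiate on load, roughly along the lines of:
    #   {"class": "<result of class_key()>", "saved_data": {...}}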
synchronize(max_wait, field)
By default, operations are synchronous and no additional wait is necessary. Data sources that are asynchronous (Elasticsearch) may use this function to wait for "eventual consistency".
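For an asynchronous backend, a call might look like this (the wait time and field name are illustrative):

    # give the index up to 30 seconds to make freshly written
    # "tokenized_text" values visible before reading them back
    corpus.synchronize(max_wait=30, field="tokenized_text")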
tokenize(method='simple', synchronous_wait=30, **kwargs)
Convert data to lowercase; tokenize; create bag-of-words collection. Output from this function is used as input to modeling steps (see the usage sketch after the parameter list).
- raw_data: iterable corpus object containing the text to be processed. Each iteration call should return a new document's content.
- tokenizer_method: string id of tokenizer to use. For keys, see topik.tokenizers.tokenizer_methods (which is a dictionary of classes).
- kwargs: arbitrary dictionary of extra parameters. These are passed both to the tokenizer and to the vectorizer steps.
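A minimal usage sketch (assuming corpus is an existing corpus instance; any extra keyword arguments would be forwarded to the tokenizer and vectorizer steps):

    # lowercase, tokenize, and build a bag-of-words collection;
    # the returned object is what the modeling steps consume
    tokenized_corpus = corpus.tokenize(method='simple')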
class topik.intermediaries.raw_data.DictionaryCorpus(content_field, iterable=None, generate_id=True, reference_field=None, content_filter=None)
Bases: topik.intermediaries.raw_data.CorpusInterface
filter_string
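A small sketch of building an in-memory corpus (the documents and field name are made up; the arguments follow the constructor signature above):

    from topik.intermediaries.raw_data import DictionaryCorpus

    # hold a handful of documents in memory; "text" is the content field
    corpus = DictionaryCorpus(
        content_field="text",
        iterable=[{"text": "Topic modeling finds themes in documents."},
                  {"text": "Elasticsearch can also back a corpus."}])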
class topik.intermediaries.raw_data.ElasticSearchCorpus(source, index, content_field, doc_type=None, query=None, iterable=None, filter_expression='', **kwargs)
Bases: topik.intermediaries.raw_data.CorpusInterface
filter_string
get_field(field=None)
Get a different field to iterate over, keeping all other connection details.
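For instance (the field name is a placeholder):

    # iterate a different field from the same index and connection
    title_corpus = corpus.get_field("title")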
import_from_iterable(iterable, id_field='text', batch_size=500)
Load data into Elasticsearch from an iterable (see the sketch after the parameter list).
- iterable: generally a list of dicts, but possibly a list of strings. This is your data; your dictionary structure defines the schema of the Elasticsearch index.
- id_field: string identifier of the field to hash for the content ID. For a list of dicts, a valid key in the dictionary is required. For a list of strings, a dictionary with one key, "text", is created and used.
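A hedged sketch of loading a few documents (the host, index name, and documents are placeholders, not values the library mandates):

    from topik.intermediaries.raw_data import ElasticSearchCorpus

    # connect to a running Elasticsearch instance
    corpus = ElasticSearchCorpus(source="localhost:9200",
                                 index="topik_demo",
                                 content_field="abstract")

    # index a small list of dicts; each dict's keys define the index schema,
    # and the "abstract" field is hashed to produce each document's id
    corpus.import_from_iterable(
        [{"abstract": "Latent Dirichlet allocation groups words into topics."},
         {"abstract": "Matrix factorization offers another topic model."}],
        id_field="abstract")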