topik.intermediaries package
Submodules
topik.intermediaries.digested_document_collection module
topik.intermediaries.persistence module
This module handles the storage of data from the loading and analysis steps. More precisely, the files it writes and reads describe how to read/write the actual data, so that the format of that data need not be tightly defined.
topik.intermediaries.raw_data module
This module provides a simple interface to data stored in Elasticsearch. The classes defined here are fed into the preprocessing step.
class topik.intermediaries.raw_data.CorpusInterface
Bases: object
append_to_record(record_id, field_name, field_value)
Used to store preprocessed output alongside input data. field_name is the destination field; field_value is the processed value.
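As a rough illustration (assuming corpus is an existing DictionaryCorpus or ElasticSearchCorpus instance; the record id, field name, and value below are placeholders):

    # store a tokenized version of a document next to its raw content;
    # "doc_1" stands in for whatever id the corpus assigned to the record
    corpus.append_to_record("doc_1", "tokenized_text",
                            ["topic", "modeling", "is", "fun"])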
classmethod class_key()
Implement this method to return the string ID with which to store your class.
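A minimal sketch of how a custom backend might satisfy this (the subclass name and key string are invented for illustration):

    from topik.intermediaries.raw_data import CorpusInterface

    class MyCorpus(CorpusInterface):
        @classmethod
        def class_key(cls):
            # the string recorded in saved output so that loading code
            # knows which class to re-instantiate
            return "my_corpus"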
filter_string
get_generator_without_id(field=None)
Returns a generator that yields field content without the associated doc_id.
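For example (assuming corpus is an existing corpus instance):

    # yields only each document's content, with no doc_id attached
    for text in corpus.get_generator_without_id():
        print(text)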
save(filename, saved_data=None)
Persist this object to disk somehow. You can save your data in any number of files in any format, but at a minimum you need one JSON file that describes enough to bootstrap the loading process. Namely, you must have a key called 'class' so that upon loading the output, the correct class can be instantiated and used to load any other data. You don't have to implement anything for saved_data, but it is stored as a key next to 'class'.
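A hedged sketch of what a call and the resulting bootstrap file might look like (the filename and the exact file contents are illustrative, not a guaranteed format):

    # write out enough metadata to reload this corpus later
    corpus.save("my_corpus.json")

    # conceptually, the bootstrap JSON records at least which class to
    # re-instantiate on load, roughly along the lines of:
    #   {"class": "<result of class_key()>", "saved_data": {...}}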
synchronize(max_wait, field)
By default, operations are synchronous and no additional wait is necessary. Data sources that are asynchronous (Elasticsearch) may use this function to wait for "eventual consistency".
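For an asynchronous backend, a call might look like this (the wait time and field name are illustrative):

    # give the index up to 30 seconds to make freshly written
    # "tokenized_text" values visible before reading them back
    corpus.synchronize(max_wait=30, field="tokenized_text")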
tokenize(method='simple', synchronous_wait=30, **kwargs)
Convert data to lowercase; tokenize; create bag-of-words collection. Output from this function is used as input to modeling steps (see the usage sketch after the parameter list).
- raw_data: iterable corpus object containing the text to be processed. Each iteration call should return a new document's content.
- tokenizer_method: string id of tokenizer to use. For keys, see topik.tokenizers.tokenizer_methods (which is a dictionary of classes).
- kwargs: arbitrary dictionary of extra parameters. These are passed both to the tokenizer and to the vectorizer steps.
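A minimal usage sketch (assuming corpus is an existing corpus instance; any extra keyword arguments would be forwarded to the tokenizer and vectorizer steps):

    # lowercase, tokenize, and build a bag-of-words collection;
    # the returned object is what the modeling steps consume
    tokenized_corpus = corpus.tokenize(method='simple')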
class topik.intermediaries.raw_data.DictionaryCorpus(content_field, iterable=None, generate_id=True, reference_field=None, content_filter=None)
Bases: topik.intermediaries.raw_data.CorpusInterface
filter_string
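A small sketch of building an in-memory corpus (the documents and field name are made up; the arguments follow the constructor signature above):

    from topik.intermediaries.raw_data import DictionaryCorpus

    # hold a handful of documents in memory; "text" is the content field
    corpus = DictionaryCorpus(
        content_field="text",
        iterable=[{"text": "Topic modeling finds themes in documents."},
                  {"text": "Elasticsearch can also back a corpus."}])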
class topik.intermediaries.raw_data.ElasticSearchCorpus(source, index, content_field, doc_type=None, query=None, iterable=None, filter_expression='', **kwargs)
Bases: topik.intermediaries.raw_data.CorpusInterface
filter_string
get_field(field=None)
Get a different field to iterate over, keeping all other connection details.
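For instance (the field name is a placeholder):

    # iterate a different field from the same index and connection
    title_corpus = corpus.get_field("title")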
import_from_iterable(iterable, id_field='text', batch_size=500)
Load data into Elasticsearch from an iterable (see the sketch after the parameter list).
- iterable: generally a list of dicts, but possibly a list of strings. This is your data; your dictionary structure defines the schema of the Elasticsearch index.
- id_field: string identifier of the field to hash for the content ID. For a list of dicts, a valid key in the dictionary is required. For a list of strings, a dictionary with one key, "text", is created and used.
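A hedged sketch of loading a few documents (the host, index name, and documents are placeholders, not values the library mandates):

    from topik.intermediaries.raw_data import ElasticSearchCorpus

    # connect to a running Elasticsearch instance
    corpus = ElasticSearchCorpus(source="localhost:9200",
                                 index="topik_demo",
                                 content_field="abstract")

    # index a small list of dicts; each dict's keys define the index schema,
    # and the "abstract" field is hashed to produce each document's id
    corpus.import_from_iterable(
        [{"abstract": "Latent Dirichlet allocation groups words into topics."},
         {"abstract": "Matrix factorization offers another topic model."}],
        id_field="abstract")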