topik.fileio package¶
Subpackages¶
Submodules¶
topik.fileio.base_output module¶
class topik.fileio.base_output.OutputInterface(*args, **kwargs)
    Bases: object
    save(filename, saved_data=None)
        Persist this object to disk somehow.

        You can save your data in any number of files and in any format, but at a minimum you need one JSON file that describes enough to bootstrap the loading process. Namely, it must have a key called 'class' so that, upon loading the output, the correct class can be instantiated and used to load any other data. You don't have to implement anything for saved_data, but it is stored as a key next to 'class'.
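        The bootstrap contract above can be sketched with stdlib code. This is a minimal, hypothetical subclass (`DictionaryOutput` is not part of topik), showing only the required 'class' key and the saved_data slot next to it:

        ```python
        import json
        import os
        import tempfile

        class DictionaryOutput:
            """Hypothetical OutputInterface subclass that keeps its corpora in a dict."""

            def __init__(self, data=None):
                self.data = data or {}

            def save(self, filename, saved_data=None):
                # At minimum, write one JSON file with a 'class' key so the
                # loader knows which class to instantiate; saved_data is
                # stored as a key next to 'class'.
                with open(filename, "w") as f:
                    json.dump({"class": "DictionaryOutput",
                               "saved_data": saved_data or {"data": self.data}}, f)

        output = DictionaryOutput({"doc1": "some text"})
        filename = os.path.join(tempfile.mkdtemp(), "output_meta.json")
        output.save(filename)
        with open(filename) as f:
            meta = json.load(f)
        print(meta["class"])  # prints: DictionaryOutput
        ```

        A loader would read this JSON first, look up the 'class' value, instantiate that class, and hand it the rest of the metadata.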
topik.fileio.in_document_folder module¶
topik.fileio.in_document_folder.read_document_folder(folder, content_field='text')
    Iterate over the files in a folder to retrieve the content to process and tokenize.

    Parameters:
        folder : str
            The folder containing the files you want to analyze.
        content_field : str
            The usage of 'content_field' in this source differs from most other sources: this source assumes that each file contains raw text, NOT dictionaries of categorized data. The content_field argument specifies the key under which to store each file's raw text in the returned dictionary for that document.
Examples
>>> documents = read_document_folder(
...     '{}/test_data_folder_files'.format(test_data_path))
>>> next(documents)['text'] == (
...     u"'Interstellar' was incredible. The visuals, the score, " +
...     u"the acting, were all amazing. The plot is definitely one " +
...     u"of the most original I've seen in a while.")
True
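The behavior described above (one dict per file, raw text stored under content_field) can be approximated with a few lines of stdlib code. This is a sketch, not topik's actual implementation; `read_folder` is a hypothetical name:

```python
import os
import tempfile

def read_folder(folder, content_field="text"):
    # Yield one dict per file; the whole file body becomes the value
    # stored under `content_field` (files hold raw text, not dicts).
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                yield {content_field: f.read(), "filename": path}

# Usage sketch: build a throwaway folder and read it back.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "doc1.txt"), "w", encoding="utf-8") as f:
    f.write("'Interstellar' was incredible.")
first = next(read_folder(folder))
print(first["text"])  # prints: 'Interstellar' was incredible.
```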
topik.fileio.in_elastic module¶
topik.fileio.in_elastic.read_elastic(hosts, **kwargs)
    Iterate over all documents in the specified Elasticsearch instance and index that match the specified query.

    kwargs are passed to the Elasticsearch class instantiation, and can be used to pass any additional options described at https://elasticsearch-py.readthedocs.org/en/master/

    Parameters:
        hosts : str or list
            Address of the Elasticsearch instance and index. May include port, username and password. See https://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch for all options.
        content_field : str
            The name of the field that contains the main text body of the document.
        **kwargs
            Additional keyword arguments to be passed to the Elasticsearch client instance and to the scan query. See https://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch for all client options and https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.scan for all scan options.
topik.fileio.in_json module¶
topik.fileio.in_json.read_json_stream(filename, json_prefix='item', **kwargs)
    Iterate over a JSON stream of items and get the field that contains the text to process and tokenize.
Parameters: filename : str
The filename of the json stream.
Examples
>>> documents = read_json_stream(
...     '{}/test_data_json_stream.json'.format(test_data_path))
>>> next(documents) == {
...     u'doi': u'http://dx.doi.org/10.1557/PROC-879-Z3.3',
...     u'title': u'Sol Gel Preparation of Ta2O5 Nanorods Using DNA as Structure Directing Agent',
...     u'url': u'http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=8081671&fulltextType=RA&fileId=S1946427400119281.html',
...     u'abstract': u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.',
...     u'filepath': u'abstracts/879/http%3A%2F%2Fjournals.cambridge.org%2Faction%2FdisplayAbstract%3FfromPage%3Donline%26aid%3D8081671%26fulltextType%3DRA%26fileId%3DS1946427400119281.html',
...     u'filename': '{}/test_data_json_stream.json'.format(test_data_path),
...     u'vol': u'879',
...     u'authors': [u'Humberto A. Monreala', u' Alberto M. Villafañe',
...                  u' José G. Chacón', u' Perla E. García',
...                  u'Carlos A. Martínez'],
...     u'year': u'1917'}
True
topik.fileio.in_json.read_large_json(filename, json_prefix='item', **kwargs)
    Iterate over all items and sub-items in a JSON object that match the specified prefix.

    Parameters:
        filename : str
            The filename of the large JSON file.
        json_prefix : str
            The string representation of the hierarchical prefix where the items of interest may be located within the larger JSON object.
            Try the following snippet if you need help determining the desired prefix:

                import ijson
                with open('test_data_large_json_2.json', 'r') as f:
                    parser = ijson.parse(f)
                    for prefix, event, value in parser:
                        print("prefix = '%r' || event = '%r' || value = '%r'" %
                              (prefix, event, value))
Examples
>>> documents = read_large_json(
...     '{}/test_data_large_json.json'.format(test_data_path),
...     json_prefix='item._source.isAuthorOf')
>>> next(documents) == {
...     u'a': u'ScholarlyArticle',
...     u'name': u'Path planning and formation control via potential function for UAV Quadrotor',
...     u'author': [
...         u'http://dig.isi.edu/autonomy/data/author/a.a.a.rizqi',
...         u'http://dig.isi.edu/autonomy/data/author/t.b.adji',
...         u'http://dig.isi.edu/autonomy/data/author/a.i.cahyadi'],
...     u'text': u"Potential-function-based control strategy for path planning and formation " +
...         u"control of Quadrotors is proposed in this work. The potential function is " +
...         u"used to attract the Quadrotor to the goal location as well as avoiding the " +
...         u"obstacle. The algorithm to solve the so called local minima problem by utilizing " +
...         u"the wall-following behavior is also explained. The resulted path planning via " +
...         u"potential function strategy is then used to design formation control algorithm. " +
...         u"Using the hybrid virtual leader and behavioral approach schema, the formation " +
...         u"control strategy by means of potential function is proposed. The overall strategy " +
...         u"has been successfully applied to the Quadrotor's model of Parrot AR Drone 2.0 in " +
...         u"Gazebo simulator programmed using Robot Operating System.\nAuthor(s) Rizqi, A.A.A. " +
...         u"Dept. of Electr. Eng. & Inf. Technol., Univ. Gadjah Mada, Yogyakarta, Indonesia " +
...         u"Cahyadi, A.I. ; Adji, T.B.\nReferenced Items are not available for this document.\n" +
...         u"No versions found for this document.\nStandards Dictionary Terms are available to " +
...         u"subscribers only.",
...     u'uri': u'http://dig.isi.edu/autonomy/data/article/6871517',
...     u'datePublished': u'2014',
...     'filename': '{}/test_data_large_json.json'.format(test_data_path)}
True
topik.fileio.out_elastic module¶
class topik.fileio.out_elastic.BaseElasticCorpora(instance, index, corpus_type, query=None, batch_size=1000)
    Bases: UserDict.UserDict
class topik.fileio.out_elastic.ElasticSearchOutput(source, index, hash_field=None, doc_type='continuum', query=None, iterable=None, filter_expression='', vectorized_corpora=None, tokenized_corpora=None, modeled_corpora=None, **kwargs)
    Bases: topik.fileio.base_output.OutputInterface
    filter_string
    import_from_iterable(iterable, field_to_hash='text', batch_size=500)
        Load data into Elasticsearch from an iterable.

        iterable: generally a list of dicts, but possibly a list of strings. This is your data; your dictionary structure defines the schema of the Elasticsearch index.

        field_to_hash: string identifier of the field to hash for the content ID. For a list of dicts, a valid key in the dictionary is required. For a list of strings, a dictionary with one key, "text", is created and used.
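        The normalization and content-ID hashing described above can be sketched as follows. `iter_with_ids` is a hypothetical helper (not topik's code), and SHA-1 is an assumption for the hash function:

        ```python
        import hashlib

        def iter_with_ids(iterable, field_to_hash="text"):
            # Normalize bare strings into single-key dicts, then derive a
            # stable content ID by hashing the chosen field's value.
            for item in iterable:
                if isinstance(item, str):
                    item = {"text": item}
                doc_id = hashlib.sha1(
                    item[field_to_hash].encode("utf-8")).hexdigest()
                yield doc_id, item

        docs = list(iter_with_ids(["first document", {"text": "second document"}]))
        print(docs[0][1])  # prints: {'text': 'first document'}
        ```

        Hashing the content field (rather than assigning sequential IDs) makes re-imports idempotent: the same document always maps to the same ID.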
class
topik.fileio.out_elastic.
ModeledElasticCorpora
(instance, index, corpus_type, query=None, batch_size=1000)[source]¶
topik.fileio.out_memory module¶
class topik.fileio.out_memory.GreedyDict(dict=None, **kwargs)
    Bases: UserDict.UserDict, object
class topik.fileio.out_memory.InMemoryOutput(iterable=None, hash_field=None, tokenized_corpora=None, vectorized_corpora=None, modeled_corpora=None)
topik.fileio.project module¶
class topik.fileio.project.TopikProject(project_name, output_type=None, output_args=None, **kwargs)
    Bases: object
    read_input(source, content_field, source_type='auto', **kwargs)
        Import data from an external source into Topik's internal format.
    run_model(model_name='lda', ntopics=3, **kwargs)
        Analyze vectorized text; determine topics and assign document probabilities.
    select_modeled_corpus(_id)
        When more than one model output is available (modeling was run more than once with different methods), this allows you to switch to a different data set.
    select_tokenized_corpus(_id)
        Assign the active tokenized corpus.

        When more than one tokenized corpus is available (tokenization was run more than once with different methods), this allows you to switch to a different data set.
    select_vectorized_corpus(_id)
        Assign the active vectorized corpus.

        When more than one vectorized corpus is available (vectorization was run more than once with different methods), this allows you to switch to a different data set.
    selected_filtered_corpus
        Corpus documents, potentially a subset.

        Output from the read_input step. Input to the tokenization step.
    selected_modeled_corpus
        Matrices representing the derived model.

        Output from the modeling step. Input to the visualization step.
    selected_tokenized_corpus
        Documents broken into component words. May also be transformed.

        Output from the tokenization and/or transformation steps. Input to the vectorization step.
    selected_vectorized_corpus
        Data that has been vectorized into term frequencies, TF-IDF, or another vector representation.

        Output from the vectorization step. Input to the modeling step.
    tokenize(method='simple', **kwargs)
        Break raw text into constituent terms (or collections of terms).
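        A rough approximation of a "simple" tokenization method is a lowercased split on word characters. This sketch is hypothetical and not topik's exact tokenizer:

        ```python
        import re

        def simple_tokenize(text):
            # Lowercase the text and keep runs of alphabetic characters only --
            # a rough stand-in for a "simple" tokenization method.
            return re.findall(r"[a-z]+", text.lower())

        tokens = simple_tokenize("Break raw text into constituent terms!")
        print(tokens)  # prints: ['break', 'raw', 'text', 'into', 'constituent', 'terms']
        ```

        More elaborate methods would layer on stopword removal, stemming, or collocation detection, which is why the method is selectable by name.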
topik.fileio.reader module¶
topik.fileio.reader.read_input(source, source_type='auto', folder_content_field='text', **kwargs)
    Read data from the given source into Topik's internal data structures.

    Parameters:
        source : str
            Input data. Can be a file path, directory, or server address.
        source_type : str
            'auto' tries to figure out the data type of the source. It can be manually specified instead; the options for manual specification are ['solr', 'elastic', 'json_stream', 'large_json', 'folder'].
        folder_content_field : str
            Only used for the document_folder source. This argument is used as the key (field name) under which each document's raw text is stored.
        kwargs
            Any other arguments to pass to the input parsers.

    Returns:
        iterable output object

        >>> ids, texts = zip(*list(iter(raw_data)))
Examples
>>> loaded_corpus = read_input(
...     '{}/test_data_json_stream.json'.format(test_data_path))
>>> solution_text = (
...     u'Transition metal oxides are being considered as the next generation ' +
...     u'materials in field such as electronics and advanced catalysts; ' +
...     u'between them is Tantalum (V) Oxide; however, there are few reports ' +
...     u'for the synthesis of this material at the nanometer size which could ' +
...     u'have unusual properties. Hence, in this work we present the ' +
...     u'synthesis of Ta2O5 nanorods by sol gel method using DNA as structure ' +
...     u'directing agent, the size of the nanorods was of the order of 40 to ' +
...     u'100 nm in diameter and several microns in length; this easy method ' +
...     u'can be useful in the preparation of nanomaterials for electronics, ' +
...     u'biomedical applications as well as catalysts.')
>>> solution_text == next(loaded_corpus)['abstract']
True