NANCY OTERO
  • Home
  • Blog
    • Creature: Machine Learning for Human Learning
    • Final Project Blog
    • Education and AI
    • AI without CS or Math
    • Human and AI learning
  • RESUME
  • Contact

New data & infrastructure: the Infinite Pipeline

4/5/2019

This week I used my new data, 70,000 articles (out of the 300,000) from Instructables, to play with ELMo and TF-IDF and to compute cosine similarity. The results were much better! While making my data "trainable" I read a lot of articles explaining that cleaning and massaging the data can be 80% of the work in ML. This was a great lesson.
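To make the TF-IDF and cosine similarity part concrete, here is a minimal sketch in plain Python. It is not the code I ran on the Instructables data: the three example titles are made up, the tokenizer is deliberately naive, and the smoothed IDF formula is an assumption borrowed from common practice.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical article titles, only for illustration.
docs = [
    "build a wooden bird house",
    "build a wooden dog house",
    "bake chocolate chip cookies",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

The two woodworking titles share most of their words, so they score higher than the pair with no words in common, which scores zero. ELMo embeddings can be compared with the same cosine function once each article is reduced to a single dense vector.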

Overall process:
  1. Understand the problem (see previous post)
  2. Decide what data is needed (see previous post)
  3. Research what data is actually available (see previous post)
  4. Get the data (see previous post)
  5. Understand the data
  6. Select labels
  7. Clean the data (separate or remove URLs, remove non-useful symbols, NaNs, etc.)
  8. Preprocess it (convert it to the right format)
  9. Design a dataset for it
  10. Load it and store it 
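As an illustration of step 7, here is a small sketch of the kind of cleaning I mean, using only the Python standard library. The field names (`title`, `text`), the regular expressions, and the sample rows are assumptions for the example, not the actual schema of my dataset.

```python
import math
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Keep letters, digits, whitespace and basic punctuation; drop everything else.
SYMBOL_PATTERN = re.compile(r"[^A-Za-z0-9\s.,!?'-]")


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_text(text):
    """Separate URLs from the text, strip odd symbols, collapse whitespace."""
    urls = URL_PATTERN.findall(text)
    text = URL_PATTERN.sub(" ", text)
    text = SYMBOL_PATTERN.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text, urls


def clean_records(records):
    """Drop rows with missing fields and clean the remaining ones."""
    cleaned = []
    for record in records:
        if is_missing(record.get("title")) or is_missing(record.get("text")):
            continue
        text, urls = clean_text(record["text"])
        cleaned.append({"title": record["title"].strip(), "text": text, "urls": urls})
    return cleaned


# Hypothetical rows: one usable, one with a NaN body.
raw = [
    {"title": " Bird house ", "text": "Cut the wood ★ see https://example.com/plans for plans!"},
    {"title": "Broken row", "text": float("nan")},
]
cleaned = clean_records(raw)
print(cleaned)
```

Keeping the URLs in their own field rather than throwing them away means they can still be used later, for example as a feature or for debugging.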
I used tf.Transform, TFRecord and tf.Example for the last three steps (preprocessing, designing a dataset, loading and storing it). I picked tf.Transform and TFRecord (tf.Example is the message format typically stored in TFRecord files) because I wanted to learn something that scales easily, integrates with my Colab notebook, and allows parallel processing and monitoring.
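For readers who have not seen these formats, here is a minimal sketch of the TFRecord and tf.Example half of that pipeline, assuming TensorFlow 2 in eager mode. It does not cover tf.Transform, and the feature names, sample articles, and file name are placeholders rather than my real setup.

```python
import os
import tempfile

import tensorflow as tf


def bytes_feature(value):
    """Wrap a Python string as a single-element bytes feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode("utf-8")]))


def make_example(title, text):
    """Pack one article into a tf.train.Example with two string features."""
    features = tf.train.Features(feature={
        "title": bytes_feature(title),
        "text": bytes_feature(text),
    })
    return tf.train.Example(features=features)


# Hypothetical articles, only for illustration.
articles = [
    ("Bird house", "Cut the wood and nail it together."),
    ("Cookies", "Mix the dough and bake."),
]

# Store: write each serialized example as one record in the file.
path = os.path.join(tempfile.mkdtemp(), "articles.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for title, text in articles:
        writer.write(make_example(title, text).SerializeToString())

# Load: describe the expected features, then parse each record back.
feature_spec = {
    "title": tf.io.FixedLenFeature([], tf.string),
    "text": tf.io.FixedLenFeature([], tf.string),
}


def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)


dataset = tf.data.TFRecordDataset(path).map(parse)
titles = [row["title"].numpy().decode("utf-8") for row in dataset]
print(titles)
```

Because `tf.data` reads the records lazily, the same loading code can be extended with batching or parallel `map` calls as the dataset grows.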