(MT) data cleanup analysis

For both the recommender system ("recsys") and the MT, we need to operate on clean data. Otherwise, it's garbage in, garbage out.

  • Take a sample of 100 ContentItems and look at their {content, title, summary} fields. Identify weird data (example: repeated &#8282 strings)
  • Put them into categories of artefacts
  • Decide which ones to keep and which ones to clean up
  • Think again about what to clean up with respect to the recsys
  • Implement the cleanup as a Python library
  • Run it on a DEV repco instance
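
A minimal sketch of how the sampling, categorisation and cleanup steps above could look in Python. Everything here is an assumption for illustration: the category names and regexes in ARTEFACT_PATTERNS, the function names (detect_artefacts, survey, clean_text), and the idea that each ContentItem is available as a dict with "content", "title" and "summary" keys. The real categories should come out of the manual review, not from this list.

```python
import html
import random
import re
from collections import Counter

# Hypothetical artefact categories; replace with the ones found in the manual review.
ARTEFACT_PATTERNS = {
    # Numeric references (with or without ';') and named entities ending in ';'.
    "html_entity": re.compile(r"&(?:#\d+;?|#[xX][0-9a-fA-F]+;?|[a-zA-Z][a-zA-Z0-9]*;)"),
    "html_tag": re.compile(r"<[^>]+>"),
    # The same whitespace-delimited token three or more times in a row.
    "repeated_token": re.compile(r"(?<!\S)(\S+)(?:\s+\1(?!\S)){2,}"),
    "whitespace_run": re.compile(r"[ \t]{3,}|\n{3,}"),
}


def detect_artefacts(text):
    """Return the set of category names whose pattern occurs in text."""
    return {name for name, pattern in ARTEFACT_PATTERNS.items() if pattern.search(text)}


def survey(items, fields=("content", "title", "summary"), sample_size=100, seed=0):
    """Count (field, category) hits over a random sample of items.

    items is assumed to be a list of dicts, one per ContentItem.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(items, min(sample_size, len(items)))
    counts = Counter()
    for item in sample:
        for field in fields:
            for category in detect_artefacts(item.get(field) or ""):
                counts[(field, category)] += 1
    return counts


def clean_text(text):
    """Apply one possible cleanup policy; which artefacts to keep is still open."""
    # Strip tags before decoding entities, so an escaped "&lt;b&gt;" survives as literal text.
    text = ARTEFACT_PATTERNS["html_tag"].sub(" ", text)
    text = html.unescape(text)
    # Collapse runs of the same token down to a single occurrence.
    text = ARTEFACT_PATTERNS["repeated_token"].sub(r"\1", text)
    # Normalise all whitespace to single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    demo_items = [
        {"title": "Hello &amp; welcome", "summary": "<p>Hi</p>",
         "content": "&#8282 &#8282 &#8282 text"},
    ]
    for key, count in sorted(survey(demo_items).items()):
        print(key, count)
    print(clean_text("<p>Hello &amp; welcome</p>"))
```

The survey counts give a rough view of which artefact categories show up in which field, which can feed the keep-or-clean decision. Detection and cleaning are kept separate on purpose, so the recsys and the MT could apply different cleanup policies to the same categories if that turns out to be needed.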