(MT) data cleanup analysis
For both the recommender system ("recsys") and the MT, we need to operate on clean data. Otherwise, it's garbage in, garbage out.
- Take a sample of 100 ContentItems.{content, title, summary} fields. Look at them. Identify weird data (example: repeated strings)
- Put them into categories of artefacts
- Decide which ones to keep and which ones to clean up
- Think again about what to clean up with respect to the recsys
- Implement it as a Python library
- Run it on a DEV repco instance
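
The sampling and categorisation steps above could start from something like the following Python sketch. It is only an illustration under assumptions: ContentItems are taken to be plain dicts with `content`, `title` and `summary` keys, all function names are hypothetical, and the three categories (`empty`, `repeated_string`, `html_remnant`) are placeholders until the manual review of the sample produces the real list.

```python
import random
import re
from collections import Counter

# Fields named in the task description.
FIELDS = ("content", "title", "summary")

# Assumed heuristics; to be replaced by whatever the manual review finds.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+|\n+")


def sample_items(items, k=100, seed=0):
    """Draw a reproducible random sample of at most k items."""
    items = list(items)
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def repeated_chunks(text, min_len=20):
    """Return sentence-like chunks of at least min_len characters that occur more than once."""
    # Normalise whitespace so that layout differences do not hide a repeat.
    chunks = [" ".join(c.split()) for c in SENTENCE_SPLIT.split(text)]
    counts = Counter(c for c in chunks if len(c) >= min_len)
    return [c for c, n in counts.items() if n > 1]


def classify_field(text):
    """Return the set of (hypothetical) artefact categories found in one field value."""
    found = set()
    if text is None or not text.strip():
        found.add("empty")
        return found
    if repeated_chunks(text):
        found.add("repeated_string")
    if HTML_TAG.search(text):
        found.add("html_remnant")
    return found


def tally_artefacts(items):
    """Count artefact categories per field across all items."""
    tally = Counter()
    for item in items:
        for field in FIELDS:
            for category in classify_field(item.get(field)):
                tally[(field, category)] += 1
    return tally
```

Running `tally_artefacts(sample_items(all_items))` on an export from the DEV instance would give a rough per-field count of each category, which can help decide what to keep and what to clean up. The heuristics are deliberately simple and will miss artefacts that only show up when reading the sample by hand.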