Replace non-printing (zero-width) characters in WordPress/ClassicPress posts

I discovered non-printing (aka 'zero-width', aka 'invisible') characters in many posts when trying to export all post/page content in JSON format using the Tools > Export function. Some time ago Firefox decided to make itself the default handler for JSON files and so i let it open the file, except it couldn't. I kept getting errors about stuff being "malformed". Eventually i figured out it was zero-width chars that was breaking it. Most/all of these characters were contained in excerpts i copied from other websites.

What all these "invisible" chars are used for, i don't know, but it seems they may be able to be used for tracking purposes, or possibly copyright stuff, or to hide content, or they can be perfectly benign and perhaps necessary in some cases.

Recently Eric prompted me to look into this again in the context of the Right Dao search engine. This led me to re-investigate the non-printing chars on my site, so thanks Eric!

The non-printing character that was breaking Firefox's rendering of the JSON file was \0 and so, the SQL dummy i am, i simply used phpMyAdmin to replace them with empty strings. Here's the steps i took:

  1. backup your database!
  2. in phpMyAdmin, select your WP/CP database, then the posts table
  3. in the top menu bar, select Search, then Find and Replace
  4. for Find, enter: \0
  5. for the column, select post_content
  6. make sure the regular expression option is NOT selected
  7. click Go and if all looks OK, click the Replace button

You can repeat the above, this time searching for the character, and then the the ^@ character.

If you search for other zero-width chars in your post_content column you might find some, but before removing them you should investigate carefully. In addition to \0 and , i also searched the post_content column for ^@ but didn't find any.

You can probably safety do this replacement for the comment_content column as well, selecting the comment_content column in place of post_content.

It appears there are many zero-widths in the *_meta columns and they should probably not be removed!

If you want to see if a file contains zero-width chars, you can use this:

$ grep -E '[^[:print:]]' --color='never' '<some file>' | cat -vET

Or perhaps this:

$ pcre2grep --only-matching '([^[:print:]])' '<some file>' | cat -vET

Again, don't remove the non-printing chars from the *_meta columns and be very careful if you remove them from any tables other than the posts or comments tables.

Leave a Reply

Your email address will not be published. Required fields are marked *