I discovered non-printing (aka 'zero-width', aka 'invisible') characters in many posts when trying to export all post/page content in JSON format using the Tools > Export function. Some time ago Firefox decided to make itself the default handler for JSON files and so i let it open the file, except it couldn't. I kept getting errors about stuff being "malformed". Eventually i figured out it was zero-width chars that was breaking it. Most/all of these characters were contained in excerpts i copied from other websites.
What all these "invisible" chars are used for, i don't know, but it seems they may be able to be used for tracking purposes, or possibly copyright stuff, or to hide content, or they can be perfectly benign and perhaps necessary in some cases.
Recently Eric prompted me to look into this again in the context of the Right Dao search engine. This led me to re-investigate the non-printing chars on my site, so thanks Eric!
The non-printing character that was breaking Firefox's rendering of the JSON file was \0
and so, the SQL dummy i am, i simply used phpMyAdmin to replace them with empty strings. Here's the steps i took:
- backup your database!
- in phpMyAdmin, select your WP/CP database, then the posts table
- in the top menu bar, select Search, then Find and Replace
- for Find, enter:
\0
- for the column, select
post_content
- make sure the regular expression option is NOT selected
- click Go and if all looks OK, click the Replace button
You can repeat the above, this time searching for the �
character, and then the the ^@
character.
If you search for other zero-width chars in your post_content
column you might find some, but before removing them you should investigate carefully. In addition to \0
and �
, i also searched the post_content
column for ^@
but didn't find any.
You can probably safety do this replacement for the comment_content
column as well, selecting the comment_content
column in place of post_content
.
It appears there are many zero-widths in the *_meta
columns and they should probably not be removed!
If you want to see if a file contains zero-width chars, you can use this:
$ grep -E '[^[:print:]]' --color='never' '<some file>' | cat -vET
Or perhaps this:
$ pcre2grep --only-matching '([^[:print:]])' '<some file>' | cat -vET
Again, don't remove the non-printing chars from the *_meta
columns and be very careful if you remove them from any tables other than the posts or comments tables.